Commit graph

9533 commits

Author SHA1 Message Date
Greg Kroah-Hartman
654c66e990 This is the 4.19.100 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl4xqFEACgkQONu9yGCS
 aT4pWhAAnBOvHPDEBjQzrvrhQAEZkT421Plew1Z1E/RL0CgbYisuaUMmNhGppfAu
 MFt3TXt7sbQ9XfzfRGhiuY3jv2pOBiKlu3DdmKanLGjeCXGSPFPXf+UL/m4utD2F
 /XvtWTQwOakrghfJn93iF01nF6pSc7IIe7hBgptyc0C0TZXvPy7FC03JxCiepW/8
 XEsXYbth6jTEaWwwFZf/QK9sYh7BThm/CmK8UsIdG1kZMW8I9jpAp+1m2DqCB4Je
 KACR4IEfWGEvipw8r0tCDjbSeo8LlKkbb3Kiz/yPZemECX/MEeN3ErGLQT+eut5a
 G6Bs5QJfgoYaq3/XjhRp3IQhM8OFEFe9Z0rm7mikRbKDmPp24f9cQ8OUVDEUSBWK
 zaTi5U7K5jEAI1/PNn2ZSKWMKya2AP5awX48jV2e6bHDo6AXK/JPr2omu2WqpT9f
 SbPa47cFmgR11oFWpCmLFG//sL5oB5djP1blAhnMxExVlzBpMOmJi4jCd461hTaA
 mQ+E5WDOMoRTxC2G+SoBoyYprsNK0PIPZilIs6hPz9kSJan7EXKeqfA5ZFjwY3mW
 tb+kS5XpllfWvajAkWskZpre2NITUV3ybIywm8pPaX2OSvC3zodaglHm76yj32Vk
 qkfS3psVnpQTKepip5a1y6pkB951jE4hx/zofmW3tMivQGOFSCU=
 =dyoL
 -----END PGP SIGNATURE-----

Merge 4.19.100 into android-4.19

Changes in 4.19.100
	can, slip: Protect tty->disc_data in write_wakeup and close with RCU
	firestream: fix memory leaks
	gtp: make sure only SOCK_DGRAM UDP sockets are accepted
	ipv6: sr: remove SKB_GSO_IPXIP6 on End.D* actions
	net: bcmgenet: Use netif_tx_napi_add() for TX NAPI
	net: cxgb3_main: Add CAP_NET_ADMIN check to CHELSIO_GET_MEM
	net: ip6_gre: fix moving ip6gre between namespaces
	net, ip6_tunnel: fix namespaces move
	net, ip_tunnel: fix namespaces move
	net: rtnetlink: validate IFLA_MTU attribute in rtnl_create_link()
	net_sched: fix datalen for ematch
	net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject
	net-sysfs: fix netdev_queue_add_kobject() breakage
	net-sysfs: Call dev_hold always in netdev_queue_add_kobject
	net-sysfs: Call dev_hold always in rx_queue_add_kobject
	net-sysfs: Fix reference count leak
	net: usb: lan78xx: Add .ndo_features_check
	Revert "udp: do rmem bulk free even if the rx sk queue is empty"
	tcp_bbr: improve arithmetic division in bbr_update_bw()
	tcp: do not leave dangling pointers in tp->highest_sack
	tun: add mutex_unlock() call and napi.skb clearing in tun_get_user()
	afs: Fix characters allowed into cell names
	hwmon: (adt7475) Make volt2reg return same reg as reg2volt input
	hwmon: (core) Do not use device managed functions for memory allocations
	PCI: Mark AMD Navi14 GPU rev 0xc5 ATS as broken
	tracing: trigger: Replace unneeded RCU-list traversals
	Input: keyspan-remote - fix control-message timeouts
	Revert "Input: synaptics-rmi4 - don't increment rmiaddr for SMBus transfers"
	ARM: 8950/1: ftrace/recordmcount: filter relocation types
	mmc: tegra: fix SDR50 tuning override
	mmc: sdhci: fix minimum clock rate for v3 controller
	Documentation: Document arm64 kpti control
	Input: pm8xxx-vib - fix handling of separate enable register
	Input: sur40 - fix interface sanity checks
	Input: gtco - fix endpoint sanity check
	Input: aiptek - fix endpoint sanity check
	Input: pegasus_notetaker - fix endpoint sanity check
	Input: sun4i-ts - add a check for devm_thermal_zone_of_sensor_register
	netfilter: nft_osf: add missing check for DREG attribute
	hwmon: (nct7802) Fix voltage limits to wrong registers
	scsi: RDMA/isert: Fix a recently introduced regression related to logout
	tracing: xen: Ordered comparison of function pointers
	do_last(): fetch directory ->i_mode and ->i_uid before it's too late
	net/sonic: Add mutual exclusion for accessing shared state
	net/sonic: Clear interrupt flags immediately
	net/sonic: Use MMIO accessors
	net/sonic: Fix interface error stats collection
	net/sonic: Fix receive buffer handling
	net/sonic: Avoid needless receive descriptor EOL flag updates
	net/sonic: Improve receive descriptor status flag check
	net/sonic: Fix receive buffer replenishment
	net/sonic: Quiesce SONIC before re-initializing descriptor memory
	net/sonic: Fix command register usage
	net/sonic: Fix CAM initialization
	net/sonic: Prevent tx watchdog timeout
	tracing: Use hist trigger's var_ref array to destroy var_refs
	tracing: Remove open-coding of hist trigger var_ref management
	tracing: Fix histogram code when expression has same var as value
	sd: Fix REQ_OP_ZONE_REPORT completion handling
	crypto: geode-aes - switch to skcipher for cbc(aes) fallback
	coresight: etb10: Do not call smp_processor_id from preemptible
	coresight: tmc-etf: Do not call smp_processor_id from preemptible
	libertas: Fix two buffer overflows at parsing bss descriptor
	media: v4l2-ioctl.c: zero reserved fields for S/TRY_FMT
	scsi: iscsi: Avoid potential deadlock in iscsi_if_rx func
	netfilter: ipset: use bitmap infrastructure completely
	netfilter: nf_tables: add __nft_chain_type_get()
	net/x25: fix nonblocking connect
	mm/memory_hotplug: make remove_memory() take the device_hotplug_lock
	mm, sparse: drop pgdat_resize_lock in sparse_add/remove_one_section()
	mm, sparse: pass nid instead of pgdat to sparse_add_one_section()
	drivers/base/memory.c: remove an unnecessary check on NR_MEM_SECTIONS
	mm, memory_hotplug: add nid parameter to arch_remove_memory
	mm/memory_hotplug: release memory resource after arch_remove_memory()
	drivers/base/memory.c: clean up relics in function parameters
	mm, memory_hotplug: update a comment in unregister_memory()
	mm/memory_hotplug: make unregister_memory_section() never fail
	mm/memory_hotplug: make __remove_section() never fail
	powerpc/mm: Fix section mismatch warning
	mm/memory_hotplug: make __remove_pages() and arch_remove_memory() never fail
	s390x/mm: implement arch_remove_memory()
	mm/memory_hotplug: allow arch_remove_memory() without CONFIG_MEMORY_HOTREMOVE
	drivers/base/memory: pass a block_id to init_memory_block()
	mm/memory_hotplug: create memory block devices after arch_add_memory()
	mm/memory_hotplug: remove memory block devices before arch_remove_memory()
	mm/memory_hotplug: make unregister_memory_block_under_nodes() never fail
	mm/memory_hotplug: remove "zone" parameter from sparse_remove_one_section
	mm/hotplug: kill is_dev_zone() usage in __remove_pages()
	drivers/base/node.c: simplify unregister_memory_block_under_nodes()
	mm/memunmap: don't access uninitialized memmap in memunmap_pages()
	mm/memory_hotplug: fix try_offline_node()
	mm/memory_hotplug: shrink zones when offlining memory
	Linux 4.19.100

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I1664d6d4de9358bff5632c291a26e1401ec7b5f1
2020-01-29 17:10:45 +01:00
Eric Dumazet
9bbde08258 tcp: do not leave dangling pointers in tp->highest_sack
[ Upstream commit 2bec445f9bf35e52e395b971df48d3e1e5dc704a ]

Latest commit 853697504de0 ("tcp: Fix highest_sack and highest_sack_seq")
apparently allowed syzbot to trigger various crashes in TCP stack [1]

I believe this commit only made things easier for syzbot to find
its way into triggering use-after-frees. But really the bugs
could lead to bad TCP behavior or even plain crashes even for
non malicious peers.

I have audited all calls to tcp_rtx_queue_unlink() and
tcp_rtx_queue_unlink_and_free() and made sure tp->highest_sack would be updated
if we are removing from rtx queue the skb that tp->highest_sack points to.

These updates were missing in three locations :

1) tcp_clean_rtx_queue() [This one seems quite serious,
                          I have no idea why this was not caught earlier]

2) tcp_rtx_queue_purge() [Probably not a big deal for normal operations]

3) tcp_send_synack()     [Probably not a big deal for normal operations]

[1]
BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1864 [inline]
BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1856 [inline]
BUG: KASAN: use-after-free in tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891
Read of size 4 at addr ffff8880a488d068 by task ksoftirqd/1/16

CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.5.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x197/0x210 lib/dump_stack.c:118
 print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
 __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
 kasan_report+0x12/0x20 mm/kasan/common.c:639
 __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:134
 tcp_highest_sack_seq include/net/tcp.h:1864 [inline]
 tcp_highest_sack_seq include/net/tcp.h:1856 [inline]
 tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891
 tcp_try_undo_partial net/ipv4/tcp_input.c:2730 [inline]
 tcp_fastretrans_alert+0xf74/0x23f0 net/ipv4/tcp_input.c:2847
 tcp_ack+0x2577/0x5bf0 net/ipv4/tcp_input.c:3710
 tcp_rcv_established+0x6dd/0x1e90 net/ipv4/tcp_input.c:5706
 tcp_v4_do_rcv+0x619/0x8d0 net/ipv4/tcp_ipv4.c:1619
 tcp_v4_rcv+0x307f/0x3b40 net/ipv4/tcp_ipv4.c:2001
 ip_protocol_deliver_rcu+0x5a/0x880 net/ipv4/ip_input.c:204
 ip_local_deliver_finish+0x23b/0x380 net/ipv4/ip_input.c:231
 NF_HOOK include/linux/netfilter.h:307 [inline]
 NF_HOOK include/linux/netfilter.h:301 [inline]
 ip_local_deliver+0x1e9/0x520 net/ipv4/ip_input.c:252
 dst_input include/net/dst.h:442 [inline]
 ip_rcv_finish+0x1db/0x2f0 net/ipv4/ip_input.c:428
 NF_HOOK include/linux/netfilter.h:307 [inline]
 NF_HOOK include/linux/netfilter.h:301 [inline]
 ip_rcv+0xe8/0x3f0 net/ipv4/ip_input.c:538
 __netif_receive_skb_one_core+0x113/0x1a0 net/core/dev.c:5148
 __netif_receive_skb+0x2c/0x1d0 net/core/dev.c:5262
 process_backlog+0x206/0x750 net/core/dev.c:6093
 napi_poll net/core/dev.c:6530 [inline]
 net_rx_action+0x508/0x1120 net/core/dev.c:6598
 __do_softirq+0x262/0x98c kernel/softirq.c:292
 run_ksoftirqd kernel/softirq.c:603 [inline]
 run_ksoftirqd+0x8e/0x110 kernel/softirq.c:595
 smpboot_thread_fn+0x6a3/0xa40 kernel/smpboot.c:165
 kthread+0x361/0x430 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

Allocated by task 10091:
 save_stack+0x23/0x90 mm/kasan/common.c:72
 set_track mm/kasan/common.c:80 [inline]
 __kasan_kmalloc mm/kasan/common.c:513 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:486
 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:521
 slab_post_alloc_hook mm/slab.h:584 [inline]
 slab_alloc_node mm/slab.c:3263 [inline]
 kmem_cache_alloc_node+0x138/0x740 mm/slab.c:3575
 __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:198
 alloc_skb_fclone include/linux/skbuff.h:1099 [inline]
 sk_stream_alloc_skb net/ipv4/tcp.c:875 [inline]
 sk_stream_alloc_skb+0x113/0xc90 net/ipv4/tcp.c:852
 tcp_sendmsg_locked+0xcf9/0x3470 net/ipv4/tcp.c:1282
 tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1432
 inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:672
 __sys_sendto+0x262/0x380 net/socket.c:1998
 __do_sys_sendto net/socket.c:2010 [inline]
 __se_sys_sendto net/socket.c:2006 [inline]
 __x64_sys_sendto+0xe1/0x1a0 net/socket.c:2006
 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 10095:
 save_stack+0x23/0x90 mm/kasan/common.c:72
 set_track mm/kasan/common.c:80 [inline]
 kasan_set_free_info mm/kasan/common.c:335 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:474
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:483
 __cache_free mm/slab.c:3426 [inline]
 kmem_cache_free+0x86/0x320 mm/slab.c:3694
 kfree_skbmem+0x178/0x1c0 net/core/skbuff.c:645
 __kfree_skb+0x1e/0x30 net/core/skbuff.c:681
 sk_eat_skb include/net/sock.h:2453 [inline]
 tcp_recvmsg+0x1252/0x2930 net/ipv4/tcp.c:2166
 inet_recvmsg+0x136/0x610 net/ipv4/af_inet.c:838
 sock_recvmsg_nosec net/socket.c:886 [inline]
 sock_recvmsg net/socket.c:904 [inline]
 sock_recvmsg+0xce/0x110 net/socket.c:900
 __sys_recvfrom+0x1ff/0x350 net/socket.c:2055
 __do_sys_recvfrom net/socket.c:2073 [inline]
 __se_sys_recvfrom net/socket.c:2069 [inline]
 __x64_sys_recvfrom+0xe1/0x1a0 net/socket.c:2069
 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff8880a488d040
 which belongs to the cache skbuff_fclone_cache of size 456
The buggy address is located 40 bytes inside of
 456-byte region [ffff8880a488d040, ffff8880a488d208)
The buggy address belongs to the page:
page:ffffea0002922340 refcount:1 mapcount:0 mapping:ffff88821b057000 index:0x0
raw: 00fffe0000000200 ffffea00022a5788 ffffea0002624a48 ffff88821b057000
raw: 0000000000000000 ffff8880a488d040 0000000100000006 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8880a488cf00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff8880a488cf80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff8880a488d000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                                          ^
 ffff8880a488d080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8880a488d100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: 853697504de0 ("tcp: Fix highest_sack and highest_sack_seq")
Fixes: 50895b9de1 ("tcp: highest_sack fix")
Fixes: 737ff31456 ("tcp: use sequence distance to detect reordering")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cambda Zhu <cambda@linux.alibaba.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-29 16:43:17 +01:00
Wen Yang
33dba56493 tcp_bbr: improve arithmetic division in bbr_update_bw()
[ Upstream commit 5b2f1f3070b6447b76174ea8bfb7390dc6253ebd ]

do_div() does a 64-by-32 division. Use div64_long() instead of it
if the divisor is long, to avoid truncation to 32-bit.
And as a nice side effect also cleans up the function a bit.

Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-29 16:43:17 +01:00
Paolo Abeni
4c1c35c015 Revert "udp: do rmem bulk free even if the rx sk queue is empty"
[ Upstream commit d39ca2590d10712f412add7a88e1dd467a7246f4 ]

This reverts commit 0d4a6608f6.

Willem reported that after commit 0d4a6608f6 ("udp: do rmem bulk
free even if the rx sk queue is empty") the memory allocated by
an almost idle system with many UDP sockets can grow a lot.

For stable kernel keep the solution as simple as possible and revert
the offending commit.

Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Diagnosed-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: 0d4a6608f6 ("udp: do rmem bulk free even if the rx sk queue is empty")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-29 16:43:17 +01:00
William Dauchy
1d3b53f716 net, ip_tunnel: fix namespaces move
[ Upstream commit d0f418516022c32ecceaf4275423e5bd3f8743a9 ]

in the same manner as commit 690afc165bb3 ("net: ip6_gre: fix moving
ip6gre between namespaces"), fix namespace moving as it was broken since
commit 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.").
Indeed, the ip6_gre commit removed the local flag for collect_md
condition, so there is no reason to keep it for ip_gre/ip_tunnel.

this patch will fix both ip_tunnel and ip_gre modules.

Fixes: 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: William Dauchy <w.dauchy@criteo.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-29 16:43:15 +01:00
Greg Kroah-Hartman
1fca2c99f4 This is the 4.19.99 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl4u6tsACgkQONu9yGCS
 aT693A//TExeDRnNnf+2v4TJorylyRr17BMxk/Ie2L5E6d2n/RWodsrOThAPU9tx
 5alNUkXCT8Jd31BUVnUoPoAQ4zSymSVi++XEf05wDeO0tQ982IESGaLmu9EC1uMF
 nnM5y4IdRYmFI1Zji4h5vRJckoYUlB6Mdg4BgMr4Q1KX7RkZYfe6bjs7DwM/uyMx
 jVXdFaQBD1H6F5W6A+GmgUZ36g9uNqzcBxxWwv5URj+q816NdI4bsxIJMF0v0WC+
 S54fmpS07QWIYKKsQBUepeSgEF4ECESOE2VoF1ICcnfakdPnDBmNgyPJPSrLmVf+
 itRUxoH1MewaOvoJrv+xsGBPmM29LcKH2oBmj5DR2Xstp7ACPs+OtXJEU9dUTDN4
 NhaSts5fIp0f4Y5mMn508pDUwYDAWDt99ZJWdx6aK/TRyUsHBgpxBQDt37BE3U5W
 PCBnObNe2b2KDAsVXLjX5iDYoA0+usFreveMo8uEP+ohfh0ANvJlRkzedYw7NquI
 ZCcT+I1P9q8aa0528tR332VLrQeYg+kG6LVi2kAabmRA/VtEsT0w90MY/eo2vuTU
 WlPmbs2yerv2HTm050e6MOgBZfPh7wP/FpbjsSXufj7EDywlfxF+1hXdwfrpPJeN
 fN3g0kepeUp7+kLzO40FLam/z5ndjAUhyN2SBaPzGsXjMkZdETk=
 =zvlh
 -----END PGP SIGNATURE-----

Merge 4.19.99 into android-4.19

Changes in 4.19.99
	Revert "efi: Fix debugobjects warning on 'efi_rts_work'"
	xfs: Sanity check flags of Q_XQUOTARM call
	i2c: stm32f7: rework slave_id allocation
	i2c: i2c-stm32f7: fix 10-bits check in slave free id search loop
	mfd: intel-lpss: Add default I2C device properties for Gemini Lake
	SUNRPC: Fix svcauth_gss_proxy_init()
	powerpc/pseries: Enable support for ibm,drc-info property
	powerpc/archrandom: fix arch_get_random_seed_int()
	tipc: update mon's self addr when node addr generated
	tipc: fix wrong timeout input for tipc_wait_for_cond()
	mt7601u: fix bbp version check in mt7601u_wait_bbp_ready
	crypto: sun4i-ss - fix big endian issues
	perf map: No need to adjust the long name of modules
	soc: aspeed: Fix snoop_file_poll()'s return type
	watchdog: sprd: Fix the incorrect pointer getting from driver data
	ipmi: Fix memory leak in __ipmi_bmc_register
	drm/sti: do not remove the drm_bridge that was never added
	ARM: dts: at91: nattis: set the PRLUD and HIPOW signals low
	ARM: dts: at91: nattis: make the SD-card slot work
	ixgbe: don't clear IPsec sa counters on HW clearing
	drm/virtio: fix bounds check in virtio_gpu_cmd_get_capset()
	iio: fix position relative kernel version
	apparmor: Fix network performance issue in aa_label_sk_perm
	ALSA: hda: fix unused variable warning
	apparmor: don't try to replace stale label in ptrace access check
	ARM: qcom_defconfig: Enable MAILBOX
	firmware: coreboot: Let OF core populate platform device
	PCI: iproc: Remove PAXC slot check to allow VF support
	bridge: br_arp_nd_proxy: set icmp6_router if neigh has NTF_ROUTER
	drm/hisilicon: hibmc: Don't overwrite fb helper surface depth
	signal/ia64: Use the generic force_sigsegv in setup_frame
	signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn
	ASoC: wm9712: fix unused variable warning
	mailbox: mediatek: Add check for possible failure of kzalloc
	IB/rxe: replace kvfree with vfree
	IB/hfi1: Add mtu check for operational data VLs
	genirq/debugfs: Reinstate full OF path for domain name
	usb: dwc3: add EXTCON dependency for qcom
	usb: gadget: fsl_udc_core: check allocation return value and cleanup on failure
	cfg80211: regulatory: make initialization more robust
	mei: replace POLL* with EPOLL* for write queues.
	drm/msm: fix unsigned comparison with less than zero
	of: Fix property name in of_node_get_device_type
	ALSA: usb-audio: update quirk for B&W PX to remove microphone
	iwlwifi: nvm: get num of hw addresses from firmware
	staging: comedi: ni_mio_common: protect register write overflow
	netfilter: nft_osf: usage from output path is not valid
	pwm: lpss: Release runtime-pm reference from the driver's remove callback
	powerpc/pseries/memory-hotplug: Fix return value type of find_aa_index
	rtlwifi: rtl8821ae: replace _rtl8821ae_mrate_idx_to_arfr_id with generic version
	RDMA/bnxt_re: Add missing spin lock initialization
	netfilter: nf_flow_table: do not remove offload when other netns's interface is down
	powerpc/kgdb: add kgdb_arch_set/remove_breakpoint()
	tipc: eliminate message disordering during binding table update
	net: socionext: Add dummy PHY register read in phy_write()
	drm/sun4i: hdmi: Fix double flag assignation
	net: hns3: add error handler for hns3_nic_init_vector_data()
	mlxsw: reg: QEEC: Add minimum shaper fields
	mlxsw: spectrum: Set minimum shaper on MC TCs
	NTB: ntb_hw_idt: replace IS_ERR_OR_NULL with regular NULL checks
	ASoC: wm97xx: fix uninitialized regmap pointer problem
	ARM: dts: bcm283x: Correct mailbox register sizes
	pcrypt: use format specifier in kobject_add
	ASoC: sun8i-codec: add missing route for ADC
	pinctrl: meson-gxl: remove invalid GPIOX tsin_a pins
	bus: ti-sysc: Add mcasp optional clocks flag
	exportfs: fix 'passing zero to ERR_PTR()' warning
	drm: rcar-du: Fix the return value in case of error in 'rcar_du_crtc_set_crc_source()'
	drm: rcar-du: Fix vblank initialization
	net: always initialize pagedlen
	drm/dp_mst: Skip validating ports during destruction, just ref
	arm64: dts: meson-gx: Add hdmi_5v regulator as hdmi tx supply
	arm64: dts: renesas: r8a7795-es1: Add missing power domains to IPMMU nodes
	net: phy: Fix not to call phy_resume() if PHY is not attached
	IB/hfi1: Correctly process FECN and BECN in packets
	OPP: Fix missing debugfs supply directory for OPPs
	IB/rxe: Fix incorrect cache cleanup in error flow
	mailbox: ti-msgmgr: Off by one in ti_msgmgr_of_xlate()
	staging: bcm2835-camera: Abort probe if there is no camera
	staging: bcm2835-camera: fix module autoloading
	switchtec: Remove immediate status check after submitting MRPC command
	ipv6: add missing tx timestamping on IPPROTO_RAW
	pinctrl: sh-pfc: r8a7740: Add missing REF125CK pin to gether_gmii group
	pinctrl: sh-pfc: r8a7740: Add missing LCD0 marks to lcd0_data24_1 group
	pinctrl: sh-pfc: r8a7791: Remove bogus ctrl marks from qspi_data4_b group
	pinctrl: sh-pfc: r8a7791: Remove bogus marks from vin1_b_data18 group
	pinctrl: sh-pfc: sh73a0: Add missing TO pin to tpu4_to3 group
	pinctrl: sh-pfc: r8a7794: Remove bogus IPSR9 field
	pinctrl: sh-pfc: r8a77970: Add missing MOD_SEL0 field
	pinctrl: sh-pfc: r8a77980: Add missing MOD_SEL0 field
	pinctrl: sh-pfc: sh7734: Add missing IPSR11 field
	pinctrl: sh-pfc: r8a77995: Remove bogus SEL_PWM[0-3]_3 configurations
	pinctrl: sh-pfc: sh7269: Add missing PCIOR0 field
	pinctrl: sh-pfc: sh7734: Remove bogus IPSR10 value
	net: hns3: fix error handling int the hns3_get_vector_ring_chain
	vxlan: changelink: Fix handling of default remotes
	Input: nomadik-ske-keypad - fix a loop timeout test
	fork,memcg: fix crash in free_thread_stack on memcg charge fail
	clk: highbank: fix refcount leak in hb_clk_init()
	clk: qoriq: fix refcount leak in clockgen_init()
	clk: ti: fix refcount leak in ti_dt_clocks_register()
	clk: socfpga: fix refcount leak
	clk: samsung: exynos4: fix refcount leak in exynos4_get_xom()
	clk: imx6q: fix refcount leak in imx6q_clocks_init()
	clk: imx6sx: fix refcount leak in imx6sx_clocks_init()
	clk: imx7d: fix refcount leak in imx7d_clocks_init()
	clk: vf610: fix refcount leak in vf610_clocks_init()
	clk: armada-370: fix refcount leak in a370_clk_init()
	clk: kirkwood: fix refcount leak in kirkwood_clk_init()
	clk: armada-xp: fix refcount leak in axp_clk_init()
	clk: mv98dx3236: fix refcount leak in mv98dx3236_clk_init()
	clk: dove: fix refcount leak in dove_clk_init()
	MIPS: BCM63XX: drop unused and broken DSP platform device
	arm64: defconfig: Re-enable bcm2835-thermal driver
	remoteproc: qcom: q6v5-mss: Add missing clocks for MSM8996
	remoteproc: qcom: q6v5-mss: Add missing regulator for MSM8996
	drm: Fix error handling in drm_legacy_addctx
	ARM: dts: r8a7743: Remove generic compatible string from iic3
	drm/etnaviv: fix some off by one bugs
	drm/fb-helper: generic: Fix setup error path
	fork, memcg: fix cached_stacks case
	IB/usnic: Fix out of bounds index check in query pkey
	RDMA/ocrdma: Fix out of bounds index check in query pkey
	RDMA/qedr: Fix out of bounds index check in query pkey
	drm/shmob: Fix return value check in shmob_drm_probe
	arm64: dts: apq8016-sbc: Increase load on l11 for SDCARD
	spi: cadence: Correct initialisation of runtime PM
	RDMA/iw_cxgb4: Fix the unchecked ep dereference
	net: phy: micrel: set soft_reset callback to genphy_soft_reset for KSZ9031
	memory: tegra: Don't invoke Tegra30+ specific memory timing setup on Tegra20
	drm/etnaviv: NULL vs IS_ERR() buf in etnaviv_core_dump()
	media: s5p-jpeg: Correct step and max values for V4L2_CID_JPEG_RESTART_INTERVAL
	kbuild: mark prepare0 as PHONY to fix external module build
	crypto: brcm - Fix some set-but-not-used warning
	crypto: tgr192 - fix unaligned memory access
	ASoC: imx-sgtl5000: put of nodes if finding codec fails
	IB/iser: Pass the correct number of entries for dma mapped SGL
	net: hns3: fix wrong combined count returned by ethtool -l
	media: tw9910: Unregister subdevice with v4l2-async
	IB/mlx5: Don't override existing ip_protocol
	rtc: cmos: ignore bogus century byte
	spi/topcliff_pch: Fix potential NULL dereference on allocation error
	net: hns3: fix bug of ethtool_ops.get_channels for VF
	ARM: dts: sun8i-a23-a33: Move NAND controller device node to sort by address
	clk: sunxi-ng: sun8i-a23: Enable PLL-MIPI LDOs when ungating it
	iwlwifi: mvm: avoid possible access out of array.
	net/mlx5: Take lock with IRQs disabled to avoid deadlock
	ip_tunnel: Fix route fl4 init in ip_md_tunnel_xmit
	arm64: dts: allwinner: h6: Move GIC device node fix base address ordering
	iwlwifi: mvm: fix A-MPDU reference assignment
	bus: ti-sysc: Fix timer handling with drop pm_runtime_irq_safe()
	tty: ipwireless: Fix potential NULL pointer dereference
	driver: uio: fix possible memory leak in __uio_register_device
	driver: uio: fix possible use-after-free in __uio_register_device
	crypto: crypto4xx - Fix wrong ppc4xx_trng_probe()/ppc4xx_trng_remove() arguments
	driver core: Fix DL_FLAG_AUTOREMOVE_SUPPLIER device link flag handling
	driver core: Avoid careless re-use of existing device links
	driver core: Do not resume suppliers under device_links_write_lock()
	driver core: Fix handling of runtime PM flags in device_link_add()
	driver core: Do not call rpm_put_suppliers() in pm_runtime_drop_link()
	ARM: dts: lpc32xx: add required clocks property to keypad device node
	ARM: dts: lpc32xx: reparent keypad controller to SIC1
	ARM: dts: lpc32xx: fix ARM PrimeCell LCD controller variant
	ARM: dts: lpc32xx: fix ARM PrimeCell LCD controller clocks property
	ARM: dts: lpc32xx: phy3250: fix SD card regulator voltage
	drm/xen-front: Fix mmap attributes for display buffers
	iwlwifi: mvm: fix RSS config command
	staging: most: cdev: add missing check for cdev_add failure
	clk: ingenic: jz4740: Fix gating of UDC clock
	rtc: ds1672: fix unintended sign extension
	thermal: mediatek: fix register index error
	arm64: dts: msm8916: remove bogus argument to the cpu clock
	ath10k: fix dma unmap direction for management frames
	net: phy: fixed_phy: Fix fixed_phy not checking GPIO
	rtc: ds1307: rx8130: Fix alarm handling
	net/smc: original socket family in inet_sock_diag
	rtc: 88pm860x: fix unintended sign extension
	rtc: 88pm80x: fix unintended sign extension
	rtc: pm8xxx: fix unintended sign extension
	fbdev: chipsfb: remove set but not used variable 'size'
	iw_cxgb4: use tos when importing the endpoint
	iw_cxgb4: use tos when finding ipv6 routes
	ipmi: kcs_bmc: handle devm_kasprintf() failure case
	xsk: add missing smp_rmb() in xsk_mmap
	drm/etnaviv: potential NULL dereference
	ntb_hw_switchtec: debug print 64bit aligned crosslink BAR Numbers
	ntb_hw_switchtec: NT req id mapping table register entry number should be 512
	pinctrl: sh-pfc: emev2: Add missing pinmux functions
	pinctrl: sh-pfc: r8a7791: Fix scifb2_data_c pin group
	pinctrl: sh-pfc: r8a7792: Fix vin1_data18_b pin group
	pinctrl: sh-pfc: sh73a0: Fix fsic_spdif pin groups
	RDMA/mlx5: Fix memory leak in case we fail to add an IB device
	driver core: Fix possible supplier PM-usage counter imbalance
	PCI: endpoint: functions: Use memcpy_fromio()/memcpy_toio()
	usb: phy: twl6030-usb: fix possible use-after-free on remove
	block: don't use bio->bi_vcnt to figure out segment number
	keys: Timestamp new keys
	net: dsa: b53: Fix default VLAN ID
	net: dsa: b53: Properly account for VLAN filtering
	net: dsa: b53: Do not program CPU port's PVID
	mt76: usb: fix possible memory leak in mt76u_buf_free
	media: sh: migor: Include missing dma-mapping header
	vfio_pci: Enable memory accesses before calling pci_map_rom
	hwmon: (pmbus/tps53679) Fix driver info initialization in probe routine
	mdio_bus: Fix PTR_ERR() usage after initialization to constant
	KVM: PPC: Release all hardware TCE tables attached to a group
	staging: r8822be: check kzalloc return or bail
	dmaengine: mv_xor: Use correct device for DMA API
	cdc-wdm: pass return value of recover_from_urb_loss
	brcmfmac: create debugfs files for bus-specific layer
	regulator: pv88060: Fix array out-of-bounds access
	regulator: pv88080: Fix array out-of-bounds access
	regulator: pv88090: Fix array out-of-bounds access
	net: dsa: qca8k: Enable delay for RGMII_ID mode
	net/mlx5: Delete unused FPGA QPN variable
	drm/nouveau/bios/ramcfg: fix missing parentheses when calculating RON
	drm/nouveau/pmu: don't print reply values if exec is false
	drm/nouveau: fix missing break in switch statement
	driver core: Fix PM-runtime for links added during consumer probe
	ASoC: qcom: Fix of-node refcount unbalance in apq8016_sbc_parse_of()
	net: dsa: fix unintended change of bridge interface STP state
	fs/nfs: Fix nfs_parse_devname to not modify it's argument
	staging: rtlwifi: Use proper enum for return in halmac_parse_psd_data_88xx
	powerpc/64s: Fix logic when handling unknown CPU features
	NFS: Fix a soft lockup in the delegation recovery code
	perf: Copy parent's address filter offsets on clone
	perf, pt, coresight: Fix address filters for vmas with non-zero offset
	clocksource/drivers/sun5i: Fail gracefully when clock rate is unavailable
	clocksource/drivers/exynos_mct: Fix error path in timer resources initialization
	platform/x86: wmi: fix potential null pointer dereference
	NFS/pnfs: Bulk destroy of layouts needs to be safe w.r.t. umount
	mmc: sdhci-brcmstb: handle mmc_of_parse() errors during probe
	iommu: Fix IOMMU debugfs fallout
	ARM: 8847/1: pm: fix HYP/SVC mode mismatch when MCPM is used
	ARM: 8848/1: virt: Align GIC version check with arm64 counterpart
	ARM: 8849/1: NOMMU: Fix encodings for PMSAv8's PRBAR4/PRLAR4
	regulator: wm831x-dcdc: Fix list of wm831x_dcdc_ilim from mA to uA
	ath10k: Fix length of wmi tlv command for protected mgmt frames
	netfilter: nft_set_hash: fix lookups with fixed size hash on big endian
	netfilter: nft_set_hash: bogus element self comparison from deactivation path
	net: sched: act_csum: Fix csum calc for tagged packets
	hwrng: bcm2835 - fix probe as platform device
	iommu/vt-d: Fix NULL pointer reference in intel_svm_bind_mm()
	NFS: Add missing encode / decode sequence_maxsz to v4.2 operations
	NFSv4/flexfiles: Fix invalid deref in FF_LAYOUT_DEVID_NODE()
	net: aquantia: fixed instack structure overflow
	powerpc/mm: Check secondary hash page table
	media: dvb/earth-pt1: fix wrong initialization for demod blocks
	rbd: clear ->xferred on error from rbd_obj_issue_copyup()
	PCI: Fix "try" semantics of bus and slot reset
	nios2: ksyms: Add missing symbol exports
	x86/mm: Remove unused variable 'cpu'
	scsi: megaraid_sas: reduce module load time
	nfp: fix simple vNIC mailbox length
	drivers/rapidio/rio_cm.c: fix potential oops in riocm_ch_listen()
	xen, cpu_hotplug: Prevent an out of bounds access
	net/mlx5: Fix multiple updates of steering rules in parallel
	net/mlx5e: IPoIB, Fix RX checksum statistics update
	net: sh_eth: fix a missing check of of_get_phy_mode
	regulator: lp87565: Fix missing register for LP87565_BUCK_0
	soc: amlogic: gx-socinfo: Add mask for each SoC packages
	media: ivtv: update *pos correctly in ivtv_read_pos()
	media: cx18: update *pos correctly in cx18_read_pos()
	media: wl128x: Fix an error code in fm_download_firmware()
	media: cx23885: check allocation return
	regulator: tps65086: Fix tps65086_ldoa1_ranges for selector 0xB
	crypto: ccree - reduce kernel stack usage with clang
	jfs: fix bogus variable self-initialization
	tipc: tipc clang warning
	m68k: mac: Fix VIA timer counter accesses
	ARM: dts: sun8i: a33: Reintroduce default pinctrl muxing
	arm64: dts: allwinner: a64: Add missing PIO clocks
	ARM: dts: sun9i: optimus: Fix fixed-regulators
	net: phy: don't clear BMCR in genphy_soft_reset
	ARM: OMAP2+: Fix potentially uninitialized return value for _setup_reset()
	net: dsa: Avoid null pointer when failing to connect to PHY
	soc: qcom: cmd-db: Fix an error code in cmd_db_dev_probe()
	media: davinci-isif: avoid uninitialized variable use
	media: tw5864: Fix possible NULL pointer dereference in tw5864_handle_frame
	spi: tegra114: clear packed bit for unpacked mode
	spi: tegra114: fix for unpacked mode transfers
	spi: tegra114: terminate dma and reset on transfer timeout
	spi: tegra114: flush fifos
	spi: tegra114: configure dma burst size to fifo trig level
	bus: ti-sysc: Fix sysc_unprepare() when no clocks have been allocated
	soc/fsl/qe: Fix an error code in qe_pin_request()
	spi: bcm2835aux: fix driver to not allow 65535 (=-1) cs-gpios
	drm/fb-helper: generic: Call drm_client_add() after setup is done
	arm64/vdso: don't leak kernel addresses
	rtc: Fix timestamp value for RTC_TIMESTAMP_BEGIN_1900
	rtc: mt6397: Don't call irq_dispose_mapping.
	ehea: Fix a copy-paste err in ehea_init_port_res
	bpf: Add missed newline in verifier verbose log
	drm/vmwgfx: Remove set but not used variable 'restart'
	scsi: qla2xxx: Unregister chrdev if module initialization fails
	of: use correct function prototype for of_overlay_fdt_apply()
	net/sched: cbs: fix port_rate miscalculation
	clk: qcom: Skip halt checks on gcc_pcie_0_pipe_clk for 8998
	ACPI: button: reinitialize button state upon resume
	firmware: arm_scmi: fix of_node leak in scmi_mailbox_check
	rxrpc: Fix detection of out of order acks
	scsi: target/core: Fix a race condition in the LUN lookup code
	brcmfmac: fix leak of mypkt on error return path
	ARM: pxa: ssp: Fix "WARNING: invalid free of devm_ allocated data"
	PCI: rockchip: Fix rockchip_pcie_ep_assert_intx() bitwise operations
	net: hns3: fix for vport->bw_limit overflow problem
	hwmon: (w83627hf) Use request_muxed_region for Super-IO accesses
	perf/core: Fix the address filtering fix
	staging: android: vsoc: fix copy_from_user overrun
	PCI: dwc: Fix dw_pcie_ep_find_capability() to return correct capability offset
	soc: amlogic: meson-gx-pwrc-vpu: Fix power on/off register bitmask
	platform/x86: alienware-wmi: fix kfree on potentially uninitialized pointer
	tipc: set sysctl_tipc_rmem and named_timeout right range
	usb: typec: tcpm: Notify the tcpc to start connection-detection for SRPs
	selftests/ipc: Fix msgque compiler warnings
	net: hns3: fix loop condition of hns3_get_tx_timeo_queue_info()
	powerpc: vdso: Make vdso32 installation conditional in vdso_install
	ARM: dts: ls1021: Fix SGMII PCS link remaining down after PHY disconnect
	media: ov2659: fix unbalanced mutex_lock/unlock
	6lowpan: Off by one handling ->nexthdr
	dmaengine: axi-dmac: Don't check the number of frames for alignment
	ALSA: usb-audio: Handle the error from snd_usb_mixer_apply_create_quirk()
	afs: Fix AFS file locking to allow fine grained locks
	afs: Further fix file locking
	NFS: Don't interrupt file writeout due to fatal errors
	coresight: catu: fix clang build warning
	s390/kexec_file: Fix potential segment overlap in ELF loader
	irqchip/gic-v3-its: fix some definitions of inner cacheability attributes
	scsi: qla2xxx: Fix a format specifier
	scsi: qla2xxx: Fix error handling in qlt_alloc_qfull_cmd()
	scsi: qla2xxx: Avoid that qlt_send_resp_ctio() corrupts memory
	KVM: PPC: Book3S HV: Fix lockdep warning when entering the guest
	netfilter: nft_flow_offload: add entry to flowtable after confirmation
	PCI: iproc: Enable iProc config read for PAXBv2
	ARM: dts: logicpd-som-lv: Fix MMC1 card detect
	packet: in recvmsg msg_name return at least sizeof sockaddr_ll
	ASoC: fix valid stream condition
	usb: gadget: fsl: fix link error against usb-gadget module
	dwc2: gadget: Fix completed transfer size calculation in DDMA
	IB/mlx5: Add missing XRC options to QP optional params mask
	RDMA/rxe: Consider skb reserve space based on netdev of GID
	iommu/vt-d: Make kernel parameter igfx_off work with vIOMMU
	net: ena: fix swapped parameters when calling ena_com_indirect_table_fill_entry
	net: ena: fix: Free napi resources when ena_up() fails
	net: ena: fix incorrect test of supported hash function
	net: ena: fix ena_com_fill_hash_function() implementation
	dmaengine: tegra210-adma: restore channel status
	watchdog: rtd119x_wdt: Fix remove function
	mmc: core: fix possible use after free of host
	lightnvm: pblk: fix lock order in pblk_rb_tear_down_check
	ath10k: Fix encoding for protected management frames
	afs: Fix the afs.cell and afs.volume xattr handlers
	vfio/mdev: Avoid release parent reference during error path
	vfio/mdev: Follow correct remove sequence
	vfio/mdev: Fix aborting mdev child device removal if one fails
	l2tp: Fix possible NULL pointer dereference
	ALSA: aica: Fix a long-time build breakage
	media: omap_vout: potential buffer overflow in vidioc_dqbuf()
	media: davinci/vpbe: array underflow in vpbe_enum_outputs()
	platform/x86: alienware-wmi: printing the wrong error code
	crypto: caam - fix caam_dump_sg that iterates through scatterlist
	netfilter: ebtables: CONFIG_COMPAT: reject trailing data after last rule
	pwm: meson: Consider 128 a valid pre-divider
	pwm: meson: Don't disable PWM when setting duty repeatedly
	ARM: riscpc: fix lack of keyboard interrupts after irq conversion
	nfp: bpf: fix static check error through tightening shift amount adjustment
	kdb: do a sanity check on the cpu in kdb_per_cpu()
	netfilter: nf_tables: correct NFT_LOGLEVEL_MAX value
	backlight: lm3630a: Return 0 on success in update_status functions
	thermal: rcar_gen3_thermal: fix interrupt type
	thermal: cpu_cooling: Actually trace CPU load in thermal_power_cpu_get_power
	EDAC/mc: Fix edac_mc_find() in case no device is found
	afs: Fix key leak in afs_release() and afs_evict_inode()
	afs: Don't invalidate callback if AFS_VNODE_DIR_VALID not set
	afs: Fix lock-wait/callback-break double locking
	afs: Fix double inc of vnode->cb_break
	ARM: dts: sun8i-h3: Fix wifi in Beelink X2 DT
	clk: meson: gxbb: no spread spectrum on mpll0
	clk: meson: axg: spread spectrum is on mpll2
	dmaengine: tegra210-adma: Fix crash during probe
	arm64: dts: meson: libretech-cc: set eMMC as removable
	RDMA/qedr: Fix incorrect device rate.
	spi: spi-fsl-spi: call spi_finalize_current_message() at the end
	crypto: ccp - fix AES CFB error exposed by new test vectors
	crypto: ccp - Fix 3DES complaint from ccp-crypto module
	serial: stm32: fix word length configuration
	serial: stm32: fix rx error handling
	serial: stm32: fix rx data length when parity enabled
	serial: stm32: fix transmit_chars when tx is stopped
	serial: stm32: Add support of TC bit status check
	serial: stm32: fix wakeup source initialization
	misc: sgi-xp: Properly initialize buf in xpc_get_rsvd_page_pa
	iommu: Add missing new line for dma type
	iommu: Use right function to get group for device
	signal/bpfilter: Fix bpfilter_kernl to use send_sig not force_sig
	signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig
	inet: frags: call inet_frags_fini() after unregister_pernet_subsys()
	net: hns3: fix a memory leak issue for hclge_map_unmap_ring_to_vf_vector
	crypto: talitos - fix AEAD processing.
	netvsc: unshare skb in VF rx handler
	net: core: support XDP generic on stacked devices.
	RDMA/uverbs: check for allocation failure in uapi_add_elm()
	net: don't clear sock->sk early to avoid trouble in strparser
	phy: qcom-qusb2: fix missing assignment of ret when calling clk_prepare_enable
	cpufreq: brcmstb-avs-cpufreq: Fix initial command check
	cpufreq: brcmstb-avs-cpufreq: Fix types for voltage/frequency
	clk: sunxi-ng: sun50i-h6-r: Fix incorrect W1 clock gate register
	media: vivid: fix incorrect assignment operation when setting video mode
	crypto: inside-secure - fix zeroing of the request in ahash_exit_inv
	crypto: inside-secure - fix queued len computation
	arm64: dts: renesas: ebisu: Remove renesas, no-ether-link property
	mpls: fix warning with multi-label encap
	serial: stm32: fix a recursive locking in stm32_config_rs485
	arm64: dts: meson-gxm-khadas-vim2: fix gpio-keys-polled node
	arm64: dts: meson-gxm-khadas-vim2: fix Bluetooth support
	iommu/vt-d: Duplicate iommu_resv_region objects per device list
	phy: usb: phy-brcm-usb: Remove sysfs attributes upon driver removal
	firmware: arm_scmi: fix bitfield definitions for SENSOR_DESC attributes
	firmware: arm_scmi: update rate_discrete in clock_describe_rates_get
	ntb_hw_switchtec: potential shift wrapping bug in switchtec_ntb_init_sndev()
	ASoC: meson: axg-tdmin: right_j is not supported
	ASoC: meson: axg-tdmout: right_j is not supported
	qed: iWARP - Use READ_ONCE and smp_store_release to access ep->state
	qed: iWARP - fix uninitialized callback
	powerpc/cacheinfo: add cacheinfo_teardown, cacheinfo_rebuild
	powerpc/pseries/mobility: rebuild cacheinfo hierarchy post-migration
	bpf: fix the check that forwarding is enabled in bpf_ipv6_fib_lookup
	IB/hfi1: Handle port down properly in pio
	drm/msm/mdp5: Fix mdp5_cfg_init error return
	net: netem: fix backlog accounting for corrupted GSO frames
	net/udp_gso: Allow TX timestamp with UDP GSO
	net/af_iucv: build proper skbs for HiperTransport
	net/af_iucv: always register net_device notifier
	ASoC: ti: davinci-mcasp: Fix slot mask settings when using multiple AXRs
	rtc: pcf8563: Fix interrupt trigger method
	rtc: pcf8563: Clear event flags and disable interrupts before requesting irq
	ARM: dts: iwg20d-q7-common: Fix SDHI1 VccQ regularor
	net/sched: cbs: Fix error path of cbs_module_init
	arm64: dts: allwinner: h6: Pine H64: Add interrupt line for RTC
	drm/msm/a3xx: remove TPL1 regs from snapshot
	ip6_fib: Don't discard nodes with valid routing information in fib6_locate_1()
	perf/ioctl: Add check for the sample_period value
	dmaengine: hsu: Revert "set HSU_CH_MTSR to memory width"
	clk: qcom: Fix -Wunused-const-variable
	nvmem: imx-ocotp: Ensure WAIT bits are preserved when setting timing
	nvmem: imx-ocotp: Change TIMING calculation to u-boot algorithm
	tools: bpftool: use correct argument in cgroup errors
	backlight: pwm_bl: Fix heuristic to determine number of brightness levels
	fork,memcg: alloc_thread_stack_node needs to set tsk->stack
	bnxt_en: Fix ethtool selftest crash under error conditions.
	bnxt_en: Suppress error messages when querying DSCP DCB capabilities.
	iommu/amd: Make iommu_disable safer
	mfd: intel-lpss: Release IDA resources
	rxrpc: Fix uninitialized error code in rxrpc_send_data_packet()
	xprtrdma: Fix use-after-free in rpcrdma_post_recvs
	um: Fix IRQ controller regression on console read
	PM: ACPI/PCI: Resume all devices during hibernation
	ACPI: PM: Simplify and fix PM domain hibernation callbacks
	ACPI: PM: Introduce "poweroff" callbacks for ACPI PM domain and LPSS
	fsi/core: Fix error paths on CFAM init
	devres: allow const resource arguments
	fsi: sbefifo: Don't fail operations when in SBE IPL state
	RDMA/hns: Fixs hw access invalid dma memory error
	PCI: mobiveil: Remove the flag MSI_FLAG_MULTI_PCI_MSI
	PCI: mobiveil: Fix devfn check in mobiveil_pcie_valid_device()
	PCI: mobiveil: Fix the valid check for inbound and outbound windows
	ceph: fix "ceph.dir.rctime" vxattr value
	net: pasemi: fix an use-after-free in pasemi_mac_phy_init()
	net/tls: fix socket wmem accounting on fallback with netem
	x86/pgtable/32: Fix LOWMEM_PAGES constant
	xdp: fix possible cq entry leak
	ARM: stm32: use "depends on" instead of "if" after prompt
	scsi: libfc: fix null pointer dereference on a null lport
	xfrm interface: ifname may be wrong in logs
	drm/panel: make drm_panel.h self-contained
	clk: sunxi-ng: v3s: add the missing PLL_DDR1
	PM: sleep: Fix possible overflow in pm_system_cancel_wakeup()
	libertas_tf: Use correct channel range in lbtf_geo_init
	qed: reduce maximum stack frame size
	usb: host: xhci-hub: fix extra endianness conversion
	media: rcar-vin: Clean up correct notifier in error path
	mic: avoid statically declaring a 'struct device'.
	x86/kgbd: Use NMI_VECTOR not APIC_DM_NMI
	crypto: ccp - Reduce maximum stack usage
	ALSA: aoa: onyx: always initialize register read value
	arm64: dts: renesas: r8a77995: Fix register range of display node
	tipc: reduce risk of wakeup queue starvation
	ARM: dts: stm32: add missing vdda-supply to adc on stm32h743i-eval
	net/mlx5: Fix mlx5_ifc_query_lag_out_bits
	cifs: fix rmmod regression in cifs.ko caused by force_sig changes
	iio: tsl2772: Use devm_add_action_or_reset for tsl2772_chip_off
	net: fix bpf_xdp_adjust_head regression for generic-XDP
	spi: bcm-qspi: Fix BSPI QUAD and DUAL mode support when using flex mode
	cxgb4: smt: Add lock for atomic_dec_and_test
	crypto: caam - free resources in case caam_rng registration failed
	ext4: set error return correctly when ext4_htree_store_dirent fails
	RDMA/hns: Bugfix for slab-out-of-bounds when unloading hip08 driver
	RDMA/hns: bugfix for slab-out-of-bounds when loading hip08 driver
	ASoC: es8328: Fix copy-paste error in es8328_right_line_controls
	ASoC: cs4349: Use PM ops 'cs4349_runtime_pm'
	ASoC: wm8737: Fix copy-paste error in wm8737_snd_controls
	net/rds: Add a few missing rds_stat_names entries
	tools: bpftool: fix arguments for p_err() in do_event_pipe()
	tools: bpftool: fix format strings and arguments for jsonw_printf()
	drm: rcar-du: lvds: Fix bridge_to_rcar_lvds
	bnxt_en: Fix handling FRAG_ERR when NVM_INSTALL_UPDATE cmd fails
	signal: Allow cifs and drbd to receive their terminating signals
	powerpc/64s/radix: Fix memory hot-unplug page table split
	ASoC: sun4i-i2s: RX and TX counter registers are swapped
	dmaengine: dw: platform: Switch to acpi_dma_controller_register()
	rtc: rv3029: revert error handling patch to rv3029_eeprom_write()
	mac80211: minstrel_ht: fix per-group max throughput rate initialization
	i40e: reduce stack usage in i40e_set_fc
	media: atmel: atmel-isi: fix timeout value for stop streaming
	ARM: 8896/1: VDSO: Don't leak kernel addresses
	rtc: pcf2127: bugfix: read rtc disables watchdog
	mips: avoid explicit UB in assignment of mips_io_port_base
	media: em28xx: Fix exception handling in em28xx_alloc_urbs()
	iommu/mediatek: Fix iova_to_phys PA start for 4GB mode
	ahci: Do not export local variable ahci_em_messages
	rxrpc: Fix lack of conn cleanup when local endpoint is cleaned up [ver #2]
	Partially revert "kfifo: fix kfifo_alloc() and kfifo_init()"
	hwmon: (lm75) Fix write operations for negative temperatures
	net/sched: cbs: Set default link speed to 10 Mbps in cbs_set_port_rate
	power: supply: Init device wakeup after device_add()
	x86, perf: Fix the dependency of the x86 insn decoder selftest
	staging: greybus: light: fix a couple double frees
	irqdomain: Add the missing assignment of domain->fwnode for named fwnode
	bcma: fix incorrect update of BCMA_CORE_PCI_MDIO_DATA
	usb: typec: tps6598x: Fix build error without CONFIG_REGMAP_I2C
	bcache: Fix an error code in bch_dump_read()
	iio: dac: ad5380: fix incorrect assignment to val
	netfilter: ctnetlink: honor IPS_OFFLOAD flag
	ath9k: dynack: fix possible deadlock in ath_dynack_node_{de}init
	wcn36xx: use dynamic allocation for large variables
	tty: serial: fsl_lpuart: Use appropriate lpuart32_* I/O funcs
	ARM: dts: aspeed-g5: Fixe gpio-ranges upper limit
	xsk: avoid store-tearing when assigning queues
	xsk: avoid store-tearing when assigning umem
	led: triggers: Fix dereferencing of null pointer
	net: sonic: return NETDEV_TX_OK if failed to map buffer
	net: hns3: fix error VF index when setting VLAN offload
	rtlwifi: Fix file release memory leak
	ARM: dts: logicpd-som-lv: Fix i2c2 and i2c3 Pin mux
	f2fs: fix wrong error injection path in inc_valid_block_count()
	f2fs: fix error path of f2fs_convert_inline_page()
	scsi: fnic: fix msix interrupt allocation
	Btrfs: fix hang when loading existing inode cache off disk
	Btrfs: fix inode cache waiters hanging on failure to start caching thread
	Btrfs: fix inode cache waiters hanging on path allocation failure
	btrfs: use correct count in btrfs_file_write_iter()
	ixgbe: sync the first fragment unconditionally
	hwmon: (shtc1) fix shtc1 and shtw1 id mask
	net: sonic: replace dev_kfree_skb in sonic_send_packet
	pinctrl: iproc-gpio: Fix incorrect pinconf configurations
	gpio/aspeed: Fix incorrect number of banks
	ath10k: adjust skb length in ath10k_sdio_mbox_rx_packet
	RDMA/cma: Fix false error message
	net/rds: Fix 'ib_evt_handler_call' element in 'rds_ib_stat_names'
	um: Fix off by one error in IRQ enumeration
	bnxt_en: Increase timeout for HWRM_DBG_COREDUMP_XX commands
	f2fs: fix to avoid accessing uninitialized field of inode page in is_alive()
	mailbox: qcom-apcs: fix max_register value
	clk: actions: Fix factor clk struct member access
	powerpc/mm/mce: Keep irqs disabled during lockless page table walk
	bpf: fix BTF limits
	crypto: hisilicon - Matching the dma address for dma_pool_free()
	iommu/amd: Wait for completion of IOTLB flush in attach_device
	net: aquantia: Fix aq_vec_isr_legacy() return value
	cxgb4: Signedness bug in init_one()
	net: hisilicon: Fix signedness bug in hix5hd2_dev_probe()
	net: broadcom/bcmsysport: Fix signedness in bcm_sysport_probe()
	net: netsec: Fix signedness bug in netsec_probe()
	net: socionext: Fix a signedness bug in ave_probe()
	net: stmmac: dwmac-meson8b: Fix signedness bug in probe
	net: axienet: fix a signedness bug in probe
	of: mdio: Fix a signedness bug in of_phy_get_and_connect()
	net: nixge: Fix a signedness bug in nixge_probe()
	net: ethernet: stmmac: Fix signedness bug in ipq806x_gmac_of_parse()
	net: sched: cbs: Avoid division by zero when calculating the port rate
	nvme: retain split access workaround for capability reads
	net: stmmac: gmac4+: Not all Unicast addresses may be available
	rxrpc: Fix trace-after-put looking at the put connection record
	mac80211: accept deauth frames in IBSS mode
	llc: fix another potential sk_buff leak in llc_ui_sendmsg()
	llc: fix sk_buff refcounting in llc_conn_state_process()
	ip6erspan: remove the incorrect mtu limit for ip6erspan
	net: stmmac: fix length of PTP clock's name string
	net: stmmac: fix disabling flexible PPS output
	sctp: add chunks to sk_backlog when the newsk sk_socket is not set
	s390/qeth: Fix error handling during VNICC initialization
	s390/qeth: Fix initialization of vnicc cmd masks during set online
	act_mirred: Fix mirred_init_module error handling
	net: avoid possible false sharing in sk_leave_memory_pressure()
	net: add {READ|WRITE}_ONCE() annotations on ->rskq_accept_head
	tcp: annotate lockless access to tcp_memory_pressure
	net/smc: receive returns without data
	net/smc: receive pending data after RCV_SHUTDOWN
	drm/msm/dsi: Implement reset correctly
	vhost/test: stop device before reset
	dmaengine: imx-sdma: fix size check for sdma script_number
	firmware: dmi: Fix unlikely out-of-bounds read in save_mem_devices
	arm64: hibernate: check pgd table allocation
	net: netem: fix error path for corrupted GSO frames
	net: netem: correct the parent's backlog when corrupted packet was dropped
	xsk: Fix registration of Rx-only sockets
	bpf, offload: Unlock on error in bpf_offload_dev_create()
	afs: Fix missing timeout reset
	net: qca_spi: Move reset_count to struct qcaspi
	hv_netvsc: Fix offset usage in netvsc_send_table()
	hv_netvsc: Fix send_table offset in case of a host bug
	afs: Fix large file support
	drm: panel-lvds: Potential Oops in probe error handling
	hwrng: omap3-rom - Fix missing clock by probing with device tree
	dpaa_eth: perform DMA unmapping before read
	dpaa_eth: avoid timestamp read on error paths
	MIPS: Loongson: Fix return value of loongson_hwmon_init
	hv_netvsc: flag software created hash value
	net: neigh: use long type to store jiffies delta
	packet: fix data-race in fanout_flow_is_huge()
	i2c: stm32f7: report dma error during probe
	mmc: sdio: fix wl1251 vendor id
	mmc: core: fix wl1251 sdio quirks
	affs: fix a memory leak in affs_remount
	afs: Remove set but not used variables 'before', 'after'
	dmaengine: ti: edma: fix missed failure handling
	drm/radeon: fix bad DMA from INTERRUPT_CNTL2
	arm64: dts: juno: Fix UART frequency
	samples/bpf: Fix broken xdp_rxq_info due to map order assumptions
	usb: dwc3: Allow building USB_DWC3_QCOM without EXTCON
	IB/iser: Fix dma_nents type definition
	serial: stm32: fix clearing interrupt error flags
	arm64: dts: meson-gxm-khadas-vim2: fix uart_A bluetooth node
	m68k: Call timer_interrupt() with interrupts disabled
	Linux 4.19.99

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ieabeab79ea5c8cb4b6b1552702fa5d6100cea5db
2020-01-27 15:55:44 +01:00
Eric Dumazet
a92c895e22 tcp: annotate lockless access to tcp_memory_pressure
[ Upstream commit 1f142c17d19a5618d5a633195a46f2c8be9bf232 ]

tcp_memory_pressure is read without holding any lock,
and its value could be changed on other cpus.

Use READ_ONCE() to annotate these lockless reads.

The write side is already using atomic ops.

Fixes: b8da51ebb1 ("tcp: introduce tcp_under_memory_pressure()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27 14:51:18 +01:00
Eric Dumazet
b0fb910bfd net: add {READ|WRITE}_ONCE() annotations on ->rskq_accept_head
[ Upstream commit 60b173ca3d1cd1782bd0096dc17298ec242f6fb1 ]

reqsk_queue_empty() is called from inet_csk_listen_poll() while
other cpus might write ->rskq_accept_head value.

Use {READ|WRITE}_ONCE() to avoid compiler tricks
and potential KCSAN splats.

Fixes: fff1f3001c ("tcp: add a spinlock to protect struct request_sock_queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27 14:51:18 +01:00
Fred Klassen
1441242c35 net/udp_gso: Allow TX timestamp with UDP GSO
[ Upstream commit 76e21533a48bb42d1fa894f93f6233bf4554f45e ]

Fixes an issue where TX Timestamps are not arriving on the error queue
when UDP_SEGMENT CMSG type is combined with CMSG type SO_TIMESTAMPING.
This can be illustrated with an updated updgso_bench_tx program which
includes the '-T' option to test for this condition. It also introduces
the '-P' option which will call poll() before reading the error queue.

    ./udpgso_bench_tx -4ucTPv -S 1472 -l2 -D 172.16.120.18
    poll timeout
    udp tx:      0 MB/s        1 calls/s      1 msg/s

The "poll timeout" message above indicates that TX timestamp never
arrived.

This patch preserves tx_flags for the first UDP GSO segment. Only the
first segment is timestamped, even though in some cases there may be
benefital in timestamping both the first and last segment.

Factors in deciding on first segment timestamp only:

- Timestamping both first and last segmented is not feasible. Hardware
can only have one outstanding TS request at a time.

- Timestamping last segment may under report network latency of the
previous segments. Even though the doorbell is suppressed, the ring
producer counter has been incremented.

- Timestamping the first segment has the upside in that it reports
timestamps from the application's view, e.g. RTT.

- Timestamping the first segment has the downside that it may
underreport tx host network latency. It appears that we have to pick
one or the other. And possibly follow-up with a config flag to choose
behavior.

v2: Remove tests as noted by Willem de Bruijn <willemb@google.com>
    Moving tests from net to net-next

v3: Update only relevant tx_flag bits as per
    Willem de Bruijn <willemb@google.com>

v4: Update comments and commit message as per
    Willem de Bruijn <willemb@google.com>

Fixes: ee80d1ebe5 ("udp: add udp gso")
Signed-off-by: Fred Klassen <fklassen@appneta.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27 14:50:56 +01:00
Jakub Kicinski
7b245fbd23 net: don't clear sock->sk early to avoid trouble in strparser
[ Upstream commit 2b81f8161dfeda4017cef4f2498ccb64b13f0d61 ]

af_inet sets sock->sk to NULL which trips strparser over:

BUG: kernel NULL pointer dereference, address: 0000000000000012
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.2.0-rc1-00139-g14629453a6d3 #21
RIP: 0010:tcp_peek_len+0x10/0x60
RSP: 0018:ffffc02e41c54b98 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff9cf924c4e030 RCX: 0000000000000051
RDX: 0000000000000000 RSI: 000000000000000c RDI: ffff9cf97128f480
RBP: ffff9cf9365e0300 R08: ffff9cf94fe7d2c0 R09: 0000000000000000
R10: 000000000000036b R11: ffff9cf939735e00 R12: ffff9cf91ad9ae40
R13: ffff9cf924c4e000 R14: ffff9cf9a8fcbaae R15: 0000000000000020
FS: 0000000000000000(0000) GS:ffff9cf9af7c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000012 CR3: 000000013920a003 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
 <IRQ>
 strp_data_ready+0x48/0x90
 tls_data_ready+0x22/0xd0 [tls]
 tcp_rcv_established+0x569/0x620
 tcp_v4_do_rcv+0x127/0x1e0
 tcp_v4_rcv+0xad7/0xbf0
 ip_protocol_deliver_rcu+0x2c/0x1c0
 ip_local_deliver_finish+0x41/0x50
 ip_local_deliver+0x6b/0xe0
 ? ip_protocol_deliver_rcu+0x1c0/0x1c0
 ip_rcv+0x52/0xd0
 ? ip_rcv_finish_core.isra.20+0x380/0x380
 __netif_receive_skb_one_core+0x7e/0x90
 netif_receive_skb_internal+0x42/0xf0
 napi_gro_receive+0xed/0x150
 nfp_net_poll+0x7a2/0xd30 [nfp]
 ? kmem_cache_free_bulk+0x286/0x310
 net_rx_action+0x149/0x3b0
 __do_softirq+0xe3/0x30a
 ? handle_irq_event_percpu+0x6a/0x80
 irq_exit+0xe8/0xf0
 do_IRQ+0x85/0xd0
 common_interrupt+0xf/0xf
 </IRQ>
RIP: 0010:cpuidle_enter_state+0xbc/0x450

To avoid this issue set sock->sk after sk_prot->close.
My grepping and testing did not discover any code which
would depend on the current behaviour.

Fixes: c46234ebb4 ("tls: RX path for ktls")
Reported-by: David Beckett <david.beckett@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27 14:50:52 +01:00
wenxu
7761d0c1c4 ip_tunnel: Fix route fl4 init in ip_md_tunnel_xmit
[ Upstream commit 6e6b904ad4f9aed43ec320afbd5a52ed8461ab41 ]

Init the gre_key from tuninfo->key.tun_id and init the mark
from the skb->mark, set the oif to zero in the collect metadata
mode.

Fixes: cfc7381b30 ("ip_tunnel: add collect_md mode to IPIP tunnel")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27 14:50:16 +01:00
Willem de Bruijn
a03e8f192a net: always initialize pagedlen
[ Upstream commit aba36930a35e7f1fe1319b203f25c05d6c119936 ]

In ip packet generation, pagedlen is initialized for each skb at the
start of the loop in __ip(6)_append_data, before label alloc_new_skb.

Depending on compiler options, code can be generated that jumps to
this label, triggering use of an an uninitialized variable.

In practice, at -O2, the generated code moves the initialization below
the label. But the code should not rely on that for correctness.

Fixes: 15e36f5b8e ("udp: paged allocation with gso")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27 14:50:03 +01:00
Greg Kroah-Hartman
8cb4870403 This is the 4.19.98 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl4pSYMACgkQONu9yGCS
 aT7Rkg/8C/AXaTp+2HxRj3ZO56uzpMBMb5duBzdzxnEnvFp+DIM7xxRX+NFI5CSK
 4rjnxMd2tPsFtqiWo/bBCUcHh9gu5HJKOMFRZGaRYAXvJ/8hgahgzkBE00JiAB6r
 mrk9Y/pwcKxMFsAHtu3xM0oENeefXOmavVTHc9N3DQLd3hNuyTrPztBMFaDg8djR
 pSwh1uE2G+Z2UOdi2kXmHiEIG6NViIqp+qFYI5CUIyeKfvOEsR5nSQ97LyNQ+dUX
 qshARQFuk78+Ax+GNPTQXiWdzN7+SH5aw5frFtdhAN90F+XrRDj4ZXw+EkX+/M2J
 NZU9P/v41ESG8RWxbAZ6osAUkQ4Dgq2BQpdyRxNNjTchXc0Kr4K6BCKuhY6cGxS7
 0PXPV7MsuAHYIrIvzG2lqif9gmknA0UrGVKuYJIZxBaWlHD2mEkFby0W0HIcBwir
 yKKK3fkFjmsGKYzh+VZVoGySWDbs7qYASWXHOCz0QCLb0CT8/ePbyxLdjY7u5KyX
 wDaDHXG9nm6Nu68HD/9CRnUkiK8dnsODZ0k+sBZfEa+xvHPJCdv3gnrf4SwU7dj7
 ZyhO9XkFzncOJDoxYxiXTfI+zbU1ZhaDw7fk2PFvAI6P1xRS3m6rp8pDWp8iw/MX
 92Sz1YzS68+otHLi+OBGxzu10PwMDtu2nUvqn68SYq6Rp0mZnnE=
 =2O94
 -----END PGP SIGNATURE-----

Merge 4.19.98 into android-4.19

Changes in 4.19.98
	ARM: dts: meson8: fix the size of the PMU registers
	clk: qcom: gcc-sdm845: Add missing flag to votable GDSCs
	dt-bindings: reset: meson8b: fix duplicate reset IDs
	ARM: dts: imx6q-dhcom: fix rtc compatible
	clk: Don't try to enable critical clocks if prepare failed
	ASoC: msm8916-wcd-digital: Reset RX interpolation path after use
	iio: buffer: align the size of scan bytes to size of the largest element
	USB: serial: simple: Add Motorola Solutions TETRA MTP3xxx and MTP85xx
	USB: serial: option: Add support for Quectel RM500Q
	USB: serial: opticon: fix control-message timeouts
	USB: serial: option: add support for Quectel RM500Q in QDL mode
	USB: serial: suppress driver bind attributes
	USB: serial: ch341: handle unbound port at reset_resume
	USB: serial: io_edgeport: handle unbound ports on URB completion
	USB: serial: io_edgeport: add missing active-port sanity check
	USB: serial: keyspan: handle unbound ports
	USB: serial: quatech2: handle unbound ports
	scsi: fnic: fix invalid stack access
	scsi: mptfusion: Fix double fetch bug in ioctl
	ASoC: msm8916-wcd-analog: Fix selected events for MIC BIAS External1
	ASoC: msm8916-wcd-analog: Fix MIC BIAS Internal1
	ARM: dts: imx6q-dhcom: Fix SGTL5000 VDDIO regulator connection
	ALSA: dice: fix fallback from protocol extension into limited functionality
	ALSA: seq: Fix racy access for queue timer in proc read
	ALSA: usb-audio: fix sync-ep altsetting sanity check
	arm64: dts: allwinner: a64: olinuxino: Fix SDIO supply regulator
	Fix built-in early-load Intel microcode alignment
	block: fix an integer overflow in logical block size
	ARM: dts: am571x-idk: Fix gpios property to have the correct gpio number
	LSM: generalize flag passing to security_capable
	ptrace: reintroduce usage of subjective credentials in ptrace_has_cap()
	usb: core: hub: Improved device recognition on remote wakeup
	x86/resctrl: Fix an imbalance in domain_remove_cpu()
	x86/CPU/AMD: Ensure clearing of SME/SEV features is maintained
	x86/efistub: Disable paging at mixed mode entry
	drm/i915: Add missing include file <linux/math64.h>
	x86/resctrl: Fix potential memory leak
	perf hists: Fix variable name's inconsistency in hists__for_each() macro
	perf report: Fix incorrectly added dimensions as switch perf data file
	mm/shmem.c: thp, shmem: fix conflict of above-47bit hint address and PMD alignment
	mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid
	btrfs: rework arguments of btrfs_unlink_subvol
	btrfs: fix invalid removal of root ref
	btrfs: do not delete mismatched root refs
	btrfs: fix memory leak in qgroup accounting
	mm/page-writeback.c: avoid potential division by zero in wb_min_max_ratio()
	ARM: dts: imx6qdl: Add Engicam i.Core 1.5 MX6
	ARM: dts: imx6q-icore-mipi: Use 1.5 version of i.Core MX6DL
	ARM: dts: imx7: Fix Toradex Colibri iMX7S 256MB NAND flash support
	net: stmmac: 16KB buffer must be 16 byte aligned
	net: stmmac: Enable 16KB buffer size
	mm/huge_memory.c: make __thp_get_unmapped_area static
	mm/huge_memory.c: thp: fix conflict of above-47bit hint address and PMD alignment
	arm64: dts: agilex/stratix10: fix pmu interrupt numbers
	bpf: Fix incorrect verifier simulation of ARSH under ALU32
	cfg80211: fix deadlocks in autodisconnect work
	cfg80211: fix memory leak in cfg80211_cqm_rssi_update
	cfg80211: fix page refcount issue in A-MSDU decap
	netfilter: fix a use-after-free in mtype_destroy()
	netfilter: arp_tables: init netns pointer in xt_tgdtor_param struct
	netfilter: nft_tunnel: fix null-attribute check
	netfilter: nf_tables: remove WARN and add NLA_STRING upper limits
	netfilter: nf_tables: store transaction list locally while requesting module
	netfilter: nf_tables: fix flowtable list del corruption
	NFC: pn533: fix bulk-message timeout
	batman-adv: Fix DAT candidate selection on little endian systems
	macvlan: use skb_reset_mac_header() in macvlan_queue_xmit()
	hv_netvsc: Fix memory leak when removing rndis device
	net: dsa: tag_qca: fix doubled Tx statistics
	net: hns: fix soft lockup when there is not enough memory
	net: usb: lan78xx: limit size of local TSO packets
	net/wan/fsl_ucc_hdlc: fix out of bounds write on array utdm_info
	ptp: free ptp device pin descriptors properly
	r8152: add missing endpoint sanity check
	tcp: fix marked lost packets not being retransmitted
	sh_eth: check sh_eth_cpu_data::dual_port when dumping registers
	mlxsw: spectrum: Wipe xstats.backlog of down ports
	mlxsw: spectrum_qdisc: Include MC TCs in Qdisc counters
	xen/blkfront: Adjust indentation in xlvbd_alloc_gendisk
	tcp: refine rule to allow EPOLLOUT generation under mem pressure
	irqchip: Place CONFIG_SIFIVE_PLIC into the menu
	cw1200: Fix a signedness bug in cw1200_load_firmware()
	arm64: dts: meson-gxl-s905x-khadas-vim: fix gpio-keys-polled node
	cfg80211: check for set_wiphy_params
	tick/sched: Annotate lockless access to last_jiffies_update
	arm64: dts: marvell: Fix CP110 NAND controller node multi-line comment alignment
	Revert "arm64: dts: juno: add dma-ranges property"
	mtd: devices: fix mchp23k256 read and write
	drm/nouveau/bar/nv50: check bar1 vmm return value
	drm/nouveau/bar/gf100: ensure BAR is mapped
	drm/nouveau/mmu: qualify vmm during dtor
	reiserfs: fix handling of -EOPNOTSUPP in reiserfs_for_each_xattr
	scsi: esas2r: unlock on error in esas2r_nvram_read_direct()
	scsi: qla4xxx: fix double free bug
	scsi: bnx2i: fix potential use after free
	scsi: target: core: Fix a pr_debug() argument
	scsi: qla2xxx: Fix qla2x00_request_irqs() for MSI
	scsi: qla2xxx: fix rports not being mark as lost in sync fabric scan
	scsi: core: scsi_trace: Use get_unaligned_be*()
	perf probe: Fix wrong address verification
	clk: sprd: Use IS_ERR() to validate the return value of syscon_regmap_lookup_by_phandle()
	regulator: ab8500: Remove SYSCLKREQ from enum ab8505_regulator_id
	hwmon: (pmbus/ibm-cffps) Switch LEDs to blocking brightness call
	Linux 4.19.98

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I74a43a9e60734aec6d24b10374ba97de89172eca
2020-01-23 08:36:16 +01:00
Eric Dumazet
b23477d818 tcp: refine rule to allow EPOLLOUT generation under mem pressure
commit 216808c6ba6d00169fd2aa928ec3c0e63bef254f upstream.

At the time commit ce5ec44099 ("tcp: ensure epoll edge trigger
wakeup when write queue is empty") was added to the kernel,
we still had a single write queue, combining rtx and write queues.

Once we moved the rtx queue into a separate rb-tree, testing
if sk_write_queue is empty has been suboptimal.

Indeed, if we have packets in the rtx queue, we probably want
to delay the EPOLLOUT generation at the time incoming packets
will free them, making room, but more importantly avoiding
flooding application with EPOLLOUT events.

Solution is to use tcp_rtx_and_write_queues_empty() helper.

Fixes: 75c119afe1 ("tcp: implement rb-tree based retransmit queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-23 08:21:36 +01:00
Pengcheng Yang
34e855f998 tcp: fix marked lost packets not being retransmitted
[ Upstream commit e176b1ba476cf36f723cfcc7a9e57f3cb47dec70 ]

When the packet pointed to by retransmit_skb_hint is unlinked by ACK,
retransmit_skb_hint will be set to NULL in tcp_clean_rtx_queue().
If packet loss is detected at this time, retransmit_skb_hint will be set
to point to the current packet loss in tcp_verify_retransmit_hint(),
then the packets that were previously marked lost but not retransmitted
due to the restriction of cwnd will be skipped and cannot be
retransmitted.

To fix this, when retransmit_skb_hint is NULL, retransmit_skb_hint can
be reset only after all marked lost packets are retransmitted
(retrans_out >= lost_out), otherwise we need to traverse from
tcp_rtx_queue_head in tcp_xmit_retransmit_queue().

Packetdrill to demonstrate:

// Disable RACK and set max_reordering to keep things simple
    0 `sysctl -q net.ipv4.tcp_recovery=0`
   +0 `sysctl -q net.ipv4.tcp_max_reordering=3`

// Establish a connection
   +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

  +.1 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
   +0 > S. 0:0(0) ack 1 <...>
 +.01 < . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4

// Send 8 data segments
   +0 write(4, ..., 8000) = 8000
   +0 > P. 1:8001(8000) ack 1

// Enter recovery and 1:3001 is marked lost
 +.01 < . 1:1(0) ack 1 win 257 <sack 3001:4001,nop,nop>
   +0 < . 1:1(0) ack 1 win 257 <sack 5001:6001 3001:4001,nop,nop>
   +0 < . 1:1(0) ack 1 win 257 <sack 5001:7001 3001:4001,nop,nop>

// Retransmit 1:1001, now retransmit_skb_hint points to 1001:2001
   +0 > . 1:1001(1000) ack 1

// 1001:2001 was ACKed causing retransmit_skb_hint to be set to NULL
 +.01 < . 1:1(0) ack 2001 win 257 <sack 5001:8001 3001:4001,nop,nop>
// Now retransmit_skb_hint points to 4001:5001 which is now marked lost

// BUG: 2001:3001 was not retransmitted
   +0 > . 2001:3001(1000) ack 1

Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-23 08:21:35 +01:00
Florian Westphal
e3282417b9 netfilter: arp_tables: init netns pointer in xt_tgdtor_param struct
commit 212e7f56605ef9688d0846db60c6c6ec06544095 upstream.

An earlier commit (1b789577f655060d98d20e,
"netfilter: arp_tables: init netns pointer in xt_tgchk_param struct")
fixed missing net initialization for arptables, but turns out it was
incomplete.  We can get a very similar struct net NULL deref during
error unwinding:

general protection fault: 0000 [#1] PREEMPT SMP KASAN
RIP: 0010:xt_rateest_put+0xa1/0x440 net/netfilter/xt_RATEEST.c:77
 xt_rateest_tg_destroy+0x72/0xa0 net/netfilter/xt_RATEEST.c:175
 cleanup_entry net/ipv4/netfilter/arp_tables.c:509 [inline]
 translate_table+0x11f4/0x1d80 net/ipv4/netfilter/arp_tables.c:587
 do_replace net/ipv4/netfilter/arp_tables.c:981 [inline]
 do_arpt_set_ctl+0x317/0x650 net/ipv4/netfilter/arp_tables.c:1461

Also init the netns pointer in xt_tgdtor_param struct.

Fixes: add6746124 ("netfilter: add struct net * to target parameters")
Reported-by: syzbot+91bdd8eece0f6629ec8b@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-23 08:21:33 +01:00
Greg Kroah-Hartman
21e3a71314 This is the 4.19.96 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl4eEV4ACgkQONu9yGCS
 aT6D6Q//W205i6vBtK440pV659K29fw2+XYhJnG+3kkcXJ1GBEHlS7ddZCzV1lua
 uOLYxG2J0A3D0WC8S78XwE1ZXe1YTor647qW0H6pDeZIj/ID2x4TEN6rZq0YlCb7
 JLx6d+yb7MkK9fHGXVyTZbKESLr7WimDnZckWbO4xWTqtlh195ygMYxdwTxa97ZP
 rMJXR/qS/n5rlgQE8BEFsPZBNcsSVKecIi8ibSWu8zDJerBb2h0TuUMKvn1fteG9
 O69y8MeoJSWGTsorii+f7toGKhECFRWSrDj9GhK4SFTj3eV5evrkozAWJpXuzIOA
 9ou9OHUQnJxpXaVyBxiUDkP58tDYYddsrSCHFSNPtPcsTJSxDUEwZBDXr7V8TyVq
 axjoXgHwXOg9qi5IvEOvOBIahTX9OdKvR578eLQIFinZTRmZGKl2XW08olybcUw3
 ZpTBW6A97Xmhs4szFOQfgxRrc3n0F4+Mk9tFp2HFzSu23y6A/wwctIt4xoOWJ8Nq
 ouZbGwI3Em6rtTWDufNJbagm5QYLEcsS5F4Ala6uGIxBXs8mD7dAisILEXxFfSAn
 aPzPvsKh1vMhYBFmTStsowxy1pjeuUMHukrBQLDtmhzR4aBs5GXCJi/Iq2ojC123
 LLAnG/ytLyYyVUxpYyNMsconLITZ1iS3iUAIcUHeYcEFZ3JAV3E=
 =D3a0
 -----END PGP SIGNATURE-----

Merge 4.19.96 into android-4.19

Changes in 4.19.96
	chardev: Avoid potential use-after-free in 'chrdev_open()'
	i2c: fix bus recovery stop mode timing
	usb: chipidea: host: Disable port power only if previously enabled
	ALSA: usb-audio: Apply the sample rate quirk for Bose Companion 5
	ALSA: hda/realtek - Add new codec supported for ALCS1200A
	ALSA: hda/realtek - Set EAPD control to default for ALC222
	ALSA: hda/realtek - Add quirk for the bass speaker on Lenovo Yoga X1 7th gen
	kernel/trace: Fix do not unregister tracepoints when register sched_migrate_task fail
	tracing: Have stack tracer compile when MCOUNT_INSN_SIZE is not defined
	tracing: Change offset type to s32 in preempt/irq tracepoints
	HID: Fix slab-out-of-bounds read in hid_field_extract
	HID: uhid: Fix returning EPOLLOUT from uhid_char_poll
	HID: hid-input: clear unmapped usages
	Input: add safety guards to input_set_keycode()
	Input: input_event - fix struct padding on sparc64
	drm/sun4i: tcon: Set RGB DCLK min. divider based on hardware model
	drm/fb-helper: Round up bits_per_pixel if possible
	drm/dp_mst: correct the shifting in DP_REMOTE_I2C_READ
	can: kvaser_usb: fix interface sanity check
	can: gs_usb: gs_usb_probe(): use descriptors of current altsetting
	can: mscan: mscan_rx_poll(): fix rx path lockup when returning from polling to irq mode
	can: can_dropped_invalid_skb(): ensure an initialized headroom in outgoing CAN sk_buffs
	gpiolib: acpi: Turn dmi_system_id table into a generic quirk table
	gpiolib: acpi: Add honor_wakeup module-option + quirk mechanism
	staging: vt6656: set usb_set_intfdata on driver fail.
	USB: serial: option: add ZLP support for 0x1bc7/0x9010
	usb: musb: fix idling for suspend after disconnect interrupt
	usb: musb: Disable pullup at init
	usb: musb: dma: Correct parameter passed to IRQ handler
	staging: comedi: adv_pci1710: fix AI channels 16-31 for PCI-1713
	staging: rtl8188eu: Add device code for TP-Link TL-WN727N v5.21
	serdev: Don't claim unsupported ACPI serial devices
	tty: link tty and port before configuring it as console
	tty: always relink the port
	mwifiex: fix possible heap overflow in mwifiex_process_country_ie()
	mwifiex: pcie: Fix memory leak in mwifiex_pcie_alloc_cmdrsp_buf
	scsi: bfa: release allocated memory in case of error
	rtl8xxxu: prevent leaking urb
	ath10k: fix memory leak
	HID: hiddev: fix mess in hiddev_open()
	USB: Fix: Don't skip endpoint descriptors with maxpacket=0
	phy: cpcap-usb: Fix error path when no host driver is loaded
	phy: cpcap-usb: Fix flakey host idling and enumerating of devices
	netfilter: arp_tables: init netns pointer in xt_tgchk_param struct
	netfilter: conntrack: dccp, sctp: handle null timeout argument
	netfilter: ipset: avoid null deref when IPSET_ATTR_LINENO is present
	drm/i915/gen9: Clear residual context state on context switch
	Linux 4.19.96

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I261d6a9e90a5461701f74e3ca1482e3c00939f3e
2020-01-15 08:57:09 +01:00
Florian Westphal
46abb2a5cd netfilter: arp_tables: init netns pointer in xt_tgchk_param struct
commit 1b789577f655060d98d20ed0c6f9fbd469d6ba63 upstream.

We get crash when the targets checkentry function tries to make
use of the network namespace pointer for arptables.

When the net pointer got added back in 2010, only ip/ip6/ebtables were
changed to initialize it, so arptables has this set to NULL.

This isn't a problem for normal arptables because no existing
arptables target has a checkentry function that makes use of par->net.

However, direct users of the setsockopt interface can provide any
target they want as long as its registered for ARP or UNPSEC protocols.

syzkaller managed to send a semi-valid arptables rule for RATEEST target
which is enough to trigger NULL deref:

kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
RIP: xt_rateest_tg_checkentry+0x11d/0xb40 net/netfilter/xt_RATEEST.c:109
[..]
 xt_check_target+0x283/0x690 net/netfilter/x_tables.c:1019
 check_target net/ipv4/netfilter/arp_tables.c:399 [inline]
 find_check_entry net/ipv4/netfilter/arp_tables.c:422 [inline]
 translate_table+0x1005/0x1d70 net/ipv4/netfilter/arp_tables.c:572
 do_replace net/ipv4/netfilter/arp_tables.c:977 [inline]
 do_arpt_set_ctl+0x310/0x640 net/ipv4/netfilter/arp_tables.c:1456

Fixes: add6746124 ("netfilter: add struct net * to target parameters")
Reported-by: syzbot+d7358a458d8a81aee898@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-14 20:07:08 +01:00
Greg Kroah-Hartman
5da11144c3 This is the 4.19.95 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl4bAEoACgkQONu9yGCS
 aT70DQ//YgSd3JJ9/d7TLy+Mm8GrlZIZWgRrz8YZdKcJBG+l/c+m6FR0uLDw95nf
 zIFq1GY8DwS3djuNkxPoz8yvIgqoHuSjEUo+YF8v71ZdnGCXIt6SCoKbugAH+azp
 zTEJ4lU3/Wrc1Gh4w5zeSgdQVPIJBQ+d7pKccZHOJ0DFwbU3hQ69vqXcaxrFuhop
 vqKvNNEeCT2l1AxgAhKwNhceFL3Rb/2HxAjUED68ueY/EZ0/OsoinboBu2riSKNu
 NWZTPq7B2Kht19aV48FThwksEvkelz72gkSzKvrsT03tDtNewQ7C/mreUqpU6VZE
 lrcIWOHjXMQyk3g3XZfG3j8ppzHvZPaG/WEqmnJV8txjtaRU8hvj3Q95kHPheTv6
 O/Ds4OpHBaS5g6+dmsUmtfSrbrhm3KKXMn0IJvyYkC/VY0RVdLKTZu0TM5ZS5id/
 zW4AZWWzephUID9LAwRrTWvfGKMBK6gEiv2AtnY4XRiEMYEFw45uP97yWcUCg89L
 a3BCbhhiO4tMqL/lf7KD4vPbXZsRgDM3q1wLitvM2KCVVSA32XGRMT9x8A6uwTJl
 OLL39NSi8h8rqn5S22AowKcR/3VakRNw9pf5Y6AH5xOnP3R/OBHzYJNho2DUc8OI
 6p+zeUZkDEIsFI1wdJSaaSWsNdpJj3Sw6SdB2QJI3c782F4pFUA=
 =RCVS
 -----END PGP SIGNATURE-----

Merge 4.19.95 into android-4.19

Changes in 4.19.95
	USB: dummy-hcd: use usb_urb_dir_in instead of usb_pipein
	USB: dummy-hcd: increase max number of devices to 32
	bpf: Fix passing modified ctx to ld/abs/ind instruction
	regulator: fix use after free issue
	ASoC: max98090: fix possible race conditions
	locking/spinlock/debug: Fix various data races
	netfilter: ctnetlink: netns exit must wait for callbacks
	mwifiex: Fix heap overflow in mmwifiex_process_tdls_action_frame()
	libtraceevent: Fix lib installation with O=
	x86/efi: Update e820 with reserved EFI boot services data to fix kexec breakage
	ASoC: Intel: bytcr_rt5640: Update quirk for Teclast X89
	efi/gop: Return EFI_NOT_FOUND if there are no usable GOPs
	efi/gop: Return EFI_SUCCESS if a usable GOP was found
	efi/gop: Fix memory leak in __gop_query32/64()
	ARM: dts: imx6ul: imx6ul-14x14-evk.dtsi: Fix SPI NOR probing
	ARM: vexpress: Set-up shared OPP table instead of individual for each CPU
	netfilter: uapi: Avoid undefined left-shift in xt_sctp.h
	netfilter: nft_set_rbtree: bogus lookup/get on consecutive elements in named sets
	netfilter: nf_tables: validate NFT_SET_ELEM_INTERVAL_END
	netfilter: nf_tables: validate NFT_DATA_VALUE after nft_data_init()
	ARM: dts: BCM5301X: Fix MDIO node address/size cells
	selftests/ftrace: Fix multiple kprobe testcase
	ARM: dts: Cygnus: Fix MDIO node address/size cells
	spi: spi-cavium-thunderx: Add missing pci_release_regions()
	ASoC: topology: Check return value for soc_tplg_pcm_create()
	ARM: dts: bcm283x: Fix critical trip point
	bnxt_en: Return error if FW returns more data than dump length
	bpf, mips: Limit to 33 tail calls
	spi: spi-ti-qspi: Fix a bug when accessing non default CS
	ARM: dts: am437x-gp/epos-evm: fix panel compatible
	samples: bpf: Replace symbol compare of trace_event
	samples: bpf: fix syscall_tp due to unused syscall
	powerpc: Ensure that swiotlb buffer is allocated from low memory
	btrfs: Fix error messages in qgroup_rescan_init
	bpf: Clear skb->tstamp in bpf_redirect when necessary
	bnx2x: Do not handle requests from VFs after parity
	bnx2x: Fix logic to get total no. of PFs per engine
	cxgb4: Fix kernel panic while accessing sge_info
	net: usb: lan78xx: Fix error message format specifier
	parisc: add missing __init annotation
	rfkill: Fix incorrect check to avoid NULL pointer dereference
	ASoC: wm8962: fix lambda value
	regulator: rn5t618: fix module aliases
	iommu/iova: Init the struct iova to fix the possible memleak
	kconfig: don't crash on NULL expressions in expr_eq()
	perf/x86/intel: Fix PT PMI handling
	fs: avoid softlockups in s_inodes iterators
	net: stmmac: Do not accept invalid MTU values
	net: stmmac: xgmac: Clear previous RX buffer size
	net: stmmac: RX buffer size must be 16 byte aligned
	net: stmmac: Always arm TX Timer at end of transmission start
	s390/purgatory: do not build purgatory with kcov, kasan and friends
	drm/exynos: gsc: add missed component_del
	s390/dasd/cio: Interpret ccw_device_get_mdc return value correctly
	s390/dasd: fix memleak in path handling error case
	block: fix memleak when __blk_rq_map_user_iov() is failed
	parisc: Fix compiler warnings in debug_core.c
	llc2: Fix return statement of llc_stat_ev_rx_null_dsap_xid_c (and _test_c)
	hv_netvsc: Fix unwanted rx_table reset
	powerpc/vcpu: Assume dedicated processors as non-preempt
	powerpc/spinlocks: Include correct header for static key
	cpufreq: imx6q: read OCOTP through nvmem for imx6ul/imx6ull
	ARM: dts: imx6ul: use nvmem-cells for cpu speed grading
	PCI/switchtec: Read all 64 bits of part_event_bitmap
	gtp: fix bad unlock balance in gtp_encap_enable_socket
	macvlan: do not assume mac_header is set in macvlan_broadcast()
	net: dsa: mv88e6xxx: Preserve priority when setting CPU port.
	net: stmmac: dwmac-sun8i: Allow all RGMII modes
	net: stmmac: dwmac-sunxi: Allow all RGMII modes
	net: usb: lan78xx: fix possible skb leak
	pkt_sched: fq: do not accept silly TCA_FQ_QUANTUM
	sch_cake: avoid possible divide by zero in cake_enqueue()
	sctp: free cmd->obj.chunk for the unprocessed SCTP_CMD_REPLY
	tcp: fix "old stuff" D-SACK causing SACK to be treated as D-SACK
	vxlan: fix tos value before xmit
	vlan: fix memory leak in vlan_dev_set_egress_priority
	vlan: vlan_changelink() should propagate errors
	mlxsw: spectrum_qdisc: Ignore grafting of invisible FIFO
	net: sch_prio: When ungrafting, replace with FIFO
	usb: dwc3: gadget: Fix request complete check
	USB: core: fix check for duplicate endpoints
	USB: serial: option: add Telit ME910G1 0x110a composition
	usb: missing parentheses in USE_NEW_SCHEME
	Linux 4.19.95

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I611f034c58f975a5d2b70eed0a0884f1ff5b09cc
2020-01-12 12:29:19 +01:00
Pengcheng Yang
5ea5b6e3c6 tcp: fix "old stuff" D-SACK causing SACK to be treated as D-SACK
[ Upstream commit c9655008e7845bcfdaac10a1ed8554ec167aea88 ]

When we receive a D-SACK, where the sequence number satisfies:
	undo_marker <= start_seq < end_seq <= prior_snd_una
we consider this is a valid D-SACK and tcp_is_sackblock_valid()
returns true, then this D-SACK is discarded as "old stuff",
but the variable first_sack_index is not marked as negative
in tcp_sacktag_write_queue().

If this D-SACK also carries a SACK that needs to be processed
(for example, the previous SACK segment was lost), this SACK
will be treated as a D-SACK in the following processing of
tcp_sacktag_write_queue(), which will eventually lead to
incorrect updates of undo_retrans and reordering.

Fixes: fd6dad616d ("[TCP]: Earlier SACK block verification & simplify access to them")
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-12 12:17:27 +01:00
Greg Kroah-Hartman
ff0e96e80f This is the 4.19.94 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl4W8A4ACgkQONu9yGCS
 aT5ZcBAAha0GMcpxm1ettNVMXUVD/Df2pntc3x2G1T+dtI89YwIilJcdQBpbDGB6
 6oNRpnopc+/ynm820SMlhjBNE8KlDzHS3Tmsn1lplru0yOqZMFcFlHSESCAA0b4E
 T21KwQ4rLZTzW4LvTf//4WpJZD1RnVrwKkbgkci9kvCjZsdh2GMK3XkBeVBUdXuX
 3gvW+9zsgmkU3Bhk5Mi8JUmOw2yY5sJt2tDmIyxOtBknAo1TK6n4kqd+NgjfsdcI
 cnNTstDU0Ikmi4UBOZGDmey0THtHdvi/oM3DUkzHtZ68W0rg/gPu4nDR+Fx3sKvo
 y5bI10j4W16PKXyxVehel+lD8XmIV/+zSerS0enGjijBPZKI9FTlGEuczk0x7sj+
 wqMh3WkkPig2bQPrCOIjkA5VW4n/ZL07cjd1nNeZ48MkvA/3k47o4vDV/lPE88ZT
 49qqaJvZ3kAdqtV1pfzpQtrG1Pp8YPcEHAMYIM/6jb6poCro5vFtuRX4tzj2fRSf
 u7jSVPDt7ED9SgHPe+RrGWVIx2/iVnr5mVdi53rjWTWfeTdNIY5bUs/wsTde1k99
 9bnAhwD4ZbFrO240MMYPWpOCr8kl0LXAeyQ104m7ONbMRnLoRp4sQCae252jyHFD
 Qxgez5cDwDQnj2W4/SdXSWytioTnyVHsI89FkWw+f/IU8AsbBuw=
 =mmeT
 -----END PGP SIGNATURE-----

Merge 4.19.94 into android-4.19

Changes in 4.19.94
	nvme_fc: add module to ops template to allow module references
	nvme-fc: fix double-free scenarios on hw queues
	drm/amdgpu: add check before enabling/disabling broadcast mode
	drm/amdgpu: add cache flush workaround to gfx8 emit_fence
	drm/amd/display: Fixed kernel panic when booting with DP-to-HDMI dongle
	iio: adc: max9611: Fix too short conversion time delay
	PM / devfreq: Fix devfreq_notifier_call returning errno
	PM / devfreq: Set scaling_max_freq to max on OPP notifier error
	PM / devfreq: Don't fail devfreq_dev_release if not in list
	afs: Fix afs_find_server lookups for ipv4 peers
	afs: Fix SELinux setting security label on /afs
	RDMA/cma: add missed unregister_pernet_subsys in init failure
	rxe: correctly calculate iCRC for unaligned payloads
	scsi: lpfc: Fix memory leak on lpfc_bsg_write_ebuf_set func
	scsi: qla2xxx: Drop superfluous INIT_WORK of del_work
	scsi: qla2xxx: Don't call qlt_async_event twice
	scsi: qla2xxx: Fix PLOGI payload and ELS IOCB dump length
	scsi: qla2xxx: Configure local loop for N2N target
	scsi: qla2xxx: Send Notify ACK after N2N PLOGI
	scsi: qla2xxx: Ignore PORT UPDATE after N2N PLOGI
	scsi: iscsi: qla4xxx: fix double free in probe
	scsi: libsas: stop discovering if oob mode is disconnected
	drm/nouveau: Move the declaration of struct nouveau_conn_atom up a bit
	usb: gadget: fix wrong endpoint desc
	net: make socket read/write_iter() honor IOCB_NOWAIT
	afs: Fix creation calls in the dynamic root to fail with EOPNOTSUPP
	md: raid1: check rdev before reference in raid1_sync_request func
	s390/cpum_sf: Adjust sampling interval to avoid hitting sample limits
	s390/cpum_sf: Avoid SBD overflow condition in irq handler
	IB/mlx4: Follow mirror sequence of device add during device removal
	IB/mlx5: Fix steering rule of drop and count
	xen-blkback: prevent premature module unload
	xen/balloon: fix ballooned page accounting without hotplug enabled
	PM / hibernate: memory_bm_find_bit(): Tighten node optimisation
	ALSA: hda/realtek - Add Bass Speaker and fixed dac for bass speaker
	ALSA: hda/realtek - Enable the bass speaker of ASUS UX431FLC
	ALSA: hda - fixup for the bass speaker on Lenovo Carbon X1 7th gen
	xfs: fix mount failure crash on invalid iclog memory access
	taskstats: fix data-race
	drm: limit to INT_MAX in create_blob ioctl
	netfilter: nft_tproxy: Fix port selector on Big Endian
	ALSA: ice1724: Fix sleep-in-atomic in Infrasonic Quartet support code
	ALSA: usb-audio: fix set_format altsetting sanity check
	ALSA: usb-audio: set the interface format after resume on Dell WD19
	ALSA: hda/realtek - Add headset Mic no shutup for ALC283
	drm/sun4i: hdmi: Remove duplicate cleanup calls
	MIPS: Avoid VDSO ABI breakage due to global register variable
	media: pulse8-cec: fix lost cec_transmit_attempt_done() call
	media: cec: CEC 2.0-only bcast messages were ignored
	media: cec: avoid decrementing transmit_queue_sz if it is 0
	media: cec: check 'transmit_in_progress', not 'transmitting'
	mm/zsmalloc.c: fix the migrated zspage statistics.
	memcg: account security cred as well to kmemcg
	mm: move_pages: return valid node id in status if the page is already on the target node
	pstore/ram: Write new dumps to start of recycled zones
	locks: print unsigned ino in /proc/locks
	dmaengine: Fix access to uninitialized dma_slave_caps
	compat_ioctl: block: handle Persistent Reservations
	compat_ioctl: block: handle BLKREPORTZONE/BLKRESETZONE
	ata: libahci_platform: Export again ahci_platform_<en/dis>able_phys()
	ata: ahci_brcm: Fix AHCI resources management
	ata: ahci_brcm: Allow optional reset controller to be used
	ata: ahci_brcm: Add missing clock management during recovery
	ata: ahci_brcm: BCM7425 AHCI requires AHCI_HFLAG_DELAY_ENGINE
	libata: Fix retrieving of active qcs
	gpiolib: fix up emulated open drain outputs
	riscv: ftrace: correct the condition logic in function graph tracer
	rseq/selftests: Fix: Namespace gettid() for compatibility with glibc 2.30
	tracing: Fix lock inversion in trace_event_enable_tgid_record()
	tracing: Avoid memory leak in process_system_preds()
	tracing: Have the histogram compare functions convert to u64 first
	tracing: Fix endianness bug in histogram trigger
	apparmor: fix aa_xattrs_match() may sleep while holding a RCU lock
	ALSA: cs4236: fix error return comparison of an unsigned integer
	ALSA: firewire-motu: Correct a typo in the clock proc string
	exit: panic before exit_mm() on global init exit
	arm64: Revert support for execute-only user mappings
	ftrace: Avoid potential division by zero in function profiler
	drm/msm: include linux/sched/task.h
	PM / devfreq: Check NULL governor in available_governors_show
	nfsd4: fix up replay_matches_cache()
	HID: i2c-hid: Reset ALPS touchpads on resume
	ACPI: sysfs: Change ACPI_MASKABLE_GPE_MAX to 0x100
	xfs: don't check for AG deadlock for realtime files in bunmapi
	platform/x86: pmc_atom: Add Siemens CONNECT X300 to critclk_systems DMI table
	Bluetooth: btusb: fix PM leak in error case of setup
	Bluetooth: delete a stray unlock
	Bluetooth: Fix memory leak in hci_connect_le_scan
	media: flexcop-usb: ensure -EIO is returned on error condition
	regulator: ab8500: Remove AB8505 USB regulator
	media: usb: fix memory leak in af9005_identify_state
	dt-bindings: clock: renesas: rcar-usb2-clock-sel: Fix typo in example
	arm64: dts: meson: odroid-c2: Disable usb_otg bus to avoid power failed warning
	tty: serial: msm_serial: Fix lockup for sysrq and oops
	fix compat handling of FICLONERANGE, FIDEDUPERANGE and FS_IOC_FIEMAP
	bdev: Factor out bdev revalidation into a common helper
	bdev: Refresh bdev size for disks without partitioning
	scsi: qedf: Do not retry ELS request if qedf_alloc_cmd fails
	drm/mst: Fix MST sideband up-reply failure handling
	powerpc/pseries/hvconsole: Fix stack overread via udbg
	selftests: rtnetlink: add addresses with fixed life time
	KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag
	rxrpc: Fix possible NULL pointer access in ICMP handling
	tcp: annotate tp->rcv_nxt lockless reads
	net: core: limit nested device depth
	ath9k_htc: Modify byte order for an error message
	ath9k_htc: Discard undersized packets
	xfs: periodically yield scrub threads to the scheduler
	net: add annotations on hh->hh_len lockless accesses
	ubifs: ubifs_tnc_start_commit: Fix OOB in layout_in_gaps
	s390/smp: fix physical to logical CPU map for SMT
	xen/blkback: Avoid unmapping unmapped grant pages
	perf/x86/intel/bts: Fix the use of page_private()
	Linux 4.19.94

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic3d1a4e10565c38d0e82448f0fb7b6fd1822aab2
2020-01-09 16:14:43 +01:00
Eric Dumazet
67f028acac tcp: annotate tp->rcv_nxt lockless reads
[ Upstream commit dba7d9b8c739df27ff3a234c81d6c6b23e3986fa ]

There are few places where we fetch tp->rcv_nxt while
this field can change from IRQ or other cpu.

We need to add READ_ONCE() annotations, and also make
sure write sides use corresponding WRITE_ONCE() to avoid
store-tearing.

Note that tcp_inq_hint() was already using READ_ONCE(tp->rcv_nxt)

syzbot reported :

BUG: KCSAN: data-race in tcp_poll / tcp_queue_rcv

write to 0xffff888120425770 of 4 bytes by interrupt on cpu 0:
 tcp_rcv_nxt_update net/ipv4/tcp_input.c:3365 [inline]
 tcp_queue_rcv+0x180/0x380 net/ipv4/tcp_input.c:4638
 tcp_rcv_established+0xbf1/0xf50 net/ipv4/tcp_input.c:5616
 tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542
 tcp_v4_rcv+0x1a03/0x1bf0 net/ipv4/tcp_ipv4.c:1923
 ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
 dst_input include/net/dst.h:442 [inline]
 ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
 netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
 napi_skb_finish net/core/dev.c:5671 [inline]
 napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
 receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061

read to 0xffff888120425770 of 4 bytes by task 7254 on cpu 1:
 tcp_stream_is_readable net/ipv4/tcp.c:480 [inline]
 tcp_poll+0x204/0x6b0 net/ipv4/tcp.c:554
 sock_poll+0xed/0x250 net/socket.c:1256
 vfs_poll include/linux/poll.h:90 [inline]
 ep_item_poll.isra.0+0x90/0x190 fs/eventpoll.c:892
 ep_send_events_proc+0x113/0x5c0 fs/eventpoll.c:1749
 ep_scan_ready_list.constprop.0+0x189/0x500 fs/eventpoll.c:704
 ep_send_events fs/eventpoll.c:1793 [inline]
 ep_poll+0xe3/0x900 fs/eventpoll.c:1930
 do_epoll_wait+0x162/0x180 fs/eventpoll.c:2294
 __do_sys_epoll_pwait fs/eventpoll.c:2325 [inline]
 __se_sys_epoll_pwait fs/eventpoll.c:2311 [inline]
 __x64_sys_epoll_pwait+0xcd/0x170 fs/eventpoll.c:2311
 do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 7254 Comm: syz-fuzzer Not tainted 5.3.0+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-09 10:19:08 +01:00
Greg Kroah-Hartman
287ec341d6 This is the 4.19.93 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl4Q1jAACgkQONu9yGCS
 aT7vqg/9FEBVO/NARJYQ/R7Z6L4fQUNgHmFI0y9iaTP2nlHuVuBvMHJdF7BmidHF
 9iwe/lctPobgoknUoA3nmt8WmPmCaKbFhABsS03sz1Q5Z+IC1g218s4SUppER3fB
 YlgqRDjKY0wwk2MPAOgIPaRQCNSiaVZFo+bH1Mxrj77m8D7NHKXiZPlrbDunVlEB
 NA0DWOyb04JehRoRNbKTHzLBs/VfZ0LhxEO5sS17M2hhOauYAKAFmSzdMPJwv4ka
 qiCR+4zWYR5LF64mG5jxmerhUjIOrhRUc+334//WH4jCuo9xjKrCmxLIjqR7wwHC
 dK4Apu128Ujl4boHxLrFKIG3f2K19gZz6h+sWrcxjTzZ/YWPYjPI4atuWrZEJIG5
 nhhcz4fZfLAxMNm51kM9i4WAcP2k+CX1ynD0AuzXIZXs+t+xOoaUtYeFHc/tpmig
 P/AA4eAYjojQHPUwNeR+8GmjOGPfwSuTNkd6PqAaaI1cvGtHK0y5M38FNrut+I1k
 pvYvWOvtvWOsR6YaJviU2HF7uNFX0saNqJ4Ahmm/nxdlxOKRcKDIzDI7ibwcwEOQ
 E20SZdPQG/oiaXq0itSstpDuYJ9hKr5YehPS7uAXvy0RT/H7J5cpSZuCUK74J4Zr
 rC2D5M99rW9aztpfEQxU6CTluIGLZ+eBp2pKTU420jkySxmOo6o=
 =qgtu
 -----END PGP SIGNATURE-----

Merge 4.19.93 into android-4.19

Changes in 4.19.93
	scsi: lpfc: Fix discovery failures when target device connectivity bounces
	scsi: mpt3sas: Fix clear pending bit in ioctl status
	scsi: lpfc: Fix locking on mailbox command completion
	Input: atmel_mxt_ts - disable IRQ across suspend
	f2fs: fix to update time in lazytime mode
	iommu: rockchip: Free domain on .domain_free
	iommu/tegra-smmu: Fix page tables in > 4 GiB memory
	dmaengine: xilinx_dma: Clear desc_pendingcount in xilinx_dma_reset
	scsi: target: compare full CHAP_A Algorithm strings
	scsi: lpfc: Fix SLI3 hba in loop mode not discovering devices
	scsi: csiostor: Don't enable IRQs too early
	scsi: hisi_sas: Replace in_softirq() check in hisi_sas_task_exec()
	powerpc/pseries: Mark accumulate_stolen_time() as notrace
	powerpc/pseries: Don't fail hash page table insert for bolted mapping
	powerpc/tools: Don't quote $objdump in scripts
	dma-debug: add a schedule point in debug_dma_dump_mappings()
	leds: lm3692x: Handle failure to probe the regulator
	clocksource/drivers/asm9260: Add a check for of_clk_get
	clocksource/drivers/timer-of: Use unique device name instead of timer
	powerpc/security/book3s64: Report L1TF status in sysfs
	powerpc/book3s64/hash: Add cond_resched to avoid soft lockup warning
	ext4: update direct I/O read lock pattern for IOCB_NOWAIT
	ext4: iomap that extends beyond EOF should be marked dirty
	jbd2: Fix statistics for the number of logged blocks
	scsi: tracing: Fix handling of TRANSFER LENGTH == 0 for READ(6) and WRITE(6)
	scsi: lpfc: Fix duplicate unreg_rpi error in port offline flow
	f2fs: fix to update dir's i_pino during cross_rename
	clk: qcom: Allow constant ratio freq tables for rcg
	clk: clk-gpio: propagate rate change to parent
	irqchip/irq-bcm7038-l1: Enable parent IRQ if necessary
	irqchip: ingenic: Error out if IRQ domain creation failed
	fs/quota: handle overflows of sysctl fs.quota.* and report as unsigned long
	scsi: lpfc: fix: Coverity: lpfc_cmpl_els_rsp(): Null pointer dereferences
	PCI: rpaphp: Fix up pointer to first drc-info entry
	scsi: ufs: fix potential bug which ends in system hang
	powerpc/pseries/cmm: Implement release() function for sysfs device
	PCI: rpaphp: Don't rely on firmware feature to imply drc-info support
	PCI: rpaphp: Annotate and correctly byte swap DRC properties
	PCI: rpaphp: Correctly match ibm, my-drc-index to drc-name when using drc-info
	powerpc/security: Fix wrong message when RFI Flush is disable
	scsi: atari_scsi: sun3_scsi: Set sg_tablesize to 1 instead of SG_NONE
	clk: pxa: fix one of the pxa RTC clocks
	bcache: at least try to shrink 1 node in bch_mca_scan()
	HID: quirks: Add quirk for HP MSU1465 PIXART OEM mouse
	HID: logitech-hidpp: Silence intermittent get_battery_capacity errors
	ARM: 8937/1: spectre-v2: remove Brahma-B53 from hardening
	libnvdimm/btt: fix variable 'rc' set but not used
	HID: Improve Windows Precision Touchpad detection.
	HID: rmi: Check that the RMI_STARTED bit is set before unregistering the RMI transport device
	watchdog: Fix the race between the release of watchdog_core_data and cdev
	scsi: pm80xx: Fix for SATA device discovery
	scsi: ufs: Fix error handing during hibern8 enter
	scsi: scsi_debug: num_tgts must be >= 0
	scsi: NCR5380: Add disconnect_mask module parameter
	scsi: iscsi: Don't send data to unbound connection
	scsi: target: iscsi: Wait for all commands to finish before freeing a session
	gpio: mpc8xxx: Don't overwrite default irq_set_type callback
	apparmor: fix unsigned len comparison with less than zero
	scripts/kallsyms: fix definitely-lost memory leak
	powerpc: Don't add -mabi= flags when building with Clang
	cdrom: respect device capabilities during opening action
	perf script: Fix brstackinsn for AUXTRACE
	perf regs: Make perf_reg_name() return "unknown" instead of NULL
	s390/zcrypt: handle new reply code FILTERED_BY_HYPERVISOR
	libfdt: define INT32_MAX and UINT32_MAX in libfdt_env.h
	s390/cpum_sf: Check for SDBT and SDB consistency
	ocfs2: fix passing zero to 'PTR_ERR' warning
	mailbox: imx: Fix Tx doorbell shutdown path
	kernel: sysctl: make drop_caches write-only
	userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK
	Revert "powerpc/vcpu: Assume dedicated processors as non-preempt"
	x86/mce: Fix possibly incorrect severity calculation on AMD
	net, sysctl: Fix compiler warning when only cBPF is present
	netfilter: nf_queue: enqueue skbs with NULL dst
	ALSA: hda - Downgrade error message for single-cmd fallback
	bonding: fix active-backup transition after link failure
	perf strbuf: Remove redundant va_end() in strbuf_addv()
	Make filldir[64]() verify the directory entry filename is valid
	filldir[64]: remove WARN_ON_ONCE() for bad directory entries
	netfilter: ebtables: compat: reject all padding in matches/watchers
	6pack,mkiss: fix possible deadlock
	netfilter: bridge: make sure to pull arp header in br_nf_forward_arp()
	inetpeer: fix data-race in inet_putpeer / inet_putpeer
	net: add a READ_ONCE() in skb_peek_tail()
	net: icmp: fix data-race in cmp_global_allow()
	hrtimer: Annotate lockless access to timer->state
	net: ena: fix napi handler misbehavior when the napi budget is zero
	net/mlxfw: Fix out-of-memory error in mfa2 flash burning
	net: stmmac: dwmac-meson8b: Fix the RGMII TX delay on Meson8b/8m2 SoCs
	ptp: fix the race between the release of ptp_clock and cdev
	tcp: Fix highest_sack and highest_sack_seq
	udp: fix integer overflow while computing available space in sk_rcvbuf
	vhost/vsock: accept only packets with the right dst_cid
	net: add bool confirm_neigh parameter for dst_ops.update_pmtu
	ip6_gre: do not confirm neighbor when do pmtu update
	gtp: do not confirm neighbor when do pmtu update
	net/dst: add new function skb_dst_update_pmtu_no_confirm
	tunnel: do not confirm neighbor when do pmtu update
	vti: do not confirm neighbor when do pmtu update
	sit: do not confirm neighbor when do pmtu update
	net/dst: do not confirm neighbor for vxlan and geneve pmtu update
	gtp: do not allow adding duplicate tid and ms_addr pdp context
	net: marvell: mvpp2: phylink requires the link interrupt
	tcp/dccp: fix possible race __inet_lookup_established()
	tcp: do not send empty skb from tcp_write_xmit()
	gtp: fix wrong condition in gtp_genl_dump_pdp()
	gtp: fix an use-after-free in ipv4_pdp_find()
	gtp: avoid zero size hashtable
	spi: fsl: don't map irq during probe
	tty/serial: atmel: fix out of range clock divider handling
	pinctrl: baytrail: Really serialize all register accesses
	spi: fsl: use platform_get_irq() instead of of_irq_to_resource()
	Linux 4.19.93

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie31b3fba19c5a45be0b85f272bc50cb8b67ea3c0
2020-01-04 19:29:03 +01:00
Eric Dumazet
d3f384f579 tcp: do not send empty skb from tcp_write_xmit()
[ Upstream commit 1f85e6267caca44b30c54711652b0726fadbb131 ]

Backport of commit fdfc5c8594c2 ("tcp: remove empty skb from
write queue in error cases") in linux-4.14 stable triggered
various bugs. One of them has been fixed in commit ba2ddb43f270
("tcp: Don't dequeue SYN/FIN-segments from write-queue"), but
we still have crashes in some occasions.

Root-cause is that when tcp_sendmsg() has allocated a fresh
skb and could not append a fragment before being blocked
in sk_stream_wait_memory(), tcp_write_xmit() might be called
and decide to send this fresh and empty skb.

Sending an empty packet is not only silly, it might have caused
many issues we had in the past with tp->packets_out being
out of sync.

Fixes: c65f7f00c5 ("[TCP]: Simplify SKB data portion allocation with NETIF_F_SG.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Christoph Paasch <cpaasch@apple.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Jason Baron <jbaron@akamai.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-04 19:13:42 +01:00
Eric Dumazet
28f0d54dbe tcp/dccp: fix possible race __inet_lookup_established()
commit 8dbd76e79a16b45b2ccb01d2f2e08dbf64e71e40 upstream.

Michal Kubecek and Firo Yang did a very nice analysis of crashes
happening in __inet_lookup_established().

Since a TCP socket can go from TCP_ESTABLISH to TCP_LISTEN
(via a close()/socket()/listen() cycle) without a RCU grace period,
I should not have changed listeners linkage in their hash table.

They must use the nulls protocol (Documentation/RCU/rculist_nulls.txt),
so that a lookup can detect a socket in a hash list was moved in
another one.

Since we added code in commit d296ba60d8 ("soreuseport: Resolve
merge conflict for v4/v6 ordering fix"), we have to add
hlist_nulls_add_tail_rcu() helper.

Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Reported-by: Firo Yang <firo.yang@suse.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Link: https://lore.kernel.org/netdev/20191120083919.GH27852@unicorn.suse.cz/
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
[stable-4.19: we also need to update code in __inet_lookup_listener() and
 inet6_lookup_listener() which has been removed in 5.0-rc1.]
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-04 19:13:41 +01:00
Hangbin Liu
47353db569 vti: do not confirm neighbor when do pmtu update
[ Upstream commit 8247a79efa2f28b44329f363272550c1738377de ]

When do IPv6 tunnel PMTU update and calls __ip6_rt_update_pmtu() in the end,
we should not call dst_confirm_neigh() as there is no two-way communication.

Although vti and vti6 are immune to this problem because they are IFF_NOARP
interfaces, as Guillaume pointed. There is still no sense to confirm neighbour
here.

v5: Update commit description.
v4: No change.
v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-04 19:13:39 +01:00
Hangbin Liu
2c06278348 tunnel: do not confirm neighbor when do pmtu update
[ Upstream commit 7a1592bcb15d71400a98632727791d1e68ea0ee8 ]

When do tunnel PMTU update and calls __ip6_rt_update_pmtu() in the end,
we should not call dst_confirm_neigh() as there is no two-way communication.

v5: No Change.
v4: Update commit description
v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

Fixes: 0dec879f63 ("net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TP")
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Tested-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-04 19:13:38 +01:00
Hangbin Liu
8bf95f28be net: add bool confirm_neigh parameter for dst_ops.update_pmtu
[ Upstream commit bd085ef678b2cc8c38c105673dfe8ff8f5ec0c57 ]

The MTU update code is supposed to be invoked in response to real
networking events that update the PMTU. In IPv6 PMTU update function
__ip6_rt_update_pmtu() we called dst_confirm_neigh() to update neighbor
confirmed time.

But for tunnel code, it will call pmtu before xmit, like:
  - tnl_update_pmtu()
    - skb_dst_update_pmtu()
      - ip6_rt_update_pmtu()
        - __ip6_rt_update_pmtu()
          - dst_confirm_neigh()

If the tunnel remote dst mac address changed and we still do the neigh
confirm, we will not be able to update neigh cache and ping6 remote
will failed.

So for this ip_tunnel_xmit() case, _EVEN_ if the MTU is changed, we
should not be invoking dst_confirm_neigh() as we have no evidence
of successful two-way communication at this point.

On the other hand it is also important to keep the neigh reachability fresh
for TCP flows, so we cannot remove this dst_confirm_neigh() call.

To fix the issue, we have to add a new bool parameter for dst_ops.update_pmtu
to choose whether we should do neigh update or not. I will add the parameter
in this patch and set all the callers to true to comply with the previous
way, and fix the tunnel code one by one on later patches.

v5: No change.
v4: No change.
v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

Suggested-by: David Miller <davem@davemloft.net>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-04 19:13:37 +01:00
Antonio Messina
4840b6a748 udp: fix integer overflow while computing available space in sk_rcvbuf
[ Upstream commit feed8a4fc9d46c3126fb9fcae0e9248270c6321a ]

When the size of the receive buffer for a socket is close to 2^31 when
computing if we have enough space in the buffer to copy a packet from
the queue to the buffer we might hit an integer overflow.

When an user set net.core.rmem_default to a value close to 2^31 UDP
packets are dropped because of this overflow. This can be visible, for
instance, with failure to resolve hostnames.

This can be fixed by casting sk_rcvbuf (which is an int) to unsigned
int, similarly to how it is done in TCP.

Signed-off-by: Antonio Messina <amessina@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-04 19:13:36 +01:00
Cambda Zhu
6ed6ab5c5d tcp: Fix highest_sack and highest_sack_seq
[ Upstream commit 853697504de043ff0bfd815bd3a64de1dce73dc7 ]

>From commit 50895b9de1 ("tcp: highest_sack fix"), the logic about
setting tp->highest_sack to the head of the send queue was removed.
Of course the logic is error prone, but it is logical. Before we
remove the pointer to the highest sack skb and use the seq instead,
we need to set tp->highest_sack to NULL when there is no skb after
the last sack, and then replace NULL with the real skb when new skb
inserted into the rtx queue, because the NULL means the highest sack
seq is tp->snd_nxt. If tp->highest_sack is NULL and new data sent,
the next ACK with sack option will increase tp->reordering unexpectedly.

This patch sets tp->highest_sack to the tail of the rtx queue if
it's NULL and new data is sent. The patch keeps the rule that the
highest_sack can only be maintained by sack processing, except for
this only case.

Fixes: 50895b9de1 ("tcp: highest_sack fix")
Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-04 19:13:36 +01:00
Eric Dumazet
0c78a0e690 net: icmp: fix data-race in cmp_global_allow()
commit bbab7ef235031f6733b5429ae7877bfa22339712 upstream.

This code reads two global variables without protection
of a lock. We need READ_ONCE()/WRITE_ONCE() pairs to
avoid load/store-tearing and better document the intent.

KCSAN reported :
BUG: KCSAN: data-race in icmp_global_allow / icmp_global_allow

read to 0xffffffff861a8014 of 4 bytes by task 11201 on cpu 0:
 icmp_global_allow+0x36/0x1b0 net/ipv4/icmp.c:254
 icmpv6_global_allow net/ipv6/icmp.c:184 [inline]
 icmpv6_global_allow net/ipv6/icmp.c:179 [inline]
 icmp6_send+0x493/0x1140 net/ipv6/icmp.c:514
 icmpv6_send+0x71/0xb0 net/ipv6/ip6_icmp.c:43
 ip6_link_failure+0x43/0x180 net/ipv6/route.c:2640
 dst_link_failure include/net/dst.h:419 [inline]
 vti_xmit net/ipv4/ip_vti.c:243 [inline]
 vti_tunnel_xmit+0x27f/0xa50 net/ipv4/ip_vti.c:279
 __netdev_start_xmit include/linux/netdevice.h:4420 [inline]
 netdev_start_xmit include/linux/netdevice.h:4434 [inline]
 xmit_one net/core/dev.c:3280 [inline]
 dev_hard_start_xmit+0xef/0x430 net/core/dev.c:3296
 __dev_queue_xmit+0x14c9/0x1b60 net/core/dev.c:3873
 dev_queue_xmit+0x21/0x30 net/core/dev.c:3906
 neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
 neigh_output include/net/neighbour.h:511 [inline]
 ip6_finish_output2+0x7a6/0xec0 net/ipv6/ip6_output.c:116
 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
 __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
 dst_output include/net/dst.h:436 [inline]
 ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179

write to 0xffffffff861a8014 of 4 bytes by task 11183 on cpu 1:
 icmp_global_allow+0x174/0x1b0 net/ipv4/icmp.c:272
 icmpv6_global_allow net/ipv6/icmp.c:184 [inline]
 icmpv6_global_allow net/ipv6/icmp.c:179 [inline]
 icmp6_send+0x493/0x1140 net/ipv6/icmp.c:514
 icmpv6_send+0x71/0xb0 net/ipv6/ip6_icmp.c:43
 ip6_link_failure+0x43/0x180 net/ipv6/route.c:2640
 dst_link_failure include/net/dst.h:419 [inline]
 vti_xmit net/ipv4/ip_vti.c:243 [inline]
 vti_tunnel_xmit+0x27f/0xa50 net/ipv4/ip_vti.c:279
 __netdev_start_xmit include/linux/netdevice.h:4420 [inline]
 netdev_start_xmit include/linux/netdevice.h:4434 [inline]
 xmit_one net/core/dev.c:3280 [inline]
 dev_hard_start_xmit+0xef/0x430 net/core/dev.c:3296
 __dev_queue_xmit+0x14c9/0x1b60 net/core/dev.c:3873
 dev_queue_xmit+0x21/0x30 net/core/dev.c:3906
 neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
 neigh_output include/net/neighbour.h:511 [inline]
 ip6_finish_output2+0x7a6/0xec0 net/ipv6/ip6_output.c:116
 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
 __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 11183 Comm: syz-executor.2 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: 4cdf507d54 ("icmp: add a global rate limitation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-04 19:13:30 +01:00
Eric Dumazet
610f01f1ab inetpeer: fix data-race in inet_putpeer / inet_putpeer
commit 71685eb4ce80ae9c49eff82ca4dd15acab215de9 upstream.

We need to explicitely forbid read/store tearing in inet_peer_gc()
and inet_putpeer().

The following syzbot report reminds us about inet_putpeer()
running without a lock held.

BUG: KCSAN: data-race in inet_putpeer / inet_putpeer

write to 0xffff888121fb2ed0 of 4 bytes by interrupt on cpu 0:
 inet_putpeer+0x37/0xa0 net/ipv4/inetpeer.c:240
 ip4_frag_free+0x3d/0x50 net/ipv4/ip_fragment.c:102
 inet_frag_destroy_rcu+0x58/0x80 net/ipv4/inet_fragment.c:228
 __rcu_reclaim kernel/rcu/rcu.h:222 [inline]
 rcu_do_batch+0x256/0x5b0 kernel/rcu/tree.c:2157
 rcu_core+0x369/0x4d0 kernel/rcu/tree.c:2377
 rcu_core_si+0x12/0x20 kernel/rcu/tree.c:2386
 __do_softirq+0x115/0x33f kernel/softirq.c:292
 invoke_softirq kernel/softirq.c:373 [inline]
 irq_exit+0xbb/0xe0 kernel/softirq.c:413
 exiting_irq arch/x86/include/asm/apic.h:536 [inline]
 smp_apic_timer_interrupt+0xe6/0x280 arch/x86/kernel/apic/apic.c:1137
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
 native_safe_halt+0xe/0x10 arch/x86/kernel/paravirt.c:71
 arch_cpu_idle+0x1f/0x30 arch/x86/kernel/process.c:571
 default_idle_call+0x1e/0x40 kernel/sched/idle.c:94
 cpuidle_idle_call kernel/sched/idle.c:154 [inline]
 do_idle+0x1af/0x280 kernel/sched/idle.c:263

write to 0xffff888121fb2ed0 of 4 bytes by interrupt on cpu 1:
 inet_putpeer+0x37/0xa0 net/ipv4/inetpeer.c:240
 ip4_frag_free+0x3d/0x50 net/ipv4/ip_fragment.c:102
 inet_frag_destroy_rcu+0x58/0x80 net/ipv4/inet_fragment.c:228
 __rcu_reclaim kernel/rcu/rcu.h:222 [inline]
 rcu_do_batch+0x256/0x5b0 kernel/rcu/tree.c:2157
 rcu_core+0x369/0x4d0 kernel/rcu/tree.c:2377
 rcu_core_si+0x12/0x20 kernel/rcu/tree.c:2386
 __do_softirq+0x115/0x33f kernel/softirq.c:292
 run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
 smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165
 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: 4b9d9be839 ("inetpeer: remove unused list")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-04 19:13:29 +01:00
Greg Kroah-Hartman
00932fc1fb This is the 4.19.91 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl397JkACgkQONu9yGCS
 aT7ZYhAAt1ooWxtWXrfevz7BJORUOerH0mPGnp9fZHaaxLad5vBcyG1nL1aJHbUF
 zfaWNOIo8btbX3+MLgl7ZPIb20zsG0YaUbburV0RNVS9Y65PETG7F9l9haAxT/f9
 BNw21+WHtClbt1OoFP8VD+4TFAsevXod4fvkFSBsWqU7EDQIXlYfYhThAe3v5gsX
 Ae5i27pAQdTCuEJg5a6Ir2g7mWtD7s2PRdjYL3Y2r2mmMDyUCJO+6BHnfrslD/oC
 b321fRlBH51zyg4L7T75K/DKNs7FosBr2Jo2V/X7Y/Oe+sYIggl67S+CSdIePJ+Y
 X3KVtWnvye4xlQoXBkDrEif+ULlyX8Bn+oIFHpD90WDB9buJA9VcfrOhOfDobeuM
 4QfQKUABKbJ9siNOKm6mWTH/UclBO0Ki13Pi1LQlG+xSTnY3WvbDzxb/FMSOhNwl
 bgN27g+wbjd5EV6IG2F1397FvcPvAZSSw2Gov7irHq9JdSIdvCuoygmQGTW8FOkG
 IS2F/iFNzM63YjeJwLr8cVpvE9lBfqnRcKGsCLeAX6tRULX/DcuDZQmOKY260u6j
 A4QT+JBLBTGXOXbq0Sd8NgoX5Axa1uXOdXsOsBVIYqrDcpz9Uwtg4FvWY+vCurGk
 zItKubQWjkB+iEf1NSfLsRMQenRcDz/g4Q4q/I0TJXabj8k5oaA=
 =eYwk
 -----END PGP SIGNATURE-----

Merge 4.19.91 into android-4.19

Changes in 4.19.91
	inet: protect against too small mtu values.
	mqprio: Fix out-of-bounds access in mqprio_dump
	net: bridge: deny dev_set_mac_address() when unregistering
	net: dsa: fix flow dissection on Tx path
	net: ethernet: ti: cpsw: fix extra rx interrupt
	net: sched: fix dump qlen for sch_mq/sch_mqprio with NOLOCK subqueues
	net: thunderx: start phy before starting autonegotiation
	openvswitch: support asymmetric conntrack
	tcp: md5: fix potential overestimation of TCP option space
	tipc: fix ordering of tipc module init and exit routine
	net/mlx5e: Query global pause state before setting prio2buffer
	tcp: fix rejected syncookies due to stale timestamps
	tcp: tighten acceptance of ACKs not matching a child socket
	tcp: Protect accesses to .ts_recent_stamp with {READ,WRITE}_ONCE()
	Revert "arm64: preempt: Fix big-endian when checking preempt count in assembly"
	mmc: block: Make card_busy_detect() a bit more generic
	mmc: block: Add CMD13 polling for MMC IOCTLS with R1B response
	PCI/PM: Always return devices to D0 when thawing
	PCI: pciehp: Avoid returning prematurely from sysfs requests
	PCI: Fix Intel ACS quirk UPDCR register address
	PCI/MSI: Fix incorrect MSI-X masking on resume
	PCI: Apply Cavium ACS quirk to ThunderX2 and ThunderX3
	xtensa: fix TLB sanity checker
	rpmsg: glink: Set tail pointer to 0 at end of FIFO
	rpmsg: glink: Fix reuse intents memory leak issue
	rpmsg: glink: Fix use after free in open_ack TIMEOUT case
	rpmsg: glink: Put an extra reference during cleanup
	rpmsg: glink: Fix rpmsg_register_device err handling
	rpmsg: glink: Don't send pending rx_done during remove
	rpmsg: glink: Free pending deferred work on remove
	cifs: smbd: Return -EAGAIN when transport is reconnecting
	cifs: smbd: Add messages on RDMA session destroy and reconnection
	cifs: smbd: Return -EINVAL when the number of iovs exceeds SMBDIRECT_MAX_SGE
	cifs: Don't display RDMA transport on reconnect
	CIFS: Respect O_SYNC and O_DIRECT flags during reconnect
	CIFS: Close open handle after interrupted close
	ARM: dts: s3c64xx: Fix init order of clock providers
	ARM: tegra: Fix FLOW_CTLR_HALT register clobbering by tegra_resume()
	vfio/pci: call irq_bypass_unregister_producer() before freeing irq
	dma-buf: Fix memory leak in sync_file_merge()
	drm: meson: venc: cvbs: fix CVBS mode matching
	dm mpath: remove harmful bio-based optimization
	dm btree: increase rebalance threshold in __rebalance2()
	scsi: iscsi: Fix a potential deadlock in the timeout handler
	scsi: qla2xxx: Change discovery state before PLOGI
	drm/radeon: fix r1xx/r2xx register checker for POT textures
	xhci: fix USB3 device initiated resume race with roothub autosuspend
	Linux 4.19.91

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I974d1578d54f93e1c442e09685ddc2fdf373c441
2019-12-21 11:20:11 +01:00
Eric Dumazet
a148815a32 tcp: md5: fix potential overestimation of TCP option space
[ Upstream commit 9424e2e7ad93ffffa88f882c9bc5023570904b55 ]

Back in 2008, Adam Langley fixed the corner case of packets for flows
having all of the following options : MD5 TS SACK

Since MD5 needs 20 bytes, and TS needs 12 bytes, no sack block
can be cooked from the remaining 8 bytes.

tcp_established_options() correctly sets opts->num_sack_blocks
to zero, but returns 36 instead of 32.

This means TCP cooks packets with 4 extra bytes at the end
of options, containing unitialized bytes.

Fixes: 33ad798c92 ("tcp: options clean up")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-21 10:57:15 +01:00
Eric Dumazet
d80d67cdaf inet: protect against too small mtu values.
[ Upstream commit 501a90c945103e8627406763dac418f20f3837b2 ]

syzbot was once again able to crash a host by setting a very small mtu
on loopback device.

Let's make inetdev_valid_mtu() available in include/net/ip.h,
and use it in ip_setup_cork(), so that we protect both ip_append_page()
and __ip_append_data()

Also add a READ_ONCE() when the device mtu is read.

Pairs this lockless read with one WRITE_ONCE() in __dev_set_mtu(),
even if other code paths might write over this field.

Add a big comment in include/linux/netdevice.h about dev->mtu
needing READ_ONCE()/WRITE_ONCE() annotations.

Hopefully we will add the missing ones in followup patches.

[1]

refcount_t: saturated; leaking memory.
WARNING: CPU: 0 PID: 9464 at lib/refcount.c:22 refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 9464 Comm: syz-executor850 Not tainted 5.4.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x197/0x210 lib/dump_stack.c:118
 panic+0x2e3/0x75c kernel/panic.c:221
 __warn.cold+0x2f/0x3e kernel/panic.c:582
 report_bug+0x289/0x300 lib/bug.c:195
 fixup_bug arch/x86/kernel/traps.c:174 [inline]
 fixup_bug arch/x86/kernel/traps.c:169 [inline]
 do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267
 do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286
 invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
RIP: 0010:refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
Code: 06 31 ff 89 de e8 c8 f5 e6 fd 84 db 0f 85 6f ff ff ff e8 7b f4 e6 fd 48 c7 c7 e0 71 4f 88 c6 05 56 a6 a4 06 01 e8 c7 a8 b7 fd <0f> 0b e9 50 ff ff ff e8 5c f4 e6 fd 0f b6 1d 3d a6 a4 06 31 ff 89
RSP: 0018:ffff88809689f550 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff815e4336 RDI: ffffed1012d13e9c
RBP: ffff88809689f560 R08: ffff88809c50a3c0 R09: fffffbfff15d31b1
R10: fffffbfff15d31b0 R11: ffffffff8ae98d87 R12: 0000000000000001
R13: 0000000000040100 R14: ffff888099041104 R15: ffff888218d96e40
 refcount_add include/linux/refcount.h:193 [inline]
 skb_set_owner_w+0x2b6/0x410 net/core/sock.c:1999
 sock_wmalloc+0xf1/0x120 net/core/sock.c:2096
 ip_append_page+0x7ef/0x1190 net/ipv4/ip_output.c:1383
 udp_sendpage+0x1c7/0x480 net/ipv4/udp.c:1276
 inet_sendpage+0xdb/0x150 net/ipv4/af_inet.c:821
 kernel_sendpage+0x92/0xf0 net/socket.c:3794
 sock_sendpage+0x8b/0xc0 net/socket.c:936
 pipe_to_sendpage+0x2da/0x3c0 fs/splice.c:458
 splice_from_pipe_feed fs/splice.c:512 [inline]
 __splice_from_pipe+0x3ee/0x7c0 fs/splice.c:636
 splice_from_pipe+0x108/0x170 fs/splice.c:671
 generic_splice_sendpage+0x3c/0x50 fs/splice.c:842
 do_splice_from fs/splice.c:861 [inline]
 direct_splice_actor+0x123/0x190 fs/splice.c:1035
 splice_direct_to_actor+0x3b4/0xa30 fs/splice.c:990
 do_splice_direct+0x1da/0x2a0 fs/splice.c:1078
 do_sendfile+0x597/0xd00 fs/read_write.c:1464
 __do_sys_sendfile64 fs/read_write.c:1525 [inline]
 __se_sys_sendfile64 fs/read_write.c:1511 [inline]
 __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511
 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x441409
Code: e8 ac e8 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fffb64c4f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441409
RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000005
RBP: 0000000000073b8a R08: 0000000000000010 R09: 0000000000000010
R10: 0000000000010001 R11: 0000000000000246 R12: 0000000000402180
R13: 0000000000402210 R14: 0000000000000000 R15: 0000000000000000
Kernel Offset: disabled
Rebooting in 86400 seconds..

Fixes: 1470ddf7f8 ("inet: Remove explicit write references to sk/inet in ip_append_data")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-21 10:57:08 +01:00
Greg Kroah-Hartman
d902dae13d This is the 4.19.90 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl35M30ACgkQONu9yGCS
 aT7MxQ/+P2k2knFpbzGfqn7Ug4fyrWJ8T0cvmcQYLxcddJdM+tQWuFfXR6rhg2U6
 cCEkIAKVxihEA51PT6LYiynIMQ1UDAEYENfwYK4inVX2HbMsqDC4D0qnAkABzH27
 sLXwhKOOGB/z1F7oKjjsX/cCwP3V2E0PL1P7owHZis6tB24pZrMEss24x/4+dDm9
 zBDDxpR++mJypRvG3fA8oP5dhZZJNacIvLW+48wrxZWkIcVNnRV+QnyHZe68af1R
 SH4+I12AAeEDyEsQI8yX8PmGAnj1RZrzRQhibxooyBH4642RbX2qCYJkutPjI5rG
 pUl4970MdSHYMyEUwxh77b0jSO/9w7k02yatyp0DVA0PQ7p0lLBFZ96GEG9ytXJm
 Csuc6HEXSSTvuX8pf/KAf18L6kgnUhlxywkDcrcAVLQofMDhODul3fJALmGSVJXW
 jbp6AFoqT84I8Gm+je+vyuQciLnuH5C9wwxrOrWZzr+hLzZk60iG+OpRohn/g+Bx
 PjDjvnump0JGjF89hfNc+v9F+ihz7GBwOxspGrgb27ViRIhcxf0GuYFxyJtEuDiW
 6+gYNzWUaVC4RR1l1jXGWtGUPBsNV50sxFHK/Hx09UMIu/uJPMtF+TW9QDhJT1jr
 kL1kKeCsRV54nWjiWKwTTI2I37xJCPuidW5hvLqf2+ZHYTfQzsE=
 =Op5F
 -----END PGP SIGNATURE-----

Merge 4.19.90 into android-4.19

Changes in 4.19.90
	usb: gadget: configfs: Fix missing spin_lock_init()
	usb: gadget: pch_udc: fix use after free
	scsi: qla2xxx: Fix driver unload hang
	media: venus: remove invalid compat_ioctl32 handler
	USB: uas: honor flag to avoid CAPACITY16
	USB: uas: heed CAPACITY_HEURISTICS
	USB: documentation: flags on usb-storage versus UAS
	usb: Allow USB device to be warm reset in suspended state
	staging: rtl8188eu: fix interface sanity check
	staging: rtl8712: fix interface sanity check
	staging: gigaset: fix general protection fault on probe
	staging: gigaset: fix illegal free on probe errors
	staging: gigaset: add endpoint-type sanity check
	usb: xhci: only set D3hot for pci device
	xhci: Fix memory leak in xhci_add_in_port()
	xhci: Increase STS_HALT timeout in xhci_suspend()
	xhci: handle some XHCI_TRUST_TX_LENGTH quirks cases as default behaviour.
	ARM: dts: pandora-common: define wl1251 as child node of mmc3
	iio: adis16480: Add debugfs_reg_access entry
	iio: humidity: hdc100x: fix IIO_HUMIDITYRELATIVE channel reporting
	iio: imu: inv_mpu6050: fix temperature reporting using bad unit
	USB: atm: ueagle-atm: add missing endpoint check
	USB: idmouse: fix interface sanity checks
	USB: serial: io_edgeport: fix epic endpoint lookup
	usb: roles: fix a potential use after free
	USB: adutux: fix interface sanity check
	usb: core: urb: fix URB structure initialization function
	usb: mon: Fix a deadlock in usbmon between mmap and read
	tpm: add check after commands attribs tab allocation
	mtd: spear_smi: Fix Write Burst mode
	virtio-balloon: fix managed page counts when migrating pages between zones
	usb: dwc3: pci: add ID for the Intel Comet Lake -H variant
	usb: dwc3: gadget: Fix logical condition
	usb: dwc3: ep0: Clear started flag on completion
	phy: renesas: rcar-gen3-usb2: Fix sysfs interface of "role"
	btrfs: check page->mapping when loading free space cache
	btrfs: use refcount_inc_not_zero in kill_all_nodes
	Btrfs: fix metadata space leak on fixup worker failure to set range as delalloc
	Btrfs: fix negative subv_writers counter and data space leak after buffered write
	btrfs: Avoid getting stuck during cyclic writebacks
	btrfs: Remove btrfs_bio::flags member
	Btrfs: send, skip backreference walking for extents with many references
	btrfs: record all roots for rename exchange on a subvol
	rtlwifi: rtl8192de: Fix missing code to retrieve RX buffer address
	rtlwifi: rtl8192de: Fix missing callback that tests for hw release of buffer
	rtlwifi: rtl8192de: Fix missing enable interrupt flag
	lib: raid6: fix awk build warnings
	ovl: fix corner case of non-unique st_dev;st_ino
	ovl: relax WARN_ON() on rename to self
	hwrng: omap - Fix RNG wait loop timeout
	dm writecache: handle REQ_FUA
	dm zoned: reduce overhead of backing device checks
	workqueue: Fix spurious sanity check failures in destroy_workqueue()
	workqueue: Fix pwq ref leak in rescuer_thread()
	ASoC: rt5645: Fixed buddy jack support.
	ASoC: rt5645: Fixed typo for buddy jack support.
	ASoC: Jack: Fix NULL pointer dereference in snd_soc_jack_report
	md: improve handling of bio with REQ_PREFLUSH in md_flush_request()
	blk-mq: avoid sysfs buffer overflow with too many CPU cores
	cgroup: pids: use atomic64_t for pids->limit
	ar5523: check NULL before memcpy() in ar5523_cmd()
	s390/mm: properly clear _PAGE_NOEXEC bit when it is not supported
	media: bdisp: fix memleak on release
	media: radio: wl1273: fix interrupt masking on release
	media: cec.h: CEC_OP_REC_FLAG_ values were swapped
	cpuidle: Do not unset the driver if it is there already
	erofs: zero out when listxattr is called with no xattr
	intel_th: Fix a double put_device() in error path
	intel_th: pci: Add Ice Lake CPU support
	intel_th: pci: Add Tiger Lake CPU support
	PM / devfreq: Lock devfreq in trans_stat_show
	cpufreq: powernv: fix stack bloat and hard limit on number of CPUs
	ACPI / hotplug / PCI: Allocate resources directly under the non-hotplug bridge
	ACPI: OSL: only free map once in osl.c
	ACPI: bus: Fix NULL pointer check in acpi_bus_get_private_data()
	ACPI: PM: Avoid attaching ACPI PM domain to certain devices
	pinctrl: armada-37xx: Fix irq mask access in armada_37xx_irq_set_type()
	pinctrl: samsung: Add of_node_put() before return in error path
	pinctrl: samsung: Fix device node refcount leaks in Exynos wakeup controller init
	pinctrl: samsung: Fix device node refcount leaks in S3C24xx wakeup controller init
	pinctrl: samsung: Fix device node refcount leaks in init code
	pinctrl: samsung: Fix device node refcount leaks in S3C64xx wakeup controller init
	mmc: host: omap_hsmmc: add code for special init of wl1251 to get rid of pandora_wl1251_init_card
	ARM: dts: omap3-tao3530: Fix incorrect MMC card detection GPIO polarity
	ppdev: fix PPGETTIME/PPSETTIME ioctls
	powerpc: Allow 64bit VDSO __kernel_sync_dicache to work across ranges >4GB
	powerpc/xive: Prevent page fault issues in the machine crash handler
	powerpc: Allow flush_icache_range to work across ranges >4GB
	powerpc/xive: Skip ioremap() of ESB pages for LSI interrupts
	video/hdmi: Fix AVI bar unpack
	quota: Check that quota is not dirty before release
	ext2: check err when partial != NULL
	quota: fix livelock in dquot_writeback_dquots
	ext4: Fix credit estimate for final inode freeing
	reiserfs: fix extended attributes on the root directory
	block: fix single range discard merge
	scsi: zfcp: trace channel log even for FCP command responses
	scsi: qla2xxx: Fix DMA unmap leak
	scsi: qla2xxx: Fix hang in fcport delete path
	scsi: qla2xxx: Fix session lookup in qlt_abort_work()
	scsi: qla2xxx: Fix qla24xx_process_bidir_cmd()
	scsi: qla2xxx: Always check the qla2x00_wait_for_hba_online() return value
	scsi: qla2xxx: Fix message indicating vectors used by driver
	scsi: qla2xxx: Fix SRB leak on switch command timeout
	xhci: make sure interrupts are restored to correct state
	usb: typec: fix use after free in typec_register_port()
	omap: pdata-quirks: remove openpandora quirks for mmc3 and wl1251
	scsi: lpfc: Cap NPIV vports to 256
	scsi: lpfc: Correct code setting non existent bits in sli4 ABORT WQE
	scsi: lpfc: Correct topology type reporting on G7 adapters
	drbd: Change drbd_request_detach_interruptible's return type to int
	e100: Fix passing zero to 'PTR_ERR' warning in e100_load_ucode_wait
	pvcalls-front: don't return error when the ring is full
	sch_cake: Correctly update parent qlen when splitting GSO packets
	net/smc: do not wait under send_lock
	net: hns3: clear pci private data when unload hns3 driver
	net: hns3: change hnae3_register_ae_dev() to int
	net: hns3: Check variable is valid before assigning it to another
	scsi: hisi_sas: send primitive NOTIFY to SSP situation only
	scsi: hisi_sas: Reject setting programmed minimum linkrate > 1.5G
	x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models
	x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk
	power: supply: cpcap-battery: Fix signed counter sample register
	mlxsw: spectrum_router: Refresh nexthop neighbour when it becomes dead
	media: vimc: fix component match compare
	ath10k: fix fw crash by moving chip reset after napi disabled
	regulator: 88pm800: fix warning same module names
	powerpc: Avoid clang warnings around setjmp and longjmp
	powerpc: Fix vDSO clock_getres()
	ext4: work around deleting a file with i_nlink == 0 safely
	firmware: qcom: scm: Ensure 'a0' status code is treated as signed
	mm/shmem.c: cast the type of unmap_start to u64
	rtc: disable uie before setting time and enable after
	splice: only read in as much information as there is pipe buffer space
	ext4: fix a bug in ext4_wait_for_tail_page_commit
	mfd: rk808: Fix RK818 ID template
	mm, thp, proc: report THP eligibility for each vma
	s390/smp,vdso: fix ASCE handling
	blk-mq: make sure that line break can be printed
	workqueue: Fix missing kfree(rescuer) in destroy_workqueue()
	perf callchain: Fix segfault in thread__resolve_callchain_sample()
	gre: refetch erspan header from skb->data after pskb_may_pull()
	firmware: arm_scmi: Avoid double free in error flow
	sunrpc: fix crash when cache_head become valid before update
	net/mlx5e: Fix SFF 8472 eeprom length
	leds: trigger: netdev: fix handling on interface rename
	PCI: rcar: Fix missing MACCTLR register setting in initialization sequence
	gfs2: fix glock reference problem in gfs2_trans_remove_revoke
	of: overlay: add_changeset_property() memory leak
	kernel/module.c: wakeup processes in module_wq on module unload
	cifs: Fix potential softlockups while refreshing DFS cache
	gpiolib: acpi: Add Terra Pad 1061 to the run_edge_events_on_boot_blacklist
	raid5: need to set STRIPE_HANDLE for batch head
	scsi: qla2xxx: Change discovery state before PLOGI
	iio: imu: mpu6050: add missing available scan masks
	idr: Fix idr_get_next_ul race with idr_remove
	scsi: zorro_esp: Limit DMA transfers to 65536 bytes (except on Fastlane)
	of: unittest: fix memory leak in attach_node_and_children
	Linux 4.19.90

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I790291e9f3d3c8dd3f53e4387de25ff272ad4f39
2019-12-18 09:03:30 +01:00
Cong Wang
f7654ebe92 gre: refetch erspan header from skb->data after pskb_may_pull()
[ Upstream commit 0e4940928c26527ce8f97237fef4c8a91cd34207 ]

After pskb_may_pull() we should always refetch the header
pointers from the skb->data in case it got reallocated.

In gre_parse_header(), the erspan header is still fetched
from the 'options' pointer which is fetched before
pskb_may_pull().

Found this during code review of a KMSAN bug report.

Fixes: cb73ee40b1b3 ("net: ip_gre: use erspan key field for tunnel lookup")
Cc: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Acked-by: William Tu <u9012063@gmail.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-17 20:35:51 +01:00
Greg Kroah-Hartman
e5312e5d68 This is the 4.19.89 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl3zQ1wACgkQONu9yGCS
 aT6ZDA/+JQyM+mgrU2t5mkq9lXCwL87Jiooy0kKT9b/2EWmW5Gdxp/On9PXfqtfs
 uZ+v0A1g1H+582uwuqG1wB2jr3I2AhNnRNbvSypGtk1Kitx9HqVJD/wWRRVCULww
 cr3uA/ZOX+deRjOVYP3dhFp7ycn6u5+GxgmFQTLmKAYN8uUqq4/dpWy01iB0nr2A
 GcoLm9P96o8P/wIWaykqOvshDrocbFcBL4VuxLeZCbFsAMTiX+jJnyIL8W7gfBJl
 M2626S/hESk5DvGcMN3zwOw/nTJlvySUtfqXSvPk0sT90UMx/YZ9QdpS9GkvRb9t
 OA1G+iHguEU+Fq/DawUyxwk/kt3nA6cg0q7RSxHo7QP6SGo7OaHHS1myzGDhL8oc
 LDKXO2iSSzvXJDlqrU45N+1YhpeiIHCxmDctbUIM9dP4u6wWmQIyYXLrcpupTsm9
 StiDBguXFHWSBFhG0+MlTUU5cypVNoN+56wBAUTR6+qoDASTzGvjNbrBsQihODV0
 RMFJF17Zvn+UoEohe860EMswUBsJ+F+VSZO5yGuZgsaC/2Ih6M1dxsiNU7RF02gX
 fRis6huj1+642ZsEbd2tueYGUaDN1HpMsVkN3AAkD3pJF5lX7AJRwhvRyC8N1jhc
 G90KMSk2pR/ItjmUpkKaAhAKhN+oKSzuCPpHj2iGotfWdd4slXQ=
 =Ekyt
 -----END PGP SIGNATURE-----

Merge 4.19.89 into android-4.19

Changes in 4.19.89
	rsi: release skb if rsi_prepare_beacon fails
	arm64: tegra: Fix 'active-low' warning for Jetson TX1 regulator
	sparc64: implement ioremap_uc
	lp: fix sparc64 LPSETTIMEOUT ioctl
	usb: gadget: u_serial: add missing port entry locking
	tty: serial: fsl_lpuart: use the sg count from dma_map_sg
	tty: serial: msm_serial: Fix flow control
	serial: pl011: Fix DMA ->flush_buffer()
	serial: serial_core: Perform NULL checks for break_ctl ops
	serial: ifx6x60: add missed pm_runtime_disable
	autofs: fix a leak in autofs_expire_indirect()
	RDMA/hns: Correct the value of HNS_ROCE_HEM_CHUNK_LEN
	iwlwifi: pcie: don't consider IV len in A-MSDU
	exportfs_decode_fh(): negative pinned may become positive without the parent locked
	audit_get_nd(): don't unlock parent too early
	NFC: nxp-nci: Fix NULL pointer dereference after I2C communication error
	xfrm: release device reference for invalid state
	Input: cyttsp4_core - fix use after free bug
	sched/core: Avoid spurious lock dependencies
	perf/core: Consistently fail fork on allocation failures
	ALSA: pcm: Fix stream lock usage in snd_pcm_period_elapsed()
	drm/sun4i: tcon: Set min division of TCON0_DCLK to 1.
	selftests: kvm: fix build with glibc >= 2.30
	rsxx: add missed destroy_workqueue calls in remove
	net: ep93xx_eth: fix mismatch of request_mem_region in remove
	i2c: core: fix use after free in of_i2c_notify
	serial: core: Allow processing sysrq at port unlock time
	cxgb4vf: fix memleak in mac_hlist initialization
	iwlwifi: mvm: synchronize TID queue removal
	iwlwifi: trans: Clear persistence bit when starting the FW
	iwlwifi: mvm: Send non offchannel traffic via AP sta
	ARM: 8813/1: Make aligned 2-byte getuser()/putuser() atomic on ARMv6+
	audit: Embed key into chunk
	netfilter: nf_tables: don't use position attribute on rule replacement
	ARC: IOC: panic if kernel was started with previously enabled IOC
	net/mlx5: Release resource on error flow
	clk: sunxi-ng: a64: Fix gate bit of DSI DPHY
	ice: Fix NVM mask defines
	dlm: fix possible call to kfree() for non-initialized pointer
	ARM: dts: exynos: Fix LDO13 min values on Odroid XU3/XU4/HC1
	extcon: max8997: Fix lack of path setting in USB device mode
	net: ethernet: ti: cpts: correct debug for expired txq skb
	rtc: s3c-rtc: Avoid using broken ALMYEAR register
	rtc: max77686: Fix the returned value in case of error in 'max77686_rtc_read_time()'
	i40e: don't restart nway if autoneg not supported
	virtchnl: Fix off by one error
	clk: rockchip: fix rk3188 sclk_smc gate data
	clk: rockchip: fix rk3188 sclk_mac_lbtest parameter ordering
	ARM: dts: rockchip: Fix rk3288-rock2 vcc_flash name
	dlm: fix missing idr_destroy for recover_idr
	MIPS: SiByte: Enable ZONE_DMA32 for LittleSur
	net: dsa: mv88e6xxx: Work around mv886e6161 SERDES missing MII_PHYSID2
	scsi: zfcp: update kernel message for invalid FCP_CMND length, it's not the CDB
	scsi: zfcp: drop default switch case which might paper over missing case
	drivers: soc: Allow building the amlogic drivers without ARCH_MESON
	bus: ti-sysc: Fix getting optional clocks in clock_roles
	ARM: dts: imx6: RDU2: fix eGalax touchscreen node
	crypto: ecc - check for invalid values in the key verification test
	crypto: bcm - fix normal/non key hash algorithm failure
	arm64: dts: zynqmp: Fix node names which contain "_"
	pinctrl: qcom: ssbi-gpio: fix gpio-hog related boot issues
	Staging: iio: adt7316: Fix i2c data reading, set the data field
	firmware: raspberrypi: Fix firmware calls with large buffers
	mm/vmstat.c: fix NUMA statistics updates
	clk: rockchip: fix I2S1 clock gate register for rk3328
	clk: rockchip: fix ID of 8ch clock of I2S1 for rk3328
	sctp: count sk_wmem_alloc by skb truesize in sctp_packet_transmit
	regulator: Fix return value of _set_load() stub
	USB: serial: f81534: fix reading old/new IC config
	xfs: extent shifting doesn't fully invalidate page cache
	net-next/hinic:fix a bug in set mac address
	net-next/hinic: fix a bug in rx data flow
	ice: Fix return value from NAPI poll
	ice: Fix possible NULL pointer de-reference
	iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents
	iomap: sub-block dio needs to zeroout beyond EOF
	iomap: dio data corruption and spurious errors when pipes fill
	iomap: readpages doesn't zero page tail beyond EOF
	iw_cxgb4: only reconnect with MPAv1 if the peer aborts
	MIPS: OCTEON: octeon-platform: fix typing
	net/smc: use after free fix in smc_wr_tx_put_slot()
	math-emu/soft-fp.h: (_FP_ROUND_ZERO) cast 0 to void to fix warning
	nds32: Fix the items of hwcap_str ordering issue.
	rtc: max8997: Fix the returned value in case of error in 'max8997_rtc_read_alarm()'
	rtc: dt-binding: abx80x: fix resistance scale
	ARM: dts: exynos: Use Samsung SoC specific compatible for DWC2 module
	media: coda: fix memory corruption in case more than 32 instances are opened
	media: pulse8-cec: return 0 when invalidating the logical address
	media: cec: report Vendor ID after initialization
	iwlwifi: fix cfg structs for 22000 with different RF modules
	ravb: Clean up duplex handling
	net/ipv6: re-do dad when interface has IFF_NOARP flag change
	dmaengine: coh901318: Fix a double-lock bug
	dmaengine: coh901318: Remove unused variable
	dmaengine: dw-dmac: implement dma protection control setting
	net: qualcomm: rmnet: move null check on dev before dereferecing it
	selftests/powerpc: Allocate base registers
	selftests/powerpc: Skip test instead of failing
	usb: dwc3: debugfs: Properly print/set link state for HS
	usb: dwc3: don't log probe deferrals; but do log other error codes
	ACPI: fix acpi_find_child_device() invocation in acpi_preset_companion()
	f2fs: fix to account preflush command for noflush_merge mode
	f2fs: fix count of seg_freed to make sec_freed correct
	f2fs: change segment to section in f2fs_ioc_gc_range
	ARM: dts: rockchip: Fix the PMU interrupt number for rv1108
	ARM: dts: rockchip: Assign the proper GPIO clocks for rv1108
	f2fs: fix to allow node segment for GC by ioctl path
	sparc: Fix JIT fused branch convergance.
	sparc: Correct ctx->saw_frame_pointer logic.
	nvme: Free ctrl device name on init failure
	dma-mapping: fix return type of dma_set_max_seg_size()
	slimbus: ngd: Fix build error on x86
	altera-stapl: check for a null key before strcasecmp'ing it
	serial: imx: fix error handling in console_setup
	i2c: imx: don't print error message on probe defer
	clk: meson: Fix GXL HDMI PLL fractional bits width
	gpu: host1x: Fix syncpoint ID field size on Tegra186
	lockd: fix decoding of TEST results
	sctp: increase sk_wmem_alloc when head->truesize is increased
	iommu/amd: Fix line-break in error log reporting
	ASoC: rsnd: tidyup registering method for rsnd_kctrl_new()
	ARM: dts: sun4i: Fix gpio-keys warning
	ARM: dts: sun4i: Fix HDMI output DTC warning
	ARM: dts: sun5i: a10s: Fix HDMI output DTC warning
	ARM: dts: r8a779[01]: Disable unconnected LVDS encoders
	ARM: dts: sun7i: Fix HDMI output DTC warning
	ARM: dts: sun8i: a23/a33: Fix OPP DTC warnings
	ARM: dts: sun8i: v3s: Change pinctrl nodes to avoid warning
	dlm: NULL check before kmem_cache_destroy is not needed
	ARM: debug: enable UART1 for socfpga Cyclone5
	can: xilinx: fix return type of ndo_start_xmit function
	nfsd: fix a warning in __cld_pipe_upcall()
	bpf: btf: implement btf_name_valid_identifier()
	bpf: btf: check name validity for various types
	tools: bpftool: fix a bitfield pretty print issue
	ASoC: au8540: use 64-bit arithmetic instead of 32-bit
	ARM: OMAP1/2: fix SoC name printing
	arm64: dts: meson-gxl-libretech-cc: fix GPIO lines names
	arm64: dts: meson-gxbb-nanopi-k2: fix GPIO lines names
	arm64: dts: meson-gxbb-odroidc2: fix GPIO lines names
	arm64: dts: meson-gxl-khadas-vim: fix GPIO lines names
	net/x25: fix called/calling length calculation in x25_parse_address_block
	net/x25: fix null_x25_address handling
	tools/bpf: make libbpf _GNU_SOURCE friendly
	clk: mediatek: Drop __init from mtk_clk_register_cpumuxes()
	clk: mediatek: Drop more __init markings for driver probe
	soc: renesas: r8a77970-sysc: Correct names of A2DP/A2CN power domains
	soc: renesas: r8a77980-sysc: Correct names of A2DP[01] power domains
	soc: renesas: r8a77980-sysc: Correct A3VIP[012] power domain hierarchy
	kbuild: disable dtc simple_bus_reg warnings by default
	tcp: make tcp_space() aware of socket backlog
	ARM: dts: mmp2: fix the gpio interrupt cell number
	ARM: dts: realview-pbx: Fix duplicate regulator nodes
	tcp: fix off-by-one bug on aborting window-probing socket
	tcp: fix SNMP under-estimation on failed retransmission
	tcp: fix SNMP TCP timeout under-estimation
	modpost: skip ELF local symbols during section mismatch check
	kbuild: fix single target build for external module
	mtd: fix mtd_oobavail() incoherent returned value
	ARM: dts: pxa: clean up USB controller nodes
	clk: meson: meson8b: fix the offset of vid_pll_dco's N value
	clk: sunxi-ng: h3/h5: Fix CSI_MCLK parent
	clk: qcom: Fix MSM8998 resets
	media: cxd2880-spi: fix probe when dvb_attach fails
	ARM: dts: realview: Fix some more duplicate regulator nodes
	dlm: fix invalid cluster name warning
	net/mlx4_core: Fix return codes of unsupported operations
	pstore/ram: Avoid NULL deref in ftrace merging failure path
	powerpc/math-emu: Update macros from GCC
	clk: renesas: r8a77990: Correct parent clock of DU
	clk: renesas: r8a77995: Correct parent clock of DU
	MIPS: OCTEON: cvmx_pko_mem_debug8: use oldest forward compatible definition
	nfsd: Return EPERM, not EACCES, in some SETATTR cases
	media: uvcvideo: Abstract streaming object lifetime
	tty: serial: qcom_geni_serial: Fix softlock
	ARM: dts: sun8i: h3: Fix the system-control register range
	tty: Don't block on IO when ldisc change is pending
	media: stkwebcam: Bugfix for wrong return values
	firmware: qcom: scm: fix compilation error when disabled
	clk: qcom: gcc-msm8998: Disable halt check of UFS clocks
	sctp: frag_point sanity check
	soc: renesas: r8a77990-sysc: Fix initialization order of 3DG-{A,B}
	mlxsw: spectrum_router: Relax GRE decap matching check
	IB/hfi1: Ignore LNI errors before DC8051 transitions to Polling state
	IB/hfi1: Close VNIC sdma_progress sleep window
	mlx4: Use snprintf instead of complicated strcpy
	usb: mtu3: fix dbginfo in qmu_tx_zlp_error_handler
	clk: renesas: rcar-gen3: Set state when registering SD clocks
	ASoC: max9867: Fix power management
	ARM: dts: sunxi: Fix PMU compatible strings
	ARM: dts: am335x-pdu001: Fix polarity of card detection input
	media: vimc: fix start stream when link is disabled
	net: aquantia: fix RSS table and key sizes
	sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision
	fuse: verify nlink
	fuse: verify attributes
	ALSA: hda/realtek - Enable internal speaker of ASUS UX431FLC
	ALSA: hda/realtek - Enable the headset-mic on a Xiaomi's laptop
	ALSA: hda/realtek - Dell headphone has noise on unmute for ALC236
	ALSA: pcm: oss: Avoid potential buffer overflows
	ALSA: hda - Add mute led support for HP ProBook 645 G4
	Input: synaptics - switch another X1 Carbon 6 to RMI/SMbus
	Input: synaptics-rmi4 - re-enable IRQs in f34v7_do_reflash
	Input: synaptics-rmi4 - don't increment rmiaddr for SMBus transfers
	Input: goodix - add upside-down quirk for Teclast X89 tablet
	coresight: etm4x: Fix input validation for sysfs.
	Input: Fix memory leak in psxpad_spi_probe
	x86/mm/32: Sync only to VMALLOC_END in vmalloc_sync_all()
	x86/PCI: Avoid AMD FCH XHCI USB PME# from D0 defect
	xfrm interface: fix memory leak on creation
	xfrm interface: avoid corruption on changelink
	xfrm interface: fix list corruption for x-netns
	xfrm interface: fix management of phydev
	CIFS: Fix NULL-pointer dereference in smb2_push_mandatory_locks
	CIFS: Fix SMB2 oplock break processing
	tty: vt: keyboard: reject invalid keycodes
	can: slcan: Fix use-after-free Read in slcan_open
	kernfs: fix ino wrap-around detection
	jbd2: Fix possible overflow in jbd2_log_space_left()
	drm/msm: fix memleak on release
	drm/i810: Prevent underflow in ioctl
	arm64: dts: exynos: Revert "Remove unneeded address space mapping for soc node"
	KVM: arm/arm64: vgic: Don't rely on the wrong pending table
	KVM: x86: do not modify masked bits of shared MSRs
	KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES
	KVM: x86: Grab KVM's srcu lock when setting nested state
	crypto: crypto4xx - fix double-free in crypto4xx_destroy_sdr
	crypto: atmel-aes - Fix IV handling when req->nbytes < ivsize
	crypto: af_alg - cast ki_complete ternary op to int
	crypto: ccp - fix uninitialized list head
	crypto: ecdh - fix big endian bug in ECC library
	crypto: user - fix memory leak in crypto_report
	spi: atmel: Fix CS high support
	mwifiex: update set_mac_address logic
	can: ucan: fix non-atomic allocation in completion handler
	RDMA/qib: Validate ->show()/store() callbacks before calling them
	iomap: Fix pipe page leakage during splicing
	thermal: Fix deadlock in thermal thermal_zone_device_check
	vcs: prevent write access to vcsu devices
	binder: Fix race between mmap() and binder_alloc_print_pages()
	binder: Handle start==NULL in binder_update_page_range()
	ALSA: hda - Fix pending unsol events at shutdown
	md/raid0: Fix an error message in raid0_make_request()
	watchdog: aspeed: Fix clock behaviour for ast2600
	perf script: Fix invalid LBR/binary mismatch error
	splice: don't read more than available pipe space
	iomap: partially revert 4721a601099 (simulated directio short read on EFAULT)
	xfs: add missing error check in xfs_prepare_shift()
	ASoC: rsnd: fixup MIX kctrl registration
	KVM: x86: fix out-of-bounds write in KVM_GET_EMULATED_CPUID (CVE-2019-19332)
	net: qrtr: fix memort leak in qrtr_tun_write_iter
	appletalk: Fix potential NULL pointer dereference in unregister_snap_client
	appletalk: Set error code if register_snap_client failed
	Linux 4.19.89

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie3fa59adde9a7e9a6d4684de0e95de14a8b83d0b
2019-12-13 10:01:10 +01:00
Yuchung Cheng
01775a8dc6 tcp: fix SNMP TCP timeout under-estimation
[ Upstream commit e1561fe2dd69dc5dddd69bd73aa65355bdfb048b ]

Previously the SNMP TCPTIMEOUTS counter has inconsistent accounting:
1. It counts all SYN and SYN-ACK timeouts
2. It counts timeouts in other states except recurring timeouts and
   timeouts after fast recovery or disorder state.

Such selective accounting makes analysis difficult and complicated. For
example the monitoring system needs to collect many other SNMP counters
to infer the total amount of timeout events. This patch makes TCPTIMEOUTS
counter simply counts all the retransmit timeout (SYN or data or FIN).

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-13 08:52:20 +01:00
Yuchung Cheng
a5f37a687f tcp: fix SNMP under-estimation on failed retransmission
[ Upstream commit ec641b39457e17774313b66697a8a1dc070257bd ]

Previously the SNMP counter LINUX_MIB_TCPRETRANSFAIL is not counting
the TSO/GSO properly on failed retransmission. This patch fixes that.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-13 08:52:20 +01:00
Yuchung Cheng
85b03cfe31 tcp: fix off-by-one bug on aborting window-probing socket
[ Upstream commit 3976535af0cb9fe34a55f2ffb8d7e6b39a2f8188 ]

Previously there is an off-by-one bug on determining when to abort
a stalled window-probing socket. This patch fixes that so it is
consistent with tcp_write_timeout().

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-13 08:52:19 +01:00
Sami Tolvanen
f0723e3bd5 ANDROID: netfilter: nf_nat: remove static from nf_nat_ipv4_fn
Exporting a static function breaks the ABI toolchain when CFI is
enabled. Remove the static identifier to fix the issue.

Bug: 145210207
Change-Id: Ib23d594ae05718dcf119b8ed0b5a8251aee2561c
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2019-12-10 10:02:54 -08:00
Yuchung Cheng
d832269ac6 tcp: exit if nothing to retransmit on RTO timeout
commit 88f8598d0a302a08380eadefd09b9f5cb1c4c428 upstream.

Previously TCP only warns if its RTO timer fires and the
retransmission queue is empty, but it'll cause null pointer
reference later on. It's better to avoid such catastrophic failure
and simply exit with a warning.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-05 09:21:34 +01:00
Lorenzo Bianconi
b3c81acf23 net: ip_gre: do not report erspan_ver for gre or gretap
[ Upstream commit 2bdf700e538828d6456150b9319e5f689b062d54 ]

Report erspan version field to userspace in ipgre_fill_info just for
erspan tunnels. The issue can be triggered with the following reproducer:

$ip link add name gre1 type gre local 192.168.0.1 remote 192.168.1.1
$ip link set dev gre1 up
$ip -d link sh gre1
13: gre1@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1476 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/gre 192.168.0.1 peer 192.168.1.1 promiscuity 0 minmtu 0 maxmtu 0
    gre remote 192.168.1.1 local 192.168.0.1 ttl inherit erspan_ver 0 addrgenmode eui64 numtxqueues 1 numrxqueues 1

Fixes: f551c91de2 ("net: erspan: introduce erspan v2 for ip_gre")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05 09:21:10 +01:00
wenxu
9f1ac49402 ip_tunnel: Make none-tunnel-dst tunnel port work with lwtunnel
[ Upstream commit d71b57532d70c03f4671dd04e84157ac6bf021b0 ]

ip l add dev tun type gretap key 1000
ip a a dev tun 10.0.0.1/24

Packets with tun-id 1000 can be recived by tun dev. But packet can't
be sent through dev tun for non-tunnel-dst

With this patch: tunnel-dst can be get through lwtunnel like beflow:
ip r a 10.0.0.7 encap ip dst 172.168.0.11 dev tun

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05 09:21:07 +01:00
Taehee Yoo
e6c6c0439a net: bpfilter: fix iptables failure if bpfilter_umh is disabled
[ Upstream commit 97adaddaa6db7a8af81b9b11e30cbe3628cd6700 ]

When iptables command is executed, ip_{set/get}sockopt() try to upload
bpfilter.ko if bpfilter is enabled. if it couldn't find bpfilter.ko,
command is failed.
bpfilter.ko is generated if CONFIG_BPFILTER_UMH is enabled.
ip_{set/get}sockopt() only checks CONFIG_BPFILTER.
So that if CONFIG_BPFILTER is enabled and CONFIG_BPFILTER_UMH is disabled,
iptables command is always failed.

test config:
   CONFIG_BPFILTER=y
   # CONFIG_BPFILTER_UMH is not set

test command:
   %iptables -L
   iptables: No chain/target/match by that name.

Fixes: d2ba09c17a ("net: add skeleton of bpfilter kernel module")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-01 09:17:18 +01:00
Hangbin Liu
32d7474b7a ipv4/igmp: fix v1/v2 switchback timeout based on rfc3376, 8.12
[ Upstream commit 966c37f2d77eb44d47af8e919267b1ba675b2eca ]

Similiar with ipv6 mcast commit 89225d1ce6 ("net: ipv6: mld: fix v1/v2
switchback timeout to rfc3810, 9.12.")

i) RFC3376 8.12. Older Version Querier Present Timeout says:

   The Older Version Querier Interval is the time-out for transitioning
   a host back to IGMPv3 mode once an older version query is heard.
   When an older version query is received, hosts set their Older
   Version Querier Present Timer to Older Version Querier Interval.

   This value MUST be ((the Robustness Variable) times (the Query
   Interval in the last Query received)) plus (one Query Response
   Interval).

Currently we only use a hardcode value IGMP_V1/v2_ROUTER_PRESENT_TIMEOUT.
Fix it by adding two new items mr_qi(Query Interval) and mr_qri(Query Response
Interval) in struct in_device.

Now we can calculate the switchback time via (mr_qrv * mr_qi) + mr_qri.
We need update these values when receive IGMPv3 queries.

Reported-by: Ying Xu <yinxu@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-01 09:17:05 +01:00
Yuchung Cheng
6491a2d26c tcp: start receiver buffer autotuning sooner
[ Upstream commit 041a14d2671573611ffd6412bc16e2f64469f7fb ]

Previously receiver buffer auto-tuning starts after receiving
one advertised window amount of data. After the initial receiver
buffer was raised by patch a337531b942b ("tcp: up initial rmem to
128KB and SYN rwin to around 64KB"), the reciver buffer may take
too long to start raising. To address this issue, this patch lowers
the initial bytes expected to receive roughly the expected sender's
initial window.

Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-24 08:21:06 +01:00
Yuchung Cheng
43876b1ce4 tcp: up initial rmem to 128KB and SYN rwin to around 64KB
[ Upstream commit a337531b942bd8a03e7052444d7e36972aac2d92 ]

Previously TCP initial receive buffer is ~87KB by default and
the initial receive window is ~29KB (20 MSS). This patch changes
the two numbers to 128KB and ~64KB (rounding down to the multiples
of MSS) respectively. The patch also simplifies the calculations s.t.
the two numbers are directly controlled by sysctl tcp_rmem[1]:

  1) Initial receiver buffer budget (sk_rcvbuf): while this should
     be configured via sysctl tcp_rmem[1], previously tcp_fixup_rcvbuf()
     always override and set a larger size when a new connection
     establishes.

  2) Initial receive window in SYN: previously it is set to 20
     packets if MSS <= 1460. The number 20 was based on the initial
     congestion window of 10: the receiver needs twice amount to
     avoid being limited by the receive window upon out-of-order
     delivery in the first window burst. But since this only
     applies if the receiving MSS <= 1460, connection using large MTU
     (e.g. to utilize receiver zero-copy) may be limited by the
     receive window.

With this patch TCP memory configuration is more straight-forward and
more properly sized to modern high-speed networks by default. Several
popular stacks have been announcing 64KB rwin in SYNs as well.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-24 08:19:23 +01:00
Tan Hu
8ddec6aaad netfilter: masquerade: don't flush all conntracks if only one address deleted on device
[ Upstream commit 097f95d319f817e651bd51f8846aced92a55a6a1 ]

We configured iptables as below, which only allowed incoming data on
established connections:

iptables -t mangle -A PREROUTING -m state --state ESTABLISHED -j ACCEPT
iptables -t mangle -P PREROUTING DROP

When deleting a secondary address, current masquerade implements would
flush all conntracks on this device. All the established connections on
primary address also be deleted, then subsequent incoming data on the
connections would be dropped wrongly because it was identified as NEW
connection.

So when an address was delete, it should only flush connections related
with the address.

Signed-off-by: Tan Hu <tan.hu@zte.com.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-20 18:47:52 +01:00
Haishuang Yan
6745591c8d ip_gre: fix parsing gre header in ipgre_err
[ Upstream commit b0350d51f001e6edc13ee4f253b98b50b05dd401 ]

gre_parse_header stops parsing when csum_err is encountered, which means
tpi->key is undefined and ip_tunnel_lookup will return NULL improperly.

This patch introduce a NULL pointer as csum_err parameter. Even when
csum_err is encountered, it won't return error and continue parsing gre
header as expected.

Fixes: 9f57c67c37 ("gre: Remove support for sharing GRE protocol hook.")
Reported-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-20 18:46:48 +01:00
Guillaume Nault
66daa05750 ipmr: Fix skb headroom in ipmr_get_route().
[ Upstream commit 7901cd97963d6cbde88fa25a4a446db3554c16c6 ]

In route.c, inet_rtm_getroute_build_skb() creates an skb with no
headroom. This skb is then used by inet_rtm_getroute() which may pass
it to rt_fill_info() and, from there, to ipmr_get_route(). The later
might try to reuse this skb by cloning it and prepending an IPv4
header. But since the original skb has no headroom, skb_push() triggers
skb_under_panic():

skbuff: skb_under_panic: text:00000000ca46ad8a len:80 put:20 head:00000000cd28494e data:000000009366fd6b tail:0x3c end:0xec0 dev:veth0
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:108!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 6 PID: 587 Comm: ip Not tainted 5.4.0-rc6+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
RIP: 0010:skb_panic+0xbf/0xd0
Code: 41 a2 ff 8b 4b 70 4c 8b 4d d0 48 c7 c7 20 76 f5 8b 44 8b 45 bc 48 8b 55 c0 48 8b 75 c8 41 54 41 57 41 56 41 55 e8 75 dc 7a ff <0f> 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
RSP: 0018:ffff888059ddf0b0 EFLAGS: 00010286
RAX: 0000000000000086 RBX: ffff888060a315c0 RCX: ffffffff8abe4822
RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff88806c9a79cc
RBP: ffff888059ddf118 R08: ffffed100d9361b1 R09: ffffed100d9361b0
R10: ffff88805c68aee3 R11: ffffed100d9361b1 R12: ffff88805d218000
R13: ffff88805c689fec R14: 000000000000003c R15: 0000000000000ec0
FS:  00007f6af184b700(0000) GS:ffff88806c980000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffc8204a000 CR3: 0000000057b40006 CR4: 0000000000360ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 skb_push+0x7e/0x80
 ipmr_get_route+0x459/0x6fa
 rt_fill_info+0x692/0x9f0
 inet_rtm_getroute+0xd26/0xf20
 rtnetlink_rcv_msg+0x45d/0x630
 netlink_rcv_skb+0x1a5/0x220
 rtnetlink_rcv+0x15/0x20
 netlink_unicast+0x305/0x3a0
 netlink_sendmsg+0x575/0x730
 sock_sendmsg+0xb5/0xc0
 ___sys_sendmsg+0x497/0x4f0
 __sys_sendmsg+0xcb/0x150
 __x64_sys_sendmsg+0x48/0x50
 do_syscall_64+0xd2/0xac0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Actually the original skb used to have enough headroom, but the
reserve_skb() call was lost with the introduction of
inet_rtm_getroute_build_skb() by commit 404eb77ea7 ("ipv4: support
sport, dport and ip_proto in RTM_GETROUTE").

We could reserve some headroom again in inet_rtm_getroute_build_skb(),
but this function shouldn't be responsible for handling the special
case of ipmr_get_route(). Let's handle that directly in
ipmr_get_route() by calling skb_realloc_headroom() instead of
skb_clone().

Fixes: 404eb77ea7 ("ipv4: support sport, dport and ip_proto in RTM_GETROUTE")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-20 18:45:11 +01:00
David Ahern
88f8c39912 ipv4: Fix table id reference in fib_sync_down_addr
[ Upstream commit e0a312629fefa943534fc46f7bfbe6de3fdaf463 ]

Hendrik reported routes in the main table using source address are not
removed when the address is removed. The problem is that fib_sync_down_addr
does not account for devices in the default VRF which are associated
with the main table. Fix by updating the table id reference.

Fixes: 5a56a0b3a4 ("net: Don't delete routes in different VRFs")
Reported-by: Hendrik Donner <hd@os-cillation.de>
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12 19:20:27 +01:00
Paolo Abeni
b166e8838a ipv4: fix route update on metric change.
[ Upstream commit 0b834ba00ab5337e938c727e216e1f5249794717 ]

Since commit af4d768ad2 ("net/ipv4: Add support for specifying metric
of connected routes"), when updating an IP address with a different metric,
the associated connected route is updated, too.

Still, the mentioned commit doesn't handle properly some corner cases:

$ ip addr add dev eth0 192.168.1.0/24
$ ip addr add dev eth0 192.168.2.1/32 peer 192.168.2.2
$ ip addr add dev eth0 192.168.3.1/24
$ ip addr change dev eth0 192.168.1.0/24 metric 10
$ ip addr change dev eth0 192.168.2.1/32 peer 192.168.2.2 metric 10
$ ip addr change dev eth0 192.168.3.1/24 metric 10
$ ip -4 route
192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.0
192.168.2.2 dev eth0 proto kernel scope link src 192.168.2.1
192.168.3.0/24 dev eth0 proto kernel scope link src 192.168.2.1 metric 10

Only the last route is correctly updated.

The problem is the current test in fib_modify_prefix_metric():

	if (!(dev->flags & IFF_UP) ||
	    ifa->ifa_flags & (IFA_F_SECONDARY | IFA_F_NOPREFIXROUTE) ||
	    ipv4_is_zeronet(prefix) ||
	    prefix == ifa->ifa_local || ifa->ifa_prefixlen == 32)

Which should be the logical 'not' of the pre-existing test in
fib_add_ifaddr():

	if (!ipv4_is_zeronet(prefix) && !(ifa->ifa_flags & IFA_F_SECONDARY) &&
	    (prefix != addr || ifa->ifa_prefixlen < 32))

To properly negate the original expression, we need to change the last
logical 'or' to a logical 'and'.

Fixes: af4d768ad2 ("net/ipv4: Add support for specifying metric of connected routes")
Reported-and-suggested-by: Beniamino Galvani <bgalvani@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-10 11:27:50 +01:00
Eric Dumazet
4f3df7f1ea net: use skb_queue_empty_lockless() in busy poll contexts
[ Upstream commit 3f926af3f4d688e2e11e7f8ed04e277a14d4d4a4 ]

Busy polling usually runs without locks.
Let's use skb_queue_empty_lockless() instead of skb_queue_empty()

Also uses READ_ONCE() in __skb_try_recv_datagram() to address
a similar potential problem.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-10 11:27:49 +01:00
Eric Dumazet
eaf548feaa net: use skb_queue_empty_lockless() in poll() handlers
[ Upstream commit 3ef7cf57c72f32f61e97f8fa401bc39ea1f1a5d4 ]

Many poll() handlers are lockless. Using skb_queue_empty_lockless()
instead of skb_queue_empty() is more appropriate.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-10 11:27:48 +01:00
Eric Dumazet
afa1f5e98c udp: use skb_queue_empty_lockless()
[ Upstream commit 137a0dbe3426fd7bcfe3f8117b36a87b3590e4eb ]

syzbot reported a data-race [1].

We should use skb_queue_empty_lockless() to document that we are
not ensuring a mutual exclusion and silence KCSAN.

[1]
BUG: KCSAN: data-race in __skb_recv_udp / __udp_enqueue_schedule_skb

write to 0xffff888122474b50 of 8 bytes by interrupt on cpu 0:
 __skb_insert include/linux/skbuff.h:1852 [inline]
 __skb_queue_before include/linux/skbuff.h:1958 [inline]
 __skb_queue_tail include/linux/skbuff.h:1991 [inline]
 __udp_enqueue_schedule_skb+0x2c1/0x410 net/ipv4/udp.c:1470
 __udp_queue_rcv_skb net/ipv4/udp.c:1940 [inline]
 udp_queue_rcv_one_skb+0x7bd/0xc70 net/ipv4/udp.c:2057
 udp_queue_rcv_skb+0xb5/0x400 net/ipv4/udp.c:2074
 udp_unicast_rcv_skb.isra.0+0x7e/0x1c0 net/ipv4/udp.c:2233
 __udp4_lib_rcv+0xa44/0x17c0 net/ipv4/udp.c:2300
 udp_rcv+0x2b/0x40 net/ipv4/udp.c:2470
 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
 dst_input include/net/dst.h:442 [inline]
 ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
 process_backlog+0x1d3/0x420 net/core/dev.c:5955

read to 0xffff888122474b50 of 8 bytes by task 8921 on cpu 1:
 skb_queue_empty include/linux/skbuff.h:1494 [inline]
 __skb_recv_udp+0x18d/0x500 net/ipv4/udp.c:1653
 udp_recvmsg+0xe1/0xb10 net/ipv4/udp.c:1712
 inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
 sock_recvmsg_nosec+0x5c/0x70 net/socket.c:871
 ___sys_recvmsg+0x1a0/0x3e0 net/socket.c:2480
 do_recvmmsg+0x19a/0x5c0 net/socket.c:2601
 __sys_recvmmsg+0x1ef/0x200 net/socket.c:2680
 __do_sys_recvmmsg net/socket.c:2703 [inline]
 __se_sys_recvmmsg net/socket.c:2696 [inline]
 __x64_sys_recvmmsg+0x89/0xb0 net/socket.c:2696
 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 8921 Comm: syz-executor.4 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-10 11:27:47 +01:00
Eric Dumazet
a8a5adbbf7 udp: fix data-race in udp_set_dev_scratch()
[ Upstream commit a793183caa9afae907a0d7ddd2ffd57329369bf5 ]

KCSAN reported a data-race in udp_set_dev_scratch() [1]

The issue here is that we must not write over skb fields
if skb is shared. A similar issue has been fixed in commit
89c22d8c3b ("net: Fix skb csum races when peeking")

While we are at it, use a helper only dealing with
udp_skb_scratch(skb)->csum_unnecessary, as this allows
udp_set_dev_scratch() to be called once and thus inlined.

[1]
BUG: KCSAN: data-race in udp_set_dev_scratch / udpv6_recvmsg

write to 0xffff888120278317 of 1 bytes by task 10411 on cpu 1:
 udp_set_dev_scratch+0xea/0x200 net/ipv4/udp.c:1308
 __first_packet_length+0x147/0x420 net/ipv4/udp.c:1556
 first_packet_length+0x68/0x2a0 net/ipv4/udp.c:1579
 udp_poll+0xea/0x110 net/ipv4/udp.c:2720
 sock_poll+0xed/0x250 net/socket.c:1256
 vfs_poll include/linux/poll.h:90 [inline]
 do_select+0x7d0/0x1020 fs/select.c:534
 core_sys_select+0x381/0x550 fs/select.c:677
 do_pselect.constprop.0+0x11d/0x160 fs/select.c:759
 __do_sys_pselect6 fs/select.c:784 [inline]
 __se_sys_pselect6 fs/select.c:769 [inline]
 __x64_sys_pselect6+0x12e/0x170 fs/select.c:769
 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

read to 0xffff888120278317 of 1 bytes by task 10413 on cpu 0:
 udp_skb_csum_unnecessary include/net/udp.h:358 [inline]
 udpv6_recvmsg+0x43e/0xe90 net/ipv6/udp.c:310
 inet6_recvmsg+0xbb/0x240 net/ipv6/af_inet6.c:592
 sock_recvmsg_nosec+0x5c/0x70 net/socket.c:871
 ___sys_recvmsg+0x1a0/0x3e0 net/socket.c:2480
 do_recvmmsg+0x19a/0x5c0 net/socket.c:2601
 __sys_recvmmsg+0x1ef/0x200 net/socket.c:2680
 __do_sys_recvmmsg net/socket.c:2703 [inline]
 __se_sys_recvmmsg net/socket.c:2696 [inline]
 __x64_sys_recvmmsg+0x89/0xb0 net/socket.c:2696
 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 10413 Comm: syz-executor.0 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: 2276f58ac5 ("udp: use a separate rx queue for packet reception")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-10 11:27:45 +01:00
Eric Dumazet
0cfaf03c5d net: annotate accesses to sk->sk_incoming_cpu
[ Upstream commit 7170a977743b72cf3eb46ef6ef89885dc7ad3621 ]

This socket field can be read and written by concurrent cpus.

Use READ_ONCE() and WRITE_ONCE() annotations to document this,
and avoid some compiler 'optimizations'.

KCSAN reported :

BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv

write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0:
 sk_incoming_cpu_update include/net/sock.h:953 [inline]
 tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934
 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
 dst_input include/net/dst.h:442 [inline]
 ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
 process_backlog+0x1d3/0x420 net/core/dev.c:5955
 napi_poll net/core/dev.c:6392 [inline]
 net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
 __do_softirq+0x115/0x33f kernel/softirq.c:292
 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
 do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337
 do_softirq kernel/softirq.c:329 [inline]
 __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189

read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1:
 sk_incoming_cpu_update include/net/sock.h:952 [inline]
 tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934
 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
 dst_input include/net/dst.h:442 [inline]
 ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
 process_backlog+0x1d3/0x420 net/core/dev.c:5955
 napi_poll net/core/dev.c:6392 [inline]
 net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
 __do_softirq+0x115/0x33f kernel/softirq.c:292
 run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
 smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-10 11:27:38 +01:00
Eric Dumazet
07de738901 inet: stop leaking jiffies on the wire
[ Upstream commit a904a0693c189691eeee64f6c6b188bd7dc244e9 ]

Historically linux tried to stick to RFC 791, 1122, 2003
for IPv4 ID field generation.

RFC 6864 made clear that no matter how hard we try,
we can not ensure unicity of IP ID within maximum
lifetime for all datagrams with a given source
address/destination address/protocol tuple.

Linux uses a per socket inet generator (inet_id), initialized
at connection startup with a XOR of 'jiffies' and other
fields that appear clear on the wire.

Thiemo Nagel pointed that this strategy is a privacy
concern as this provides 16 bits of entropy to fingerprint
devices.

Let's switch to a random starting point, this is just as
good as far as RFC 6864 is concerned and does not leak
anything critical.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Thiemo Nagel <tnagel@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-10 11:27:37 +01:00
Xin Long
163901dc94 erspan: fix the tun_info options_len check for erspan
[ Upstream commit 2eb8d6d2910cfe3dc67dc056f26f3dd9c63d47cd ]

The check for !md doens't really work for ip_tunnel_info_opts(info) which
only does info + 1. Also to avoid out-of-bounds access on info, it should
ensure options_len is not less than erspan_metadata in both erspan_xmit()
and ip6erspan_tunnel_xmit().

Fixes: 1a66a836da ("gre: add collect_md mode to ERSPAN tunnel")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-10 11:27:37 +01:00
Stefano Brivio
2fa80e64de ipv4: Return -ENETUNREACH if we can't create route but saddr is valid
[ Upstream commit 595e0651d0296bad2491a4a29a7a43eae6328b02 ]

...instead of -EINVAL. An issue was found with older kernel versions
while unplugging a NFS client with pending RPCs, and the wrong error
code here prevented it from recovering once link is back up with a
configured address.

Incidentally, this is not an issue anymore since commit 4f8943f80883
("SUNRPC: Replace direct task wakeups from softirq context"), included
in 5.2-rc7, had the effect of decoupling the forwarding of this error
by using SO_ERROR in xs_wake_error(), as pointed out by Benjamin
Coddington.

To the best of my knowledge, this isn't currently causing any further
issue, but the error code doesn't look appropriate anyway, and we
might hit this in other paths as well.

In detail, as analysed by Gonzalo Siero, once the route is deleted
because the interface is down, and can't be resolved and we return
-EINVAL here, this ends up, courtesy of inet_sk_rebuild_header(),
as the socket error seen by tcp_write_err(), called by
tcp_retransmit_timer().

In turn, tcp_write_err() indirectly calls xs_error_report(), which
wakes up the RPC pending tasks with a status of -EINVAL. This is then
seen by call_status() in the SUN RPC implementation, which aborts the
RPC call calling rpc_exit(), instead of handling this as a
potentially temporary condition, i.e. as a timeout.

Return -EINVAL only if the input parameters passed to
ip_route_output_key_hash_rcu() are actually invalid (this is the case
if the specified source address is multicast, limited broadcast or
all zeroes), but return -ENETUNREACH in all cases where, at the given
moment, the given source address doesn't allow resolving the route.

While at it, drop the initialisation of err to -ENETUNREACH, which
was added to __ip_route_output_key() back then by commit
0315e38270 ("net: Fix behaviour of unreachable, blackhole and
prohibit routes"), but actually had no effect, as it was, and is,
overwritten by the fib_lookup() return code assignment, and anyway
ignored in all other branches, including the if (fl4->saddr) one:
I find this rather confusing, as it would look like -ENETUNREACH is
the "default" error, while that statement has no effect.

Also note that after commit fc75fc8339 ("ipv4: dont create routes
on down devices"), we would get -ENETUNREACH if the device is down,
but -EINVAL if the source address is specified and we can't resolve
the route, and this appears to be rather inconsistent.

Reported-by: Stefan Walter <walteste@inf.ethz.ch>
Analysed-by: Benjamin Coddington <bcodding@redhat.com>
Analysed-by: Gonzalo Siero <gsierohu@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-29 09:19:39 +01:00
Wei Wang
2ec0df4e3f ipv4: fix race condition between route lookup and invalidation
[ Upstream commit 5018c59607a511cdee743b629c76206d9c9e6d7b ]

Jesse and Ido reported the following race condition:
<CPU A, t0> - Received packet A is forwarded and cached dst entry is
taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set()

<t1> - Given Jesse has busy routers ("ingesting full BGP routing tables
from multiple ISPs"), route is added / deleted and rt_cache_flush() is
called

<CPU B, t2> - Received packet B tries to use the same cached dst entry
from t0, but rt_cache_valid() is no longer true and it is replaced in
rt_cache_route() by the newer one. This calls dst_dev_put() on the
original dst entry which assigns the blackhole netdev to 'dst->dev'

<CPU A, t3> - dst_input(skb) is called on packet A and it is dropped due
to 'dst->dev' being the blackhole netdev

There are 2 issues in the v4 routing code:
1. A per-netns counter is used to do the validation of the route. That
means whenever a route is changed in the netns, users of all routes in
the netns needs to redo lookup. v6 has an implementation of only
updating fn_sernum for routes that are affected.
2. When rt_cache_valid() returns false, rt_cache_route() is called to
throw away the current cache, and create a new one. This seems
unnecessary because as long as this route does not change, the route
cache does not need to be recreated.

To fully solve the above 2 issues, it probably needs quite some code
changes and requires careful testing, and does not suite for net branch.

So this patch only tries to add the deleted cached rt into the uncached
list, so user could still be able to use it to receive packets until
it's done.

Fixes: 95c47f9cf5 ("ipv4: call dst_dev_put() properly")
Signed-off-by: Wei Wang <weiwan@google.com>
Reported-by: Ido Schimmel <idosch@idosch.org>
Reported-by: Jesse Hathaway <jesse@mbuki-mvuki.org>
Tested-by: Jesse Hathaway <jesse@mbuki-mvuki.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Cc: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-29 09:19:38 +01:00
Josh Hunt
012363f5de udp: only do GSO if # of segs > 1
[ Upstream commit 4094871db1d65810acab3d57f6089aa39ef7f648 ]

Prior to this change an application sending <= 1MSS worth of data and
enabling UDP GSO would fail if the system had SW GSO enabled, but the
same send would succeed if HW GSO offload is enabled. In addition to this
inconsistency the error in the SW GSO case does not get back to the
application if sending out of a real device so the user is unaware of this
failure.

With this change we only perform GSO if the # of segments is > 1 even
if the application has enabled segmentation. I've also updated the
relevant udpgso selftests.

Fixes: bec1f6f697 ("udp: generate gso with UDP_SEGMENT")
Signed-off-by: Josh Hunt <johunt@akamai.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-07 18:57:24 +02:00
Josh Hunt
544aee5461 udp: fix gso_segs calculations
[ Upstream commit 44b321e5020d782ad6e8ae8183f09b163be6e6e2 ]

Commit dfec0ee22c ("udp: Record gso_segs when supporting UDP segmentation offload")
added gso_segs calculation, but incorrectly got sizeof() the pointer and
not the underlying data type. In addition let's fix the v6 case.

Fixes: bec1f6f697 ("udp: generate gso with UDP_SEGMENT")
Fixes: dfec0ee22c ("udp: Record gso_segs when supporting UDP segmentation offload")
Signed-off-by: Josh Hunt <johunt@akamai.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-07 18:57:23 +02:00
Paolo Abeni
124b64feaf net: ipv4: avoid mixed n_redirects and rate_tokens usage
[ Upstream commit b406472b5ad79ede8d10077f0c8f05505ace8b6d ]

Since commit c09551c6ff7f ("net: ipv4: use a dedicated counter
for icmp_v4 redirect packets") we use 'n_redirects' to account
for redirect packets, but we still use 'rate_tokens' to compute
the redirect packets exponential backoff.

If the device sent to the relevant peer any ICMP error packet
after sending a redirect, it will also update 'rate_token' according
to the leaking bucket schema; typically 'rate_token' will raise
above BITS_PER_LONG and the redirect packets backoff algorithm
will produce undefined behavior.

Fix the issue using 'n_redirects' to compute the exponential backoff
in ip_rt_send_redirect().

Note that we still clear rate_tokens after a redirect silence period,
to avoid changing an established behaviour.

The root cause predates git history; before the mentioned commit in
the critical scenario, the kernel stopped sending redirects, after
the mentioned commit the behavior more randomic.

Reported-by: Xiumei Mu <xmu@redhat.com>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Fixes: c09551c6ff7f ("net: ipv4: use a dedicated counter for icmp_v4 redirect packets")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-07 18:57:21 +02:00
Haishuang Yan
7f30c44b7c erspan: remove the incorrect mtu limit for erspan
[ Upstream commit 0e141f757b2c78c983df893e9993313e2dc21e38 ]

erspan driver calls ether_setup(), after commit 61e84623ac
("net: centralize net_device min/max MTU checking"), the range
of mtu is [min_mtu, max_mtu], which is [68, 1500] by default.

It causes the dev mtu of the erspan device to not be greater
than 1500, this limit value is not correct for ipgre tap device.

Tested:
Before patch:
# ip link set erspan0 mtu 1600
Error: mtu greater than device maximum.
After patch:
# ip link set erspan0 mtu 1600
# ip -d link show erspan0
21: erspan0@NONE: <BROADCAST,MULTICAST> mtu 1600 qdisc noop state DOWN
mode DEFAULT group default qlen 1000
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 0

Fixes: 61e84623ac ("net: centralize net_device min/max MTU checking")
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-07 18:57:20 +02:00
Eric Dumazet
3fdcf6a88d tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state
[ Upstream commit a66b10c05ee2d744189e9a2130394b070883d289 ]

Yuchung Cheng and Marek Majkowski independently reported a weird
behavior of TCP_USER_TIMEOUT option when used at connect() time.

When the TCP_USER_TIMEOUT is reached, tcp_write_timeout()
believes the flow should live, and the following condition
in tcp_clamp_rto_to_user_timeout() programs one jiffie timers :

    remaining = icsk->icsk_user_timeout - elapsed;
    if (remaining <= 0)
        return 1; /* user timeout has passed; fire ASAP */

This silly situation ends when the max syn rtx count is reached.

This patch makes sure we honor both TCP_SYNCNT and TCP_USER_TIMEOUT,
avoiding these spurious SYN packets.

Fixes: b701a99e43 ("tcp: Add tcp_clamp_rto_to_user_timeout() helper to improve accuracy")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yuchung Cheng <ycheng@google.com>
Reported-by: Marek Majkowski <marek@cloudflare.com>
Cc: Jon Maxwell <jmaxwell37@gmail.com>
Link: https://marc.info/?l=linux-netdev&m=156940118307949&w=2
Acked-by: Jon Maxwell <jmaxwell37@gmail.com>
Tested-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Marek Majkowski <marek@cloudflare.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05 13:09:31 +02:00
Stephen Hemminger
8ffd7ba9ff net: don't warn in inet diag when IPV6 is disabled
[ Upstream commit 1e64d7cbfdce4887008314d5b367209582223f27 ]

If IPV6 was disabled, then ss command would cause a kernel warning
because the command was attempting to dump IPV6 socket information.
The fix is to just remove the warning.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202249
Fixes: 432490f9d4 ("net: ip, diag -- Add diag interface for raw sockets")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-01 08:26:11 +02:00
Willem de Bruijn
fdd60d80c4 udp: correct reuseport selection with connected sockets
[ Upstream commit acdcecc61285faed359f1a3568c32089cc3a8329 ]

UDP reuseport groups can hold a mix unconnected and connected sockets.
Ensure that connections only receive all traffic to their 4-tuple.

Fast reuseport returns on the first reuseport match on the assumption
that all matches are equal. Only if connections are present, return to
the previous behavior of scoring all sockets.

Record if connections are present and if so (1) treat such connected
sockets as an independent match from the group, (2) only return
2-tuple matches from reuseport and (3) do not return on the first
2-tuple reuseport match to allow for a higher scoring match later.

New field has_conns is set without locks. No other fields in the
bitmap are modified at runtime and the field is only ever set
unconditionally, so an RMW cannot miss a change.

Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Link: http://lkml.kernel.org/r/CA+FuTSfRP09aJNYRt04SS6qj22ViiOEWaWmLAwX0psk8-PGNxw@mail.gmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Craig Gallek <kraig@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-21 07:16:43 +02:00
Neal Cardwell
67fe3b94a8 tcp: fix tcp_ecn_withdraw_cwr() to clear TCP_ECN_QUEUE_CWR
[ Upstream commit af38d07ed391b21f7405fa1f936ca9686787d6d2 ]

Fix tcp_ecn_withdraw_cwr() to clear the correct bit:
TCP_ECN_QUEUE_CWR.

Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about
the behavior of data receivers, and deciding whether to reflect
incoming IP ECN CE marks as outgoing TCP th->ece marks. The
TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders,
and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function
is only called from tcp_undo_cwnd_reduction() by data senders during
an undo, so it should zero the sender-side state,
TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of
incoming CE bits on incoming data packets just because outgoing
packets were spuriously retransmitted.

The bug has been reproduced with packetdrill to manifest in a scenario
with RFC3168 ECN, with an incoming data packet with CE bit set and
carrying a TCP timestamp value that causes cwnd undo. Before this fix,
the IP CE bit was ignored and not reflected in the TCP ECE header bit,
and sender sent a TCP CWR ('W') bit on the next outgoing data packet,
even though the cwnd reduction had been undone.  After this fix, the
sender properly reflects the CE bit and does not set the W bit.

Note: the bug actually predates 2005 git history; this Fixes footer is
chosen to be the oldest SHA1 I have tested (from Sep 2007) for which
the patch applies cleanly (since before this commit the code was in a
.h file).

Fixes: bdf1ee5d3b ("[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-19 09:09:32 +02:00
Eric Dumazet
5977bc19ce tcp: remove empty skb from write queue in error cases
[ Upstream commit fdfc5c8594c24c5df883583ebd286321a80e0a67 ]

Vladimir Rutsky reported stuck TCP sessions after memory pressure
events. Edge Trigger epoll() user would never receive an EPOLLOUT
notification allowing them to retry a sendmsg().

Jason tested the case of sk_stream_alloc_skb() returning NULL,
but there are other paths that could lead both sendmsg() and sendpage()
to return -1 (EAGAIN), with an empty skb queued on the write queue.

This patch makes sure we remove this empty skb so that
Jason code can detect that the queue is empty, and
call sk->sk_write_space(sk) accordingly.

Fixes: ce5ec44099 ("tcp: ensure epoll edge trigger wakeup when write queue is empty")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Baron <jbaron@akamai.com>
Reported-by: Vladimir Rutsky <rutsky@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-10 10:33:40 +01:00
Willem de Bruijn
6f31263798 tcp: inherit timestamp on mtu probe
[ Upstream commit 888a5c53c0d8be6e98bc85b677f179f77a647873 ]

TCP associates tx timestamp requests with a byte in the bytestream.
If merging skbs in tcp_mtu_probe, migrate the tstamp request.

Similar to MSG_EOR, do not allow moving a timestamp from any segment
in the probe but the last. This to avoid merging multiple timestamps.

Tested with the packetdrill script at
https://github.com/wdebruij/packetdrill/commits/mtu_probe-1

Link: http://patchwork.ozlabs.org/patch/1143278/#2232897
Fixes: 4ed2d765df ("net-timestamp: TCP timestamping")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-10 10:33:39 +01:00
Hangbin Liu
9febfd30ae ipv4/icmp: fix rt dst dev null pointer dereference
[ Upstream commit e2c693934194fd3b4e795635934883354c06ebc9 ]

In __icmp_send() there is a possibility that the rt->dst.dev is NULL,
e,g, with tunnel collect_md mode, which will cause kernel crash.
Here is what the code path looks like, for GRE:

- ip6gre_tunnel_xmit
  - ip6gre_xmit_ipv4
    - __gre6_xmit
      - ip6_tnl_xmit
        - if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
    - icmp_send
      - net = dev_net(rt->dst.dev); <-- here

The reason is __metadata_dst_init() init dst->dev to NULL by default.
We could not fix it in __metadata_dst_init() as there is no dev supplied.
On the other hand, the reason we need rt->dst.dev is to get the net.
So we can just try get it from skb->dev when rt->dst.dev is NULL.

v4: Julian Anastasov remind skb->dev also could be NULL. We'd better
still use dst.dev and do a check to avoid crash.

v3: No changes.

v2: fix the issue in __icmp_send() instead of updating shared dst dev
in {ip_md, ip6}_tunnel_xmit.

Fixes: c8b34e680a09 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-06 10:22:07 +02:00
Miaohe Lin
307b6e5d90 netfilter: Fix rpfilter dropping vrf packets by mistake
[ Upstream commit b575b24b8eee37f10484e951b62ce2a31c579775 ]

When firewalld is enabled with ipv4/ipv6 rpfilter, vrf
ipv4/ipv6 packets will be dropped. Vrf device will pass
through netfilter hook twice. One with enslaved device
and another one with l3 master device. So in device may
dismatch witch out device because out device is always
enslaved device.So failed with the check of the rpfilter
and drop the packets by mistake.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-16 10:12:43 +02:00
Haishuang Yan
f186fb5ccf ipip: validate header length in ipip_tunnel_xmit
[ Upstream commit 47d858d0bdcd47cc1c6c9eeca91b091dd9e55637 ]

We need the same checks introduced by commit cb9f1b783850
("ip: validate header length on virtual device xmit") for
ipip tunnel.

Fixes: cb9f1b783850b ("ip: validate header length on virtual device xmit")
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-09 17:52:30 +02:00
Xin Long
4736bb2777 ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULL
commit 5684abf7020dfc5f0b6ba1d68eda3663871fce52 upstream.

iptunnel_xmit() works as a common function, also used by a udp tunnel
which doesn't have to have a tunnel device, like how TIPC works with
udp media.

In these cases, we should allow not to count pkts on dev's tstats, so
that udp tunnel can work with no tunnel device safely.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Tommi Rantala <tommi.t.rantala@nokia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-04 09:30:57 +02:00
Christoph Paasch
1b200acde4 tcp: Reset bytes_acked and bytes_received when disconnecting
[ Upstream commit e858faf556d4e14c750ba1e8852783c6f9520a0e ]

If an app is playing tricks to reuse a socket via tcp_disconnect(),
bytes_acked/received needs to be reset to 0. Otherwise tcp_info will
report the sum of the current and the old connection..

Cc: Eric Dumazet <edumazet@google.com>
Fixes: 0df48c26d8 ("tcp: add tcpi_bytes_acked to tcp_info")
Fixes: bdd1f9edac ("tcp: add tcpi_bytes_received to tcp_info")
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-28 08:29:26 +02:00
Eric Dumazet
c60f57dfe9 tcp: fix tcp_set_congestion_control() use from bpf hook
[ Upstream commit 8d650cdedaabb33e85e9b7c517c0c71fcecc1de9 ]

Neal reported incorrect use of ns_capable() from bpf hook.

bpf_setsockopt(...TCP_CONGESTION...)
  -> tcp_set_congestion_control()
   -> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)
    -> ns_capable_common()
     -> current_cred()
      -> rcu_dereference_protected(current->cred, 1)

Accessing 'current' in bpf context makes no sense, since packets
are processed from softirq context.

As Neal stated : The capability check in tcp_set_congestion_control()
was written assuming a system call context, and then was reused from
a BPF call site.

The fix is to add a new parameter to tcp_set_congestion_control(),
so that the ns_capable() call is only performed under the right
context.

Fixes: 91b5b21c7c ("bpf: Add support for changing congestion control")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
Reported-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-28 08:29:26 +02:00
Eric Dumazet
6323c238bb tcp: be more careful in tcp_fragment()
[ Upstream commit b617158dc096709d8600c53b6052144d12b89fab ]

Some applications set tiny SO_SNDBUF values and expect
TCP to just work. Recent patches to address CVE-2019-11478
broke them in case of losses, since retransmits might
be prevented.

We should allow these flows to make progress.

This patch allows the first and last skb in retransmit queue
to be split even if memory limits are hit.

It also adds the some room due to the fact that tcp_sendmsg()
and tcp_sendpage() might overshoot sk_wmem_queued by about one full
TSO skb (64KB size). Note this allowance was already present
in stable backports for kernels < 4.15

Note for < 4.15 backports :
 tcp_rtx_queue_tail() will probably look like :

static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
{
	struct sk_buff *skb = tcp_send_head(sk);

	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
}

Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrew Prout <aprout@ll.mit.edu>
Tested-by: Andrew Prout <aprout@ll.mit.edu>
Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Tested-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Christoph Paasch <cpaasch@apple.com>
Cc: Jonathan Looney <jtl@netflix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-28 08:29:25 +02:00
Matteo Croce
47ce442783 ipv4: don't set IPv6 only flags to IPv4 addresses
[ Upstream commit 2e60546368165c2449564d71f6005dda9205b5fb ]

Avoid the situation where an IPV6 only flag is applied to an IPv4 address:

    # ip addr add 192.0.2.1/24 dev dummy0 nodad home mngtmpaddr noprefixroute
    # ip -4 addr show dev dummy0
    2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
        inet 192.0.2.1/24 scope global noprefixroute dummy0
           valid_lft forever preferred_lft forever

Or worse, by sending a malicious netlink command:

    # ip -4 addr show dev dummy0
    2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
        inet 192.0.2.1/24 scope global nodad optimistic dadfailed home tentative mngtmpaddr noprefixroute stable-privacy dummy0
           valid_lft forever preferred_lft forever

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-28 08:29:23 +02:00
Eric Dumazet
aee5dd0034 igmp: fix memory leak in igmpv3_del_delrec()
[ Upstream commit e5b1c6c6277d5a283290a8c033c72544746f9b5b ]

im->tomb and/or im->sources might not be NULL, but we
currently overwrite their values blindly.

Using swap() will make sure the following call to kfree_pmc(pmc)
will properly free the psf structures.

Tested with the C repro provided by syzbot, which basically does :

 socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 3
 setsockopt(3, SOL_IP, IP_ADD_MEMBERSHIP, "\340\0\0\2\177\0\0\1\0\0\0\0", 12) = 0
 ioctl(3, SIOCSIFFLAGS, {ifr_name="lo", ifr_flags=0}) = 0
 setsockopt(3, SOL_IP, IP_MSFILTER, "\340\0\0\2\177\0\0\1\1\0\0\0\1\0\0\0\377\377\377\377", 20) = 0
 ioctl(3, SIOCSIFFLAGS, {ifr_name="lo", ifr_flags=IFF_UP}) = 0
 exit_group(0)                    = ?

BUG: memory leak
unreferenced object 0xffff88811450f140 (size 64):
  comm "softirq", pid 0, jiffies 4294942448 (age 32.070s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 ff ff ff ff 00 00 00 00  ................
    00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000c7bad083>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
    [<00000000c7bad083>] slab_post_alloc_hook mm/slab.h:439 [inline]
    [<00000000c7bad083>] slab_alloc mm/slab.c:3326 [inline]
    [<00000000c7bad083>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
    [<000000009acc4151>] kmalloc include/linux/slab.h:547 [inline]
    [<000000009acc4151>] kzalloc include/linux/slab.h:742 [inline]
    [<000000009acc4151>] ip_mc_add1_src net/ipv4/igmp.c:1976 [inline]
    [<000000009acc4151>] ip_mc_add_src+0x36b/0x400 net/ipv4/igmp.c:2100
    [<000000004ac14566>] ip_mc_msfilter+0x22d/0x310 net/ipv4/igmp.c:2484
    [<0000000052d8f995>] do_ip_setsockopt.isra.0+0x1795/0x1930 net/ipv4/ip_sockglue.c:959
    [<000000004ee1e21f>] ip_setsockopt+0x3b/0xb0 net/ipv4/ip_sockglue.c:1248
    [<0000000066cdfe74>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2618
    [<000000009383a786>] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3126
    [<00000000d8ac0c94>] __sys_setsockopt+0x98/0x120 net/socket.c:2072
    [<000000001b1e9666>] __do_sys_setsockopt net/socket.c:2083 [inline]
    [<000000001b1e9666>] __se_sys_setsockopt net/socket.c:2080 [inline]
    [<000000001b1e9666>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2080
    [<00000000420d395e>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
    [<000000007fd83a4b>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 24803f38a5 ("igmp: do not remove igmp souce list info when set link down")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hangbin Liu <liuhangbin@gmail.com>
Reported-by: syzbot+6ca1abd0db68b5173a4f@syzkaller.appspotmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-28 08:29:23 +02:00
Martin KaFai Lau
79c6a8c099 bpf: udp: Avoid calling reuseport's bpf_prog from udp_gro
commit 257a525fe2e49584842c504a92c27097407f778f upstream.

When the commit a6024562ff ("udp: Add GRO functions to UDP socket")
added udp[46]_lib_lookup_skb to the udp_gro code path, it broke
the reuseport_select_sock() assumption that skb->data is pointing
to the transport header.

This patch follows an earlier __udp6_lib_err() fix by
passing a NULL skb to avoid calling the reuseport's bpf_prog.

Fixes: a6024562ff ("udp: Add GRO functions to UDP socket")
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03 13:14:48 +02:00
Daniel Borkmann
613bc37f74 bpf: fix unconnected udp hooks
commit 983695fa676568fc0fe5ddd995c7267aabc24632 upstream.

Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
to applications as also stated in original motivation in 7828f20e37 ("Merge
branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
two hooks into Cilium to enable host based load-balancing with Kubernetes,
I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
typically sets up DNS as a service and is thus subject to load-balancing.

Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
is currently insufficient and thus not usable as-is for standard applications
shipped with most distros. To break down the issue we ran into with a simple
example:

  # cat /etc/resolv.conf
  nameserver 147.75.207.207
  nameserver 147.75.207.208

For the purpose of a simple test, we set up above IPs as service IPs and
transparently redirect traffic to a different DNS backend server for that
node:

  # cilium service list
  ID   Frontend            Backend
  1    147.75.207.207:53   1 => 8.8.8.8:53
  2    147.75.207.208:53   1 => 8.8.8.8:53

The attached BPF program is basically selecting one of the backends if the
service IP/port matches on the cgroup hook. DNS breaks here, because the
hooks are not transparent enough to applications which have built-in msg_name
address checks:

  # nslookup 1.1.1.1
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  [...]
  ;; connection timed out; no servers could be reached

  # dig 1.1.1.1
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  [...]

  ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
  ;; global options: +cmd
  ;; connection timed out; no servers could be reached

For comparison, if none of the service IPs is used, and we tell nslookup
to use 8.8.8.8 directly it works just fine, of course:

  # nslookup 1.1.1.1 8.8.8.8
  1.1.1.1.in-addr.arpa	name = one.one.one.one.

In order to fix this and thus act more transparent to the application,
this needs reverse translation on recvmsg() side. A minimal fix for this
API is to add similar recvmsg() hooks behind the BPF cgroups static key
such that the program can track state and replace the current sockaddr_in{,6}
with the original service IP. From BPF side, this basically tracks the
service tuple plus socket cookie in an LRU map where the reverse NAT can
then be retrieved via map value as one example. Side-note: the BPF cgroups
static key should be converted to a per-hook static key in future.

Same example after this fix:

  # cilium service list
  ID   Frontend            Backend
  1    147.75.207.207:53   1 => 8.8.8.8:53
  2    147.75.207.208:53   1 => 8.8.8.8:53

Lookups work fine now:

  # nslookup 1.1.1.1
  1.1.1.1.in-addr.arpa    name = one.one.one.one.

  Authoritative answers can be found from:

  # dig 1.1.1.1

  ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
  ;; global options: +cmd
  ;; Got answer:
  ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
  ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

  ;; OPT PSEUDOSECTION:
  ; EDNS: version: 0, flags:; udp: 512
  ;; QUESTION SECTION:
  ;1.1.1.1.                       IN      A

  ;; AUTHORITY SECTION:
  .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400

  ;; Query time: 17 msec
  ;; SERVER: 147.75.207.207#53(147.75.207.207)
  ;; WHEN: Tue May 21 12:59:38 UTC 2019
  ;; MSG SIZE  rcvd: 111

And from an actual packet level it shows that we're using the back end
server when talking via 147.75.207.20{7,8} front end:

  # tcpdump -i any udp
  [...]
  12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
  12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
  12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
  12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
  [...]

In order to be flexible and to have same semantics as in sendmsg BPF
programs, we only allow return codes in [1,1] range. In the sendmsg case
the program is called if msg->msg_name is present which can be the case
in both, connected and unconnected UDP.

The former only relies on the sockaddr_in{,6} passed via connect(2) if
passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
way to call into the BPF program whenever a non-NULL msg->msg_name was
passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
that for TCP case, the msg->msg_name is ignored in the regular recvmsg
path and therefore not relevant.

For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
the hook is not called. This is intentional as it aligns with the same
semantics as in case of TCP cgroup BPF hooks right now. This might be
better addressed in future through a different bpf_attach_type such
that this case can be distinguished from the regular recvmsg paths,
for example.

Fixes: 1cedee13d2 ("bpf: Hooks for sys_sendmsg")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03 13:14:48 +02:00
Stephen Suryaputra
7c92f3efba ipv4: Use return value of inet_iif() for __raw_v4_lookup in the while loop
[ Upstream commit 38c73529de13e1e10914de7030b659a2f8b01c3b ]

In commit 19e4e768064a8 ("ipv4: Fix raw socket lookup for local
traffic"), the dif argument to __raw_v4_lookup() is coming from the
returned value of inet_iif() but the change was done only for the first
lookup. Subsequent lookups in the while loop still use skb->dev->ifIndex.

Fixes: 19e4e768064a8 ("ipv4: Fix raw socket lookup for local traffic")
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03 13:14:46 +02:00
Eric Dumazet
dad3a9314a tcp: refine memory limit test in tcp_fragment()
commit b6653b3629e5b88202be3c9abc44713973f5c4b4 upstream.

tcp_fragment() might be called for skbs in the write queue.

Memory limits might have been exceeded because tcp_sendmsg() only
checks limits at full skb (64KB) boundaries.

Therefore, we need to make sure tcp_fragment() wont punish applications
that might have setup very low SO_SNDBUF values.

Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Christoph Paasch <cpaasch@apple.com>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-22 11:23:18 +02:00
Eric Dumazet
59222807fc tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
commit 967c05aee439e6e5d7d805e195b3a20ef5c433d6 upstream.

If mtu probing is enabled tcp_mtu_probing() could very well end up
with a too small MSS.

Use the new sysctl tcp_min_snd_mss to make sure MSS search
is performed in an acceptable range.

CVE-2019-11479 -- tcp mss hardcoded to 48

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Bruce Curtis <brucec@netflix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-17 19:51:56 +02:00
Eric Dumazet
7f9f8a37e5 tcp: add tcp_min_snd_mss sysctl
commit 5f3e2bf008c2221478101ee72f5cb4654b9fc363 upstream.

Some TCP peers announce a very small MSS option in their SYN and/or
SYN/ACK messages.

This forces the stack to send packets with a very high network/cpu
overhead.

Linux has enforced a minimal value of 48. Since this value includes
the size of TCP options, and that the options can consume up to 40
bytes, this means that each segment can include only 8 bytes of payload.

In some cases, it can be useful to increase the minimal value
to a saner value.

We still let the default to 48 (TCP_MIN_SND_MSS), for compatibility
reasons.

Note that TCP_MAXSEG socket option enforces a minimal value
of (TCP_MIN_MSS). David Miller increased this minimal value
in commit c39508d6f1 ("tcp: Make TCP_MAXSEG minimum more correct.")
from 64 to 88.

We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS.

CVE-2019-11479 -- tcp mss hardcoded to 48

Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Bruce Curtis <brucec@netflix.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-17 19:51:56 +02:00
Eric Dumazet
ec83921899 tcp: tcp_fragment() should apply sane memory limits
commit f070ef2ac66716357066b683fb0baf55f8191a2e upstream.

Jonathan Looney reported that a malicious peer can force a sender
to fragment its retransmit queue into tiny skbs, inflating memory
usage and/or overflow 32bit counters.

TCP allows an application to queue up to sk_sndbuf bytes,
so we need to give some allowance for non malicious splitting
of retransmit queue.

A new SNMP counter is added to monitor how many times TCP
did not allow to split an skb if the allowance was exceeded.

Note that this counter might increase in the case applications
use SO_SNDBUF socket option to lower sk_sndbuf.

CVE-2019-11478 : tcp_fragment, prevent fragmenting a packet when the
	socket is already using more than half the allowed space

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Bruce Curtis <brucec@netflix.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-17 19:51:56 +02:00
Eric Dumazet
c09be31461 tcp: limit payload size of sacked skbs
commit 3b4929f65b0d8249f19a50245cd88ed1a2f78cff upstream.

Jonathan Looney reported that TCP can trigger the following crash
in tcp_shifted_skb() :

	BUG_ON(tcp_skb_pcount(skb) < pcount);

This can happen if the remote peer has advertized the smallest
MSS that linux TCP accepts : 48

An skb can hold 17 fragments, and each fragment can hold 32KB
on x86, or 64KB on PowerPC.

This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs
can overflow.

Note that tcp_sendmsg() builds skbs with less than 64KB
of payload, so this problem needs SACK to be enabled.
SACK blocks allow TCP to coalesce multiple skbs in the retransmit
queue, thus filling the 17 fragments to maximal capacity.

CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs

Fixes: 832d11c5cd ("tcp: Try to restore large SKBs while SACK processing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Bruce Curtis <brucec@netflix.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-17 19:51:56 +02:00
Xin Long
daa11cc841 ipv4: not do cache for local delivery if bc_forwarding is enabled
[ Upstream commit 0a90478b93a46bdcd56ba33c37566a993e455d54 ]

With the topo:

    h1 ---| rp1            |
          |     route  rp3 |--- h3 (192.168.200.1)
    h2 ---| rp2            |

If rp1 bc_forwarding is set while rp2 bc_forwarding is not, after
doing "ping 192.168.200.255" on h1, then ping 192.168.200.255 on
h2, and the packets can still be forwared.

This issue was caused by the input route cache. It should only do
the cache for either bc forwarding or local delivery. Otherwise,
local delivery can use the route cache for bc forwarding of other
interfaces.

This patch is to fix it by not doing cache for local delivery if
all.bc_forwarding is enabled.

Note that we don't fix it by checking route cache local flag after
rt_cache_valid() in "local_input:" and "ip_mkroute_input", as the
common route code shouldn't be touched for bc_forwarding.

Fixes: 5cbf777cfd ("route: add support for directed broadcast forwarding")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-11 12:20:47 +02:00
Eric Dumazet
46702dd5d5 ipv4/igmp: fix build error if !CONFIG_IP_MULTICAST
[ Upstream commit 903869bd10e6719b9df6718e785be7ec725df59f ]

ip_sf_list_clear_all() needs to be defined even if !CONFIG_IP_MULTICAST

Fixes: 3580d04aa674 ("ipv4/igmp: fix another memory leak in igmpv3_del_delrec()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-04 08:02:31 +02:00
Eric Dumazet
e9f94e480f ipv4/igmp: fix another memory leak in igmpv3_del_delrec()
[ Upstream commit 3580d04aa674383c42de7b635d28e52a1e5bc72c ]

syzbot reported memory leaks [1] that I have back tracked to
a missing cleanup from igmpv3_del_delrec() when
(im->sfmode != MCAST_INCLUDE)

Add ip_sf_list_clear_all() and kfree_pmc() helpers to explicitely
handle the cleanups before freeing.

[1]

BUG: memory leak
unreferenced object 0xffff888123e32b00 (size 64):
  comm "softirq", pid 0, jiffies 4294942968 (age 8.010s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 e0 00 00 01 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<000000006105011b>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
    [<000000006105011b>] slab_post_alloc_hook mm/slab.h:439 [inline]
    [<000000006105011b>] slab_alloc mm/slab.c:3326 [inline]
    [<000000006105011b>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
    [<000000004bba8073>] kmalloc include/linux/slab.h:547 [inline]
    [<000000004bba8073>] kzalloc include/linux/slab.h:742 [inline]
    [<000000004bba8073>] ip_mc_add1_src net/ipv4/igmp.c:1961 [inline]
    [<000000004bba8073>] ip_mc_add_src+0x36b/0x400 net/ipv4/igmp.c:2085
    [<00000000a46a65a0>] ip_mc_msfilter+0x22d/0x310 net/ipv4/igmp.c:2475
    [<000000005956ca89>] do_ip_setsockopt.isra.0+0x1795/0x1930 net/ipv4/ip_sockglue.c:957
    [<00000000848e2d2f>] ip_setsockopt+0x3b/0xb0 net/ipv4/ip_sockglue.c:1246
    [<00000000b9db185c>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616
    [<000000003028e438>] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3130
    [<0000000015b65589>] __sys_setsockopt+0x98/0x120 net/socket.c:2078
    [<00000000ac198ef0>] __do_sys_setsockopt net/socket.c:2089 [inline]
    [<00000000ac198ef0>] __se_sys_setsockopt net/socket.c:2086 [inline]
    [<00000000ac198ef0>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086
    [<000000000a770437>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
    [<00000000d3adb93b>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 9c8bb163ae ("igmp, mld: Fix memory leak in igmpv3/mld_del_delrec()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hangbin Liu <liuhangbin@gmail.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-04 08:02:31 +02:00
Eric Dumazet
07480da0c8 inet: switch IP ID generator to siphash
[ Upstream commit df453700e8d81b1bdafdf684365ee2b9431fb702 ]

According to Amit Klein and Benny Pinkas, IP ID generation is too weak
and might be used by attackers.

Even with recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix())
having 64bit key and Jenkins hash is risky.

It is time to switch to siphash and its 128bit keys.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Amit Klein <aksecurity@gmail.com>
Reported-by: Benny Pinkas <benny@pinkas.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-04 08:02:30 +02:00
Steffen Klassert
a654a73de2 xfrm4: Fix uninitialized memory read in _decode_session4
[ Upstream commit 8742dc86d0c7a9628117a989c11f04a9b6b898f3 ]

We currently don't reload pointers pointing into skb header
after doing pskb_may_pull() in _decode_session4(). So in case
pskb_may_pull() changed the pointers, we read from random
memory. Fix this by putting all the needed infos on the
stack, so that we don't need to access the header pointers
after doing pskb_may_pull().

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-25 18:23:41 +02:00
Sabrina Dubroca
3716c26250 esp4: add length check for UDP encapsulation
[ Upstream commit 8dfb4eba4100e7cdd161a8baef2d8d61b7a7e62e ]

esp_output_udp_encap can produce a length that doesn't fit in the 16
bits of a UDP header's length field. In that case, we'll send a
fragmented packet whose length is larger than IP_MAX_MTU (resulting in
"Oversized IP packet" warnings on receive) and with a bogus UDP
length.

To prevent this, add a length check to esp_output_udp_encap and return
 -EMSGSIZE on failure.

This seems to be older than git history.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-25 18:23:41 +02:00
Jeremy Sowden
159269cc64 vti4: ipip tunnel deregistration fixes.
[ Upstream commit 5483844c3fc18474de29f5d6733003526e0a9f78 ]

If tunnel registration failed during module initialization, the module
would fail to deregister the IPPROTO_COMP protocol and would attempt to
deregister the tunnel.

The tunnel was not deregistered during module-exit.

Fixes: dd9ee3444014e ("vti4: Fix a ipip packet processing bug in 'IPCOMP' virtual tunnel")
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-25 18:23:41 +02:00
David Ahern
da2e770f0c ipv4: Fix raw socket lookup for local traffic
[ Upstream commit 19e4e768064a87b073a4b4c138b55db70e0cfb9f ]

inet_iif should be used for the raw socket lookup. inet_iif considers
rt_iif which handles the case of local traffic.

As it stands, ping to a local address with the '-I <dev>' option fails
ever since ping was changed to use SO_BINDTODEVICE instead of
cmsg + IP_PKTINFO.

IPv6 works fine.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-16 19:41:29 +02:00
Shmulik Ladkani
3f611a4799 ipv4: ip_do_fragment: Preserve skb_iif during fragmentation
[ Upstream commit d2f0c961148f65bc73eda72b9fa3a4e80973cb49 ]

Previously, during fragmentation after forwarding, skb->skb_iif isn't
preserved, i.e. 'ip_copy_metadata' does not copy skb_iif from given
'from' skb.

As a result, ip_do_fragment's creates fragments with zero skb_iif,
leading to inconsistent behavior.

Assume for example an eBPF program attached at tc egress (post
forwarding) that examines __sk_buff->ingress_ifindex:
 - the correct iif is observed if forwarding path does not involve
   fragmentation/refragmentation
 - a bogus iif is observed if forwarding path involves
   fragmentation/refragmentatiom

Fix, by preserving skb_iif during 'ip_copy_metadata'.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-05 14:42:36 +02:00
ZhangXiaoxu
250e51f856 ipv4: set the tcp_min_rtt_wlen range from 0 to one day
[ Upstream commit 19fad20d15a6494f47f85d869f00b11343ee5c78 ]

There is a UBSAN report as below:
UBSAN: Undefined behaviour in net/ipv4/tcp_input.c:2877:56
signed integer overflow:
2147483647 * 1000 cannot be represented in type 'int'
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.1.0-rc4-00058-g582549e #1
Call Trace:
 <IRQ>
 dump_stack+0x8c/0xba
 ubsan_epilogue+0x11/0x60
 handle_overflow+0x12d/0x170
 ? ttwu_do_wakeup+0x21/0x320
 __ubsan_handle_mul_overflow+0x12/0x20
 tcp_ack_update_rtt+0x76c/0x780
 tcp_clean_rtx_queue+0x499/0x14d0
 tcp_ack+0x69e/0x1240
 ? __wake_up_sync_key+0x2c/0x50
 ? update_group_capacity+0x50/0x680
 tcp_rcv_established+0x4e2/0xe10
 tcp_v4_do_rcv+0x22b/0x420
 tcp_v4_rcv+0xfe8/0x1190
 ip_protocol_deliver_rcu+0x36/0x180
 ip_local_deliver+0x15b/0x1a0
 ip_rcv+0xac/0xd0
 __netif_receive_skb_one_core+0x7f/0xb0
 __netif_receive_skb+0x33/0xc0
 netif_receive_skb_internal+0x84/0x1c0
 napi_gro_receive+0x2a0/0x300
 receive_buf+0x3d4/0x2350
 ? detach_buf_split+0x159/0x390
 virtnet_poll+0x198/0x840
 ? reweight_entity+0x243/0x4b0
 net_rx_action+0x25c/0x770
 __do_softirq+0x19b/0x66d
 irq_exit+0x1eb/0x230
 do_IRQ+0x7a/0x150
 common_interrupt+0xf/0xf
 </IRQ>

It can be reproduced by:
  echo 2147483647 > /proc/sys/net/ipv4/tcp_min_rtt_wlen

Fixes: f672258391 ("tcp: track min RTT using windowed min-filter")
Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-02 09:58:59 +02:00
Eric Dumazet
07445fea95 ipv4: add sanity checks in ipv4_link_failure()
[ Upstream commit 20ff83f10f113c88d0bb74589389b05250994c16 ]

Before calling __ip_options_compile(), we need to ensure the network
header is a an IPv4 one, and that it is already pulled in skb->head.

RAW sockets going through a tunnel can end up calling ipv4_link_failure()
with total garbage in the skb, or arbitrary lengthes.

syzbot report :

BUG: KASAN: stack-out-of-bounds in memcpy include/linux/string.h:355 [inline]
BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0x294/0x1120 net/ipv4/ip_options.c:123
Write of size 69 at addr ffff888096abf068 by task syz-executor.4/9204

CPU: 0 PID: 9204 Comm: syz-executor.4 Not tainted 5.1.0-rc5+ #77
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
 kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
 check_memory_region_inline mm/kasan/generic.c:185 [inline]
 check_memory_region+0x123/0x190 mm/kasan/generic.c:191
 memcpy+0x38/0x50 mm/kasan/common.c:133
 memcpy include/linux/string.h:355 [inline]
 __ip_options_echo+0x294/0x1120 net/ipv4/ip_options.c:123
 __icmp_send+0x725/0x1400 net/ipv4/icmp.c:695
 ipv4_link_failure+0x29f/0x550 net/ipv4/route.c:1204
 dst_link_failure include/net/dst.h:427 [inline]
 vti6_xmit net/ipv6/ip6_vti.c:514 [inline]
 vti6_tnl_xmit+0x10d4/0x1c0c net/ipv6/ip6_vti.c:553
 __netdev_start_xmit include/linux/netdevice.h:4414 [inline]
 netdev_start_xmit include/linux/netdevice.h:4423 [inline]
 xmit_one net/core/dev.c:3292 [inline]
 dev_hard_start_xmit+0x1b2/0x980 net/core/dev.c:3308
 __dev_queue_xmit+0x271d/0x3060 net/core/dev.c:3878
 dev_queue_xmit+0x18/0x20 net/core/dev.c:3911
 neigh_direct_output+0x16/0x20 net/core/neighbour.c:1527
 neigh_output include/net/neighbour.h:508 [inline]
 ip_finish_output2+0x949/0x1740 net/ipv4/ip_output.c:229
 ip_finish_output+0x73c/0xd50 net/ipv4/ip_output.c:317
 NF_HOOK_COND include/linux/netfilter.h:278 [inline]
 ip_output+0x21f/0x670 net/ipv4/ip_output.c:405
 dst_output include/net/dst.h:444 [inline]
 NF_HOOK include/linux/netfilter.h:289 [inline]
 raw_send_hdrinc net/ipv4/raw.c:432 [inline]
 raw_sendmsg+0x1d2b/0x2f20 net/ipv4/raw.c:663
 inet_sendmsg+0x147/0x5d0 net/ipv4/af_inet.c:798
 sock_sendmsg_nosec net/socket.c:651 [inline]
 sock_sendmsg+0xdd/0x130 net/socket.c:661
 sock_write_iter+0x27c/0x3e0 net/socket.c:988
 call_write_iter include/linux/fs.h:1866 [inline]
 new_sync_write+0x4c7/0x760 fs/read_write.c:474
 __vfs_write+0xe4/0x110 fs/read_write.c:487
 vfs_write+0x20c/0x580 fs/read_write.c:549
 ksys_write+0x14f/0x2d0 fs/read_write.c:599
 __do_sys_write fs/read_write.c:611 [inline]
 __se_sys_write fs/read_write.c:608 [inline]
 __x64_sys_write+0x73/0xb0 fs/read_write.c:608
 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x458c29
Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f293b44bc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458c29
RDX: 0000000000000014 RSI: 00000000200002c0 RDI: 0000000000000003
RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f293b44c6d4
R13: 00000000004c8623 R14: 00000000004ded68 R15: 00000000ffffffff

The buggy address belongs to the page:
page:ffffea00025aafc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x1fffc0000000000()
raw: 01fffc0000000000 0000000000000000 ffffffff025a0101 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888096abef80: 00 00 00 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 f2
 ffff888096abf000: f2 f2 f2 f2 00 00 00 00 00 00 00 00 00 00 00 00
>ffff888096abf080: 00 00 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00
                         ^
 ffff888096abf100: 00 00 00 00 f1 f1 f1 f1 00 00 f3 f3 00 00 00 00
 ffff888096abf180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Fixes: ed0de45a1008 ("ipv4: recompile ip options in ipv4_link_failure")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Suryaputra <ssuryaextr@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-02 09:58:59 +02:00
Peter Oskolkov
702ddf862d net: IP defrag: encapsulate rbtree defrag code into callable functions
[ Upstream commit c23f35d19db3b36ffb9e04b08f1d91565d15f84f ]

This is a refactoring patch: without changing runtime behavior,
it moves rbtree-related code from IPv4-specific files/functions
into .h/.c defrag files shared with IPv6 defragmentation code.

v2: make handling of overlapping packets match upstream.

Signed-off-by: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-27 09:36:33 +02:00
Eric Dumazet
7ba5ec69e1 ipv4: ensure rcu_read_lock() in ipv4_link_failure()
[ Upstream commit c543cb4a5f07e09237ec0fc2c60c9f131b2c79ad ]

fib_compute_spec_dst() needs to be called under rcu protection.

syzbot reported :

WARNING: suspicious RCU usage
5.1.0-rc4+ #165 Not tainted
include/linux/inetdevice.h:220 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
1 lock held by swapper/0/0:
 #0: 0000000051b67925 ((&n->timer)){+.-.}, at: lockdep_copy_map include/linux/lockdep.h:170 [inline]
 #0: 0000000051b67925 ((&n->timer)){+.-.}, at: call_timer_fn+0xda/0x720 kernel/time/timer.c:1315

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.1.0-rc4+ #165
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5162
 __in_dev_get_rcu include/linux/inetdevice.h:220 [inline]
 fib_compute_spec_dst+0xbbd/0x1030 net/ipv4/fib_frontend.c:294
 spec_dst_fill net/ipv4/ip_options.c:245 [inline]
 __ip_options_compile+0x15a7/0x1a10 net/ipv4/ip_options.c:343
 ipv4_link_failure+0x172/0x400 net/ipv4/route.c:1195
 dst_link_failure include/net/dst.h:427 [inline]
 arp_error_report+0xd1/0x1c0 net/ipv4/arp.c:297
 neigh_invalidate+0x24b/0x570 net/core/neighbour.c:995
 neigh_timer_handler+0xc35/0xf30 net/core/neighbour.c:1081
 call_timer_fn+0x190/0x720 kernel/time/timer.c:1325
 expire_timers kernel/time/timer.c:1362 [inline]
 __run_timers kernel/time/timer.c:1681 [inline]
 __run_timers kernel/time/timer.c:1649 [inline]
 run_timer_softirq+0x652/0x1700 kernel/time/timer.c:1694
 __do_softirq+0x266/0x95a kernel/softirq.c:293
 invoke_softirq kernel/softirq.c:374 [inline]
 irq_exit+0x180/0x1d0 kernel/softirq.c:414
 exiting_irq arch/x86/include/asm/apic.h:536 [inline]
 smp_apic_timer_interrupt+0x14a/0x570 arch/x86/kernel/apic/apic.c:1062
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:807

Fixes: ed0de45a1008 ("ipv4: recompile ip options in ipv4_link_failure")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-27 09:36:31 +02:00
Stephen Suryaputra
8a430e56a6 ipv4: recompile ip options in ipv4_link_failure
[ Upstream commit ed0de45a1008991fdaa27a0152befcb74d126a8b ]

Recompile IP options since IPCB may not be valid anymore when
ipv4_link_failure is called from arp_error_report.

Refer to the commit 3da1ed7ac398 ("net: avoid use IPCB in cipso_v4_error")
and the commit before that (9ef6b42ad6fd) for a similar issue.

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-27 09:36:31 +02:00
Eric Dumazet
6728c6174a tcp: tcp_grow_window() needs to respect tcp_space()
[ Upstream commit 50ce163a72d817a99e8974222dcf2886d5deb1ae ]

For some reason, tcp_grow_window() correctly tests if enough room
is present before attempting to increase tp->rcv_ssthresh,
but does not prevent it to grow past tcp_space()

This is causing hard to debug issues, like failing
the (__tcp_select_window(sk) >= tp->rcv_wnd) test
in __tcp_ack_snd_check(), causing ACK delays and possibly
slow flows.

Depending on tcp_rmem[2], MTU, skb->len/skb->truesize ratio,
we can see the problem happening on "netperf -t TCP_RR -- -r 2000,2000"
after about 60 round trips, when the active side no longer sends
immediate acks.

This bug predates git history.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-27 09:36:31 +02:00
Lorenzo Bianconi
1cd8788368 net: fou: do not use guehdr after iptunnel_pull_offloads in gue_udp_recv
[ Upstream commit 988dc4a9a3b66be75b30405a5494faf0dc7cffb6 ]

gue tunnels run iptunnel_pull_offloads on received skbs. This can
determine a possible use-after-free accessing guehdr pointer since
the packet will be 'uncloned' running pskb_expand_head if it is a
cloned gso skb (e.g if the packet has been sent though a veth device)

Fixes: a09a4c8dd1 ("tunnels: Remove encapsulation offloads on decap")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-27 09:36:31 +02:00
Lorenzo Bianconi
5c6f2f4c0e net: ip_gre: fix possible use-after-free in erspan_rcv
[ Upstream commit 492b67e28ee5f2a2522fb72e3d3bcb990e461514 ]

erspan tunnels run __iptunnel_pull_header on received skbs to remove
gre and erspan headers. This can determine a possible use-after-free
accessing pkt_md pointer in erspan_rcv since the packet will be 'uncloned'
running pskb_expand_head if it is a cloned gso skb (e.g if the packet has
been sent though a veth device). Fix it resetting pkt_md pointer after
__iptunnel_pull_header

Fixes: 1d7e2ed22f ("net: erspan: refactor existing erspan code")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-17 08:38:43 +02:00
Stephen Suryaputra
0516ef27dd vrf: check accept_source_route on the original netdevice
[ Upstream commit 8c83f2df9c6578ea4c5b940d8238ad8a41b87e9e ]

Configuration check to accept source route IP options should be made on
the incoming netdevice when the skb->dev is an l3mdev master. The route
lookup for the source route next hop also needs the incoming netdev.

v2->v3:
- Simplify by passing the original netdevice down the stack (per David
  Ahern).

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-17 08:38:42 +02:00
Dust Li
7243e35209 tcp: fix a potential NULL pointer dereference in tcp_sk_exit
[ Upstream commit b506bc975f60f06e13e74adb35e708a23dc4e87c ]

 When tcp_sk_init() failed in inet_ctl_sock_create(),
 'net->ipv4.tcp_congestion_control' will be left
 uninitialized, but tcp_sk_exit() hasn't check for
 that.

 This patch add checking on 'net->ipv4.tcp_congestion_control'
 in tcp_sk_exit() to prevent NULL-ptr dereference.

Fixes: 6670e15244 ("tcp: Namespace-ify sysctl_tcp_default_congestion_control")
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-17 08:38:41 +02:00
Koen De Schepper
0e0afb06e1 tcp: Ensure DCTCP reacts to losses
[ Upstream commit aecfde23108b8e637d9f5c5e523b24fb97035dc3 ]

RFC8257 §3.5 explicitly states that "A DCTCP sender MUST react to
loss episodes in the same way as conventional TCP".

Currently, Linux DCTCP performs no cwnd reduction when losses
are encountered. Optionally, the dctcp_clamp_alpha_on_loss resets
alpha to its maximal value if a RTO happens. This behavior
is sub-optimal for at least two reasons: i) it ignores losses
triggering fast retransmissions; and ii) it causes unnecessary large
cwnd reduction in the future if the loss was isolated as it resets
the historical term of DCTCP's alpha EWMA to its maximal value (i.e.,
denoting a total congestion). The second reason has an especially
noticeable effect when using DCTCP in high BDP environments, where
alpha normally stays at low values.

This patch replace the clamping of alpha by setting ssthresh to
half of cwnd for both fast retransmissions and RTOs, at most once
per RTT. Consequently, the dctcp_clamp_alpha_on_loss module parameter
has been removed.

The table below shows experimental results where we measured the
drop probability of a PIE AQM (not applying ECN marks) at a
bottleneck in the presence of a single TCP flow with either the
alpha-clamping option enabled or the cwnd halving proposed by this
patch. Results using reno or cubic are given for comparison.

                          |  Link   |   RTT    |    Drop
                 TCP CC   |  speed  | base+AQM | probability
        ==================|=========|==========|============
                    CUBIC |  40Mbps |  7+20ms  |    0.21%
                     RENO |         |          |    0.19%
        DCTCP-CLAMP-ALPHA |         |          |   25.80%
         DCTCP-HALVE-CWND |         |          |    0.22%
        ------------------|---------|----------|------------
                    CUBIC | 100Mbps |  7+20ms  |    0.03%
                     RENO |         |          |    0.02%
        DCTCP-CLAMP-ALPHA |         |          |   23.30%
         DCTCP-HALVE-CWND |         |          |    0.04%
        ------------------|---------|----------|------------
                    CUBIC | 800Mbps |   1+1ms  |    0.04%
                     RENO |         |          |    0.05%
        DCTCP-CLAMP-ALPHA |         |          |   18.70%
         DCTCP-HALVE-CWND |         |          |    0.06%

We see that, without halving its cwnd for all source of losses,
DCTCP drives the AQM to large drop probabilities in order to keep
the queue length under control (i.e., it repeatedly faces RTOs).
Instead, if DCTCP reacts to all source of losses, it can then be
controlled by the AQM using similar drop levels than cubic or reno.

Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
Cc: Bob Briscoe <research@bobbriscoe.net>
Cc: Lawrence Brakmo <brakmo@fb.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <borkmann@iogearbox.net>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Andrew Shewmaker <agshew@gmail.com>
Cc: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Florian Westphal <fw@strlen.de>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-17 08:38:41 +02:00
Anders Roxell
0d98ecb141 netfilter: ipt_CLUSTERIP: fix warning unused variable cn
commit 206b8cc514d7ff2b79dd2d5ad939adc7c493f07a upstream.

When CONFIG_PROC_FS isn't set the variable cn isn't used.

net/ipv4/netfilter/ipt_CLUSTERIP.c: In function ‘clusterip_net_exit’:
net/ipv4/netfilter/ipt_CLUSTERIP.c:849:24: warning: unused variable ‘cn’ [-Wunused-variable]
  struct clusterip_net *cn = clusterip_pernet(net);
                        ^~

Rework so the variable 'cn' is declared inside "#ifdef CONFIG_PROC_FS".

Fixes: b12f7bad5ad3 ("netfilter: ipt_CLUSTERIP: remove wrong WARN_ON_ONCE in netns exit routine")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23 20:09:57 +01:00
Martin Willi
b92eaed36c esp: Skip TX bytes accounting when sending from a request socket
[ Upstream commit 09db51241118aeb06e1c8cd393b45879ce099b36 ]

On ESP output, sk_wmem_alloc is incremented for the added padding if a
socket is associated to the skb. When replying with TCP SYNACKs over
IPsec, the associated sk is a casted request socket, only. Increasing
sk_wmem_alloc on a request socket results in a write at an arbitrary
struct offset. In the best case, this produces the following WARNING:

WARNING: CPU: 1 PID: 0 at lib/refcount.c:102 esp_output_head+0x2e4/0x308 [esp4]
refcount_t: addition on 0; use-after-free.
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.0.0-rc3 #2
Hardware name: Marvell Armada 380/385 (Device Tree)
[...]
[<bf0ff354>] (esp_output_head [esp4]) from [<bf1006a4>] (esp_output+0xb8/0x180 [esp4])
[<bf1006a4>] (esp_output [esp4]) from [<c05dee64>] (xfrm_output_resume+0x558/0x664)
[<c05dee64>] (xfrm_output_resume) from [<c05d07b0>] (xfrm4_output+0x44/0xc4)
[<c05d07b0>] (xfrm4_output) from [<c05956bc>] (tcp_v4_send_synack+0xa8/0xe8)
[<c05956bc>] (tcp_v4_send_synack) from [<c0586ad8>] (tcp_conn_request+0x7f4/0x948)
[<c0586ad8>] (tcp_conn_request) from [<c058c404>] (tcp_rcv_state_process+0x2a0/0xe64)
[<c058c404>] (tcp_rcv_state_process) from [<c05958ac>] (tcp_v4_do_rcv+0xf0/0x1f4)
[<c05958ac>] (tcp_v4_do_rcv) from [<c0598a4c>] (tcp_v4_rcv+0xdb8/0xe20)
[<c0598a4c>] (tcp_v4_rcv) from [<c056eb74>] (ip_protocol_deliver_rcu+0x2c/0x2dc)
[<c056eb74>] (ip_protocol_deliver_rcu) from [<c056ee6c>] (ip_local_deliver_finish+0x48/0x54)
[<c056ee6c>] (ip_local_deliver_finish) from [<c056eecc>] (ip_local_deliver+0x54/0xec)
[<c056eecc>] (ip_local_deliver) from [<c056efac>] (ip_rcv+0x48/0xb8)
[<c056efac>] (ip_rcv) from [<c0519c2c>] (__netif_receive_skb_one_core+0x50/0x6c)
[...]

The issue triggers only when not using TCP syncookies, as for syncookies
no socket is associated.

Fixes: cac2661c53 ("esp4: Avoid skb_cow_data whenever possible")
Fixes: 03e2a30f6a ("esp6: Avoid skb_cow_data whenever possible")
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23 20:09:48 +01:00
Guillaume Nault
173e9023a0 tcp: handle inet_csk_reqsk_queue_add() failures
[  Upstream commit 9d3e1368bb45893a75a5dfb7cd21fdebfa6b47af ]

Commit 7716682cc5 ("tcp/dccp: fix another race at listener
dismantle") let inet_csk_reqsk_queue_add() fail, and adjusted
{tcp,dccp}_check_req() accordingly. However, TFO and syncookies
weren't modified, thus leaking allocated resources on error.

Contrary to tcp_check_req(), in both syncookies and TFO cases,
we need to drop the request socket. Also, since the child socket is
created with inet_csk_clone_lock(), we have to unlock it and drop an
extra reference (->sk_refcount is initially set to 2 and
inet_csk_reqsk_queue_add() drops only one ref).

For TFO, we also need to revert the work done by tcp_try_fastopen()
(with reqsk_fastopen_remove()).

Fixes: 7716682cc5 ("tcp/dccp: fix another race at listener dismantle")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-19 13:12:39 +01:00
Christoph Paasch
fba43f49fd tcp: Don't access TCP_SKB_CB before initializing it
[ Upstream commit f2feaefdabb0a6253aa020f65e7388f07a9ed47c ]

Since commit eeea10b83a ("tcp: add
tcp_v4_fill_cb()/tcp_v4_restore_cb()"), tcp_vX_fill_cb is only called
after tcp_filter(). That means, TCP_SKB_CB(skb)->end_seq still points to
the IP-part of the cb.

We thus should not mock with it, as this can trigger bugs (thanks
syzkaller):
[   12.349396] ==================================================================
[   12.350188] BUG: KASAN: slab-out-of-bounds in ip6_datagram_recv_specific_ctl+0x19b3/0x1a20
[   12.351035] Read of size 1 at addr ffff88006adbc208 by task test_ip6_datagr/1799

Setting end_seq is actually no more necessary in tcp_filter as it gets
initialized later on in tcp_vX_fill_cb.

Cc: Eric Dumazet <edumazet@google.com>
Fixes: eeea10b83a ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()")
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-19 13:12:39 +01:00
Soheil Hassas Yeganeh
8accd04eb9 tcp: do not report TCP_CM_INQ of 0 for closed connections
[ Upstream commit 6466e715651f9f358e60c5ea4880e4731325827f ]

Returning 0 as inq to userspace indicates there is no more data to
read, and the application needs to wait for EPOLLIN. For a connection
that has received FIN from the remote peer, however, the application
must continue reading until getting EOF (return value of 0
from tcp_recvmsg) or an error, if edge-triggered epoll (EPOLLET) is
being used. Otherwise, the application will never receive a new
EPOLLIN, since there is no epoll edge after the FIN.

Return 1 when there is no data left on the queue but the
connection has received FIN, so that the applications continue
reading.

Fixes: b75eba76d3 (tcp: send in-queue bytes in cmsg upon read)
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-19 13:12:39 +01:00
Xin Long
eaa0962e1e route: set the deleted fnhe fnhe_daddr to 0 in ip_del_fnhe to fix a race
[ Upstream commit ee60ad219f5c7c4fb2f047f88037770063ef785f ]

The race occurs in __mkroute_output() when 2 threads lookup a dst:

  CPU A                 CPU B
  find_exception()
                        find_exception() [fnhe expires]
                        ip_del_fnhe() [fnhe is deleted]
  rt_bind_exception()

In rt_bind_exception() it will bind a deleted fnhe with the new dst, and
this dst will get no chance to be freed. It causes a dev defcnt leak and
consecutive dmesg warnings:

  unregister_netdevice: waiting for ethX to become free. Usage count = 1

Especially thanks Jon to identify the issue.

This patch fixes it by setting fnhe_daddr to 0 in ip_del_fnhe() to stop
binding the deleted fnhe with a new dst when checking fnhe's fnhe_daddr
and daddr in rt_bind_exception().

It works as both ip_del_fnhe() and rt_bind_exception() are protected by
fnhe_lock and the fhne is freed by kfree_rcu().

Fixes: deed49df73 ("route: check and remove route cache when we get route")
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-19 13:12:39 +01:00
Paolo Abeni
7760937dc2 ipv4/route: fail early when inet dev is missing
[ Upstream commit 22c74764aa2943ecdf9f07c900d8a9c8ba6c9265 ]

If a non local multicast packet reaches ip_route_input_rcu() while
the ingress device IPv4 private data (in_dev) is NULL, we end up
doing a NULL pointer dereference in IN_DEV_MFORWARD().

Since the later call to ip_route_input_mc() is going to fail if
!in_dev, we can fail early in such scenario and avoid the dangerous
code path.

v1 -> v2:
 - clarified the commit message, no code changes

Reported-by: Tianhao Zhao <tizhao@redhat.com>
Fixes: e58e415968 ("net: Enable support for VRF with ipv4 multicast")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-19 13:12:38 +01:00
Su Yanjun
8ce41db0dc vti4: Fix a ipip packet processing bug in 'IPCOMP' virtual tunnel
[ Upstream commit dd9ee3444014e8f28c0eefc9fffc9ac9c5248c12 ]

Recently we run a network test over ipcomp virtual tunnel.We find that
if a ipv4 packet needs fragment, then the peer can't receive
it.

We deep into the code and find that when packet need fragment the smaller
fragment will be encapsulated by ipip not ipcomp. So when the ipip packet
goes into xfrm, it's skb->dev is not properly set. The ipv4 reassembly code
always set skb'dev to the last fragment's dev. After ipv4 defrag processing,
when the kernel rp_filter parameter is set, the skb will be drop by -EXDEV
error.

This patch adds compatible support for the ipip process in ipcomp virtual tunnel.

Signed-off-by: Su Yanjun <suyj.fnst@cn.fujitsu.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13 14:02:26 -07:00
David Ahern
cc211561d1 ipv4: Pass original device to ip_rcv_finish_core
[ Upstream commit a1fd1ad2552fad9e649eeb85fd79301e2880a886 ]

ip_route_input_rcu expects the original ingress device (e.g., for
proper multicast handling). The skb->dev can be changed by l3mdev_ip_rcv,
so dev needs to be saved prior to calling it. This was the behavior prior
to the listify changes.

Fixes: 5fa12739a5 ("net: ipv4: listify ip_rcv_finish")
Cc: Edward Cree <ecree@solarflare.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-10 07:17:19 +01:00
David Ahern
8c0aa3f690 ipv4: Return error for RTA_VIA attribute
[ Upstream commit b6e9e5df4ecf100f6a10ab2ade8e46d47a4b9779 ]

IPv4 currently does not support nexthops outside of the AF_INET family.
Specifically, it does not handle RTA_VIA attribute. If it is passed
in a route add request, the actual route added only uses the device
which is clearly not what the user intended:

  $ ip ro add 172.16.1.0/24 via inet6 2001:db8:1::1 dev eth0
  $ ip ro ls
  ...
  172.16.1.0/24 dev eth0

Catch this and fail the route add:
  $ ip ro add 172.16.1.0/24 via inet6 2001:db8:1::1 dev eth0
  Error: IPv4 does not support RTA_VIA attribute.

Fixes: 03c0566542 ("mpls: Netlink commands to add, remove, and dump routes")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-10 07:17:19 +01:00
Nazarov Sergey
125bc1e67e net: avoid use IPCB in cipso_v4_error
[ Upstream commit 3da1ed7ac398f34fff1694017a07054d69c5f5c5 ]

Extract IP options in cipso_v4_error and use __icmp_send.

Signed-off-by: Sergey Nazarov <s-nazarov@yandex.ru>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-10 07:17:19 +01:00
Nazarov Sergey
f2397468fb net: Add __icmp_send helper.
[ Upstream commit 9ef6b42ad6fd7929dd1b6092cb02014e382c6a91 ]

Add __icmp_send function having ip_options struct parameter

Signed-off-by: Sergey Nazarov <s-nazarov@yandex.ru>
Reviewed-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-10 07:17:19 +01:00
Paul Moore
e3713abc42 netlabel: fix out-of-bounds memory accesses
[ Upstream commit 5578de4834fe0f2a34fedc7374be691443396d1f ]

There are two array out-of-bounds memory accesses, one in
cipso_v4_map_lvl_valid(), the other in netlbl_bitmap_walk().  Both
errors are embarassingly simple, and the fixes are straightforward.

As a FYI for anyone backporting this patch to kernels prior to v4.8,
you'll want to apply the netlbl_bitmap_walk() patch to
cipso_v4_bitmap_walk() as netlbl_bitmap_walk() doesn't exist before
Linux v4.8.

Reported-by: Jann Horn <jannh@google.com>
Fixes: 446fda4f26 ("[NetLabel]: CIPSOv4 engine")
Fixes: 3faa8f982f ("netlabel: Move bitmap manipulation functions to the NetLabel core.")
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-10 07:17:18 +01:00
Hangbin Liu
99ed945821 ipv4: Add ICMPv6 support when parse route ipproto
[ Upstream commit 5e1a99eae84999a2536f50a0beaf5d5262337f40 ]

For ip rules, we need to use 'ipproto ipv6-icmp' to match ICMPv6 headers.
But for ip -6 route, currently we only support tcp, udp and icmp.

Add ICMPv6 support so we can match ipv6-icmp rules for route lookup.

v2: As David Ahern and Sabrina Dubroca suggested, Add an argument to
rtm_getroute_parse_ip_proto() to handle ICMP/ICMPv6 with different family.

Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: eacb9384a3 ("ipv6: support sport, dport and ip_proto in RTM_GETROUTE")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-10 07:17:17 +01:00
Taehee Yoo
6546e1150c netfilter: ipt_CLUSTERIP: fix sleep-in-atomic bug in clusterip_config_entry_put()
commit 2a61d8b883bbad26b06d2e6cc3777a697e78830d upstream.

A proc_remove() can sleep. so that it can't be inside of spin_lock.
Hence proc_remove() is moved to outside of spin_lock. and it also
adds mutex to sync create and remove of proc entry(config->pde).

test commands:
SHELL#1
   %while :; do iptables -A INPUT -p udp -i enp2s0 -d 192.168.1.100 \
	   --dport 9000  -j CLUSTERIP --new --hashmode sourceip \
	   --clustermac 01:00:5e:00:00:21 --total-nodes 3 --local-node 3; \
	   iptables -F; done

SHELL#2
   %while :; do echo +1 > /proc/net/ipt_CLUSTERIP/192.168.1.100; \
	   echo -1 > /proc/net/ipt_CLUSTERIP/192.168.1.100; done

[ 2949.569864] BUG: sleeping function called from invalid context at kernel/sched/completion.c:99
[ 2949.579944] in_atomic(): 1, irqs_disabled(): 0, pid: 5472, name: iptables
[ 2949.587920] 1 lock held by iptables/5472:
[ 2949.592711]  #0: 000000008f0ebcf2 (&(&cn->lock)->rlock){+...}, at: refcount_dec_and_lock+0x24/0x50
[ 2949.603307] CPU: 1 PID: 5472 Comm: iptables Tainted: G        W         4.19.0-rc5+ #16
[ 2949.604212] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
[ 2949.604212] Call Trace:
[ 2949.604212]  dump_stack+0xc9/0x16b
[ 2949.604212]  ? show_regs_print_info+0x5/0x5
[ 2949.604212]  ___might_sleep+0x2eb/0x420
[ 2949.604212]  ? set_rq_offline.part.87+0x140/0x140
[ 2949.604212]  ? _rcu_barrier_trace+0x400/0x400
[ 2949.604212]  wait_for_completion+0x94/0x710
[ 2949.604212]  ? wait_for_completion_interruptible+0x780/0x780
[ 2949.604212]  ? __kernel_text_address+0xe/0x30
[ 2949.604212]  ? __lockdep_init_map+0x10e/0x5c0
[ 2949.604212]  ? __lockdep_init_map+0x10e/0x5c0
[ 2949.604212]  ? __init_waitqueue_head+0x86/0x130
[ 2949.604212]  ? init_wait_entry+0x1a0/0x1a0
[ 2949.604212]  proc_entry_rundown+0x208/0x270
[ 2949.604212]  ? proc_reg_get_unmapped_area+0x370/0x370
[ 2949.604212]  ? __lock_acquire+0x4500/0x4500
[ 2949.604212]  ? complete+0x18/0x70
[ 2949.604212]  remove_proc_subtree+0x143/0x2a0
[ 2949.708655]  ? remove_proc_entry+0x390/0x390
[ 2949.708655]  clusterip_tg_destroy+0x27a/0x630 [ipt_CLUSTERIP]
[ ... ]

Fixes: b3e456fce9 ("netfilter: ipt_CLUSTERIP: fix a race condition of proc file creation")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-27 10:09:03 +01:00
Konstantin Khlebnikov
589503cb24 inet_diag: fix reporting cgroup classid and fallback to priority
[ Upstream commit 1ec17dbd90f8b638f41ee650558609c1af63dfa0 ]

Field idiag_ext in struct inet_diag_req_v2 used as bitmap of requested
extensions has only 8 bits. Thus extensions starting from DCTCPINFO
cannot be requested directly. Some of them included into response
unconditionally or hook into some of lower 8 bits.

Extension INET_DIAG_CLASS_ID has not way to request from the beginning.

This patch bundle it with INET_DIAG_TCLASS (ipv6 tos), fixes space
reservation, and documents behavior for other extensions.

Also this patch adds fallback to reporting socket priority. This filed
is more widely used for traffic classification because ipv4 sockets
automatically maps TOS to priority and default qdisc pfifo_fast knows
about that. But priority could be changed via setsockopt SO_PRIORITY so
INET_DIAG_TOS isn't enough for predicting class.

Also cgroup2 obsoletes net_cls classid (it always zero), but we cannot
reuse this field for reporting cgroup2 id because it is 64-bit (ino+gen).

So, after this patch INET_DIAG_CLASS_ID will report socket priority
for most common setup when net_cls isn't set and/or cgroup2 in use.

Fixes: 0888e372c3 ("net: inet: diag: expose sockets cgroup classid")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-27 10:08:58 +01:00
Jann Horn
6a3f723787 netfilter: nf_nat_snmp_basic: add missing length checks in ASN.1 cbs
commit c4c07b4d6fa1f11880eab8e076d3d060ef3f55fc upstream.

The generic ASN.1 decoder infrastructure doesn't guarantee that callbacks
will get as much data as they expect; callbacks have to check the `datalen`
parameter before looking at `data`. Make sure that snmp_version() and
snmp_helper() don't read/write beyond the end of the packet data.

(Also move the assignment to `pdata` down below the check to make it clear
that it isn't necessarily a pointer we can use before the `datalen` check.)

Fixes: cc2d58634e ("netfilter: nf_nat_snmp_basic: use asn1 decoder library")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-23 09:07:27 +01:00
Eric Dumazet
1bdb1675f3 tcp: tcp_v4_err() should be more careful
[ Upstream commit 2c4cc9712364c051b1de2d175d5fbea6be948ebf ]

ICMP handlers are not very often stressed, we should
make them more resilient to bugs that might surface in
the future.

If there is no packet in retransmit queue, we should
avoid a NULL deref.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: soukjin bae <soukjin.bae@samsung.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-23 09:07:26 +01:00
Eric Dumazet
df1b583c16 tcp: clear icsk_backoff in tcp_write_queue_purge()
[ Upstream commit 04c03114be82194d4a4858d41dba8e286ad1787c ]

soukjin bae reported a crash in tcp_v4_err() handling
ICMP_DEST_UNREACH after tcp_write_queue_head(sk)
returned a NULL pointer.

Current logic should have prevented this :

  if (seq != tp->snd_una  || !icsk->icsk_retransmits ||
      !icsk->icsk_backoff || fastopen)
      break;

Problem is the write queue might have been purged
and icsk_backoff has not been cleared.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: soukjin bae <soukjin.bae@samsung.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-23 09:07:26 +01:00
Lorenzo Bianconi
1764111c99 net: ipv4: use a dedicated counter for icmp_v4 redirect packets
[ Upstream commit c09551c6ff7fe16a79a42133bcecba5fc2fc3291 ]

According to the algorithm described in the comment block at the
beginning of ip_rt_send_redirect, the host should try to send
'ip_rt_redirect_number' ICMP redirect packets with an exponential
backoff and then stop sending them at all assuming that the destination
ignores redirects.
If the device has previously sent some ICMP error packets that are
rate-limited (e.g TTL expired) and continues to receive traffic,
the redirect packets will never be transmitted. This happens since
peer->rate_tokens will be typically greater than 'ip_rt_redirect_number'
and so it will never be reset even if the redirect silence timeout
(ip_rt_redirect_silence) has elapsed without receiving any packet
requiring redirects.

Fix it by using a dedicated counter for the number of ICMP redirect
packets that has been sent by the host

I have not been able to identify a given commit that introduced the
issue since ip_rt_send_redirect implements the same rate-limiting
algorithm from commit 1da177e4c3 ("Linux-2.6.12-rc2")

Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-23 09:07:24 +01:00
Lorenzo Bianconi
0a198e0bb8 net: ip_gre: use erspan key field for tunnel lookup
[ Upstream commit cb73ee40b1b381eaf3749e6dbeed567bb38e5258 ]

Use ERSPAN key header field as tunnel key in gre_parse_header routine
since ERSPAN protocol sets the key field of the external GRE header to
0 resulting in a tunnel lookup fail in ip6gre_err.
In addition remove key field parsing and pskb_may_pull check in
erspan_rcv and ip6erspan_rcv

Fixes: 5a963eb61b ("ip6_gre: Add ERSPAN native tunnel support")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-06 17:30:06 +01:00
Lorenzo Bianconi
897ea28bd2 net: ip_gre: always reports o_key to userspace
[ Upstream commit feaf5c796b3f0240f10d0d6d0b686715fd58a05b ]

Erspan protocol (version 1 and 2) relies on o_key to configure
session id header field. However TUNNEL_KEY bit is cleared in
erspan_xmit since ERSPAN protocol does not set the key field
of the external GRE header and so the configured o_key is not reported
to userspace. The issue can be triggered with the following reproducer:

$ip link add erspan1 type erspan local 192.168.0.1 remote 192.168.0.2 \
    key 1 seq erspan_ver 1
$ip link set erspan1 up
$ip -d link sh erspan1

erspan1@NONE: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UNKNOWN mode DEFAULT
  link/ether 52:aa:99:95:9a:b5 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 1500
  erspan remote 192.168.0.2 local 192.168.0.1 ttl inherit ikey 0.0.0.1 iseq oseq erspan_index 0

Fix the issue adding TUNNEL_KEY bit to the o_flags parameter in
ipgre_fill_info

Fixes: 84e54fe0a5 ("gre: introduce native tunnel support for ERSPAN")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-06 17:30:06 +01:00
Greg Kroah-Hartman
8c763a3cf5 Fix "net: ipv4: do not handle duplicate fragments as overlapping"
ade446403bfb ("net: ipv4: do not handle duplicate fragments as
overlapping") was backported to many stable trees, but it had a problem
that was "accidentally" fixed by the upstream commit 0ff89efb5246 ("ip:
fail fast on IP defrag errors")

This is the fixup for that problem as we do not want the larger patch in
the older stable trees.

Fixes: ade446403bfb ("net: ipv4: do not handle duplicate fragments as overlapping")
Reported-by: Ivan Babrou <ivan@cloudflare.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-06 17:30:05 +01:00
Willem de Bruijn
2cade15d58 tcp: allow MSG_ZEROCOPY transmission also in CLOSE_WAIT state
[ Upstream commit 13d7f46386e060df31b727c9975e38306fa51e7a ]

TCP transmission with MSG_ZEROCOPY fails if the peer closes its end of
the connection and so transitions this socket to CLOSE_WAIT state.

Transmission in close wait state is acceptable. Other similar tests in
the stack (e.g., in FastOpen) accept both states. Relax this test, too.

Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg276886.html
Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg227390.html
Fixes: f214f915e7 ("tcp: enable MSG_ZEROCOPY")
Reported-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
CC: Yuchung Cheng <ycheng@google.com>
CC: Neal Cardwell <ncardwell@google.com>
CC: Soheil Hassas Yeganeh <soheil@google.com>
CC: Alexey Kodanev <alexey.kodanev@oracle.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-31 08:14:33 +01:00
Xin Long
552cd931b4 erspan: build the header with the right proto according to erspan_ver
[ Upstream commit 20704bd1633dd5afb29a321d3a615c9c8e9c9d05 ]

As said in draft-foschiano-erspan-03#section4:

   Different frame variants known as "ERSPAN Types" can be
   distinguished based on the GRE "Protocol Type" field value: Type I
   and II's value is 0x88BE while Type III's is 0x22EB [ETYPES].

So set it properly in erspan_xmit() according to erspan_ver. While at
it, also remove the unused parameter 'proto' in erspan_fb_xmit().

Fixes: 94d7d8f292 ("ip6_gre: add erspan v2 support")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-31 08:14:33 +01:00
Willem de Bruijn
e3fa624ee7 udp: with udp_segment release on error path
[ Upstream commit 0f149c9fec3cd720628ecde83bfc6f64c1e7dcb6 ]

Failure __ip_append_data triggers udp_flush_pending_frames, but these
tests happen later. The skb must be freed directly.

Fixes: bec1f6f697 ("udp: generate gso with UDP_SEGMENT")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-31 08:14:33 +01:00
Ido Schimmel
adbf7e5809 net: ipv4: Fix memory leak in network namespace dismantle
[ Upstream commit f97f4dd8b3bb9d0993d2491e0f22024c68109184 ]

IPv4 routing tables are flushed in two cases:

1. In response to events in the netdev and inetaddr notification chains
2. When a network namespace is being dismantled

In both cases only routes associated with a dead nexthop group are
flushed. However, a nexthop group will only be marked as dead in case it
is populated with actual nexthops using a nexthop device. This is not
the case when the route in question is an error route (e.g.,
'blackhole', 'unreachable').

Therefore, when a network namespace is being dismantled such routes are
not flushed and leaked [1].

To reproduce:
# ip netns add blue
# ip -n blue route add unreachable 192.0.2.0/24
# ip netns del blue

Fix this by not skipping error routes that are not marked with
RTNH_F_DEAD when flushing the routing tables.

To prevent the flushing of such routes in case #1, add a parameter to
fib_table_flush() that indicates if the table is flushed as part of
namespace dismantle or not.

Note that this problem does not exist in IPv6 since error routes are
associated with the loopback device.

[1]
unreferenced object 0xffff888066650338 (size 56):
  comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 b0 1c 62 61 80 88 ff ff  ..........ba....
    e8 8b a1 64 80 88 ff ff 00 07 00 08 fe 00 00 00  ...d............
  backtrace:
    [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
    [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
    [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
    [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
    [<0000000014f62875>] netlink_sendmsg+0x929/0xe10
    [<00000000bac9d967>] sock_sendmsg+0xc8/0x110
    [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
    [<000000002e94f880>] __sys_sendmsg+0xf7/0x250
    [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
    [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000003a8b605b>] 0xffffffffffffffff
unreferenced object 0xffff888061621c88 (size 48):
  comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
  hex dump (first 32 bytes):
    6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
    6b 6b 6b 6b 6b 6b 6b 6b d8 8e 26 5f 80 88 ff ff  kkkkkkkk..&_....
  backtrace:
    [<00000000733609e3>] fib_table_insert+0x978/0x1500
    [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
    [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
    [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
    [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
    [<0000000014f62875>] netlink_sendmsg+0x929/0xe10
    [<00000000bac9d967>] sock_sendmsg+0xc8/0x110
    [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
    [<000000002e94f880>] __sys_sendmsg+0xf7/0x250
    [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
    [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000003a8b605b>] 0xffffffffffffffff

Fixes: 8cced9eff1 ("[NETNS]: Enable routing configuration in non-initial namespace.")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-31 08:14:32 +01:00
Ross Lagerwall
40f2f08030 net: Fix usage of pskb_trim_rcsum
[ Upstream commit 6c57f0458022298e4da1729c67bd33ce41c14e7a ]

In certain cases, pskb_trim_rcsum() may change skb pointers.
Reinitialize header pointers afterwards to avoid potential
use-after-frees. Add a note in the documentation of
pskb_trim_rcsum(). Found by KASAN.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-31 08:14:31 +01:00
Taehee Yoo
9d51378a68 netfilter: ipt_CLUSTERIP: fix deadlock in netns exit routine
[ Upstream commit 5a86d68bcf02f2d1e9a5897dd482079fd5f75e7f ]

When network namespace is destroyed, cleanup_net() is called.
cleanup_net() holds pernet_ops_rwsem then calls each ->exit callback.
So that clusterip_tg_destroy() is called by cleanup_net().
And clusterip_tg_destroy() calls unregister_netdevice_notifier().

But both cleanup_net() and clusterip_tg_destroy() hold same
lock(pernet_ops_rwsem). hence deadlock occurrs.

After this patch, only 1 notifier is registered when module is inserted.
And all of configs are added to per-net list.

test commands:
   %ip netns add vm1
   %ip netns exec vm1 bash
   %ip link set lo up
   %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
	-j CLUSTERIP --new --hashmode sourceip \
	--clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
   %exit
   %ip netns del vm1

splat looks like:
[  341.809674] ============================================
[  341.809674] WARNING: possible recursive locking detected
[  341.809674] 4.19.0-rc5+ #16 Tainted: G        W
[  341.809674] --------------------------------------------
[  341.809674] kworker/u4:2/87 is trying to acquire lock:
[  341.809674] 000000005da2d519 (pernet_ops_rwsem){++++}, at: unregister_netdevice_notifier+0x8c/0x460
[  341.809674]
[  341.809674] but task is already holding lock:
[  341.809674] 000000005da2d519 (pernet_ops_rwsem){++++}, at: cleanup_net+0x119/0x900
[  341.809674]
[  341.809674] other info that might help us debug this:
[  341.809674]  Possible unsafe locking scenario:
[  341.809674]
[  341.809674]        CPU0
[  341.809674]        ----
[  341.809674]   lock(pernet_ops_rwsem);
[  341.809674]   lock(pernet_ops_rwsem);
[  341.809674]
[  341.809674]  *** DEADLOCK ***
[  341.809674]
[  341.809674]  May be due to missing lock nesting notation
[  341.809674]
[  341.809674] 3 locks held by kworker/u4:2/87:
[  341.809674]  #0: 00000000d9df6c92 ((wq_completion)"%s""netns"){+.+.}, at: process_one_work+0xafe/0x1de0
[  341.809674]  #1: 00000000c2cbcee2 (net_cleanup_work){+.+.}, at: process_one_work+0xb60/0x1de0
[  341.809674]  #2: 000000005da2d519 (pernet_ops_rwsem){++++}, at: cleanup_net+0x119/0x900
[  341.809674]
[  341.809674] stack backtrace:
[  341.809674] CPU: 1 PID: 87 Comm: kworker/u4:2 Tainted: G        W         4.19.0-rc5+ #16
[  341.809674] Workqueue: netns cleanup_net
[  341.809674] Call Trace:
[ ... ]
[  342.070196]  down_write+0x93/0x160
[  342.070196]  ? unregister_netdevice_notifier+0x8c/0x460
[  342.070196]  ? down_read+0x1e0/0x1e0
[  342.070196]  ? sched_clock_cpu+0x126/0x170
[  342.070196]  ? find_held_lock+0x39/0x1c0
[  342.070196]  unregister_netdevice_notifier+0x8c/0x460
[  342.070196]  ? register_netdevice_notifier+0x790/0x790
[  342.070196]  ? __local_bh_enable_ip+0xe9/0x1b0
[  342.070196]  ? __local_bh_enable_ip+0xe9/0x1b0
[  342.070196]  ? clusterip_tg_destroy+0x372/0x650 [ipt_CLUSTERIP]
[  342.070196]  ? trace_hardirqs_on+0x93/0x210
[  342.070196]  ? __bpf_trace_preemptirq_template+0x10/0x10
[  342.070196]  ? clusterip_tg_destroy+0x372/0x650 [ipt_CLUSTERIP]
[  342.123094]  clusterip_tg_destroy+0x3ad/0x650 [ipt_CLUSTERIP]
[  342.123094]  ? clusterip_net_init+0x3d0/0x3d0 [ipt_CLUSTERIP]
[  342.123094]  ? cleanup_match+0x17d/0x200 [ip_tables]
[  342.123094]  ? xt_unregister_table+0x215/0x300 [x_tables]
[  342.123094]  ? kfree+0xe2/0x2a0
[  342.123094]  cleanup_entry+0x1d5/0x2f0 [ip_tables]
[  342.123094]  ? cleanup_match+0x200/0x200 [ip_tables]
[  342.123094]  __ipt_unregister_table+0x9b/0x1a0 [ip_tables]
[  342.123094]  iptable_filter_net_exit+0x43/0x80 [iptable_filter]
[  342.123094]  ops_exit_list.isra.10+0x94/0x140
[  342.123094]  cleanup_net+0x45b/0x900
[ ... ]

Fixes: 202f59afd4 ("netfilter: ipt_CLUSTERIP: do not hold dev")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-26 09:32:40 +01:00
Taehee Yoo
bb7b6c49cc netfilter: ipt_CLUSTERIP: remove wrong WARN_ON_ONCE in netns exit routine
[ Upstream commit b12f7bad5ad3724d19754390a3e80928525c0769 ]

When network namespace is destroyed, both clusterip_tg_destroy() and
clusterip_net_exit() are called. and clusterip_net_exit() is called
before clusterip_tg_destroy().
Hence cleanup check code in clusterip_net_exit() doesn't make sense.

test commands:
   %ip netns add vm1
   %ip netns exec vm1 bash
   %ip link set lo up
   %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
	-j CLUSTERIP --new --hashmode sourceip \
	--clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
   %exit
   %ip netns del vm1

splat looks like:
[  341.184508] WARNING: CPU: 1 PID: 87 at net/ipv4/netfilter/ipt_CLUSTERIP.c:840 clusterip_net_exit+0x319/0x380 [ipt_CLUSTERIP]
[  341.184850] Modules linked in: ipt_CLUSTERIP nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_tcpudp iptable_filter bpfilter ip_tables x_tables
[  341.184850] CPU: 1 PID: 87 Comm: kworker/u4:2 Not tainted 4.19.0-rc5+ #16
[  341.227509] Workqueue: netns cleanup_net
[  341.227509] RIP: 0010:clusterip_net_exit+0x319/0x380 [ipt_CLUSTERIP]
[  341.227509] Code: 0f 85 7f fe ff ff 48 c7 c2 80 64 2c c0 be a8 02 00 00 48 c7 c7 a0 63 2c c0 c6 05 18 6e 00 00 01 e8 bc 38 ff f5 e9 5b fe ff ff <0f> 0b e9 33 ff ff ff e8 4b 90 50 f6 e9 2d fe ff ff 48 89 df e8 de
[  341.227509] RSP: 0018:ffff88011086f408 EFLAGS: 00010202
[  341.227509] RAX: dffffc0000000000 RBX: 1ffff1002210de85 RCX: 0000000000000000
[  341.227509] RDX: 1ffff1002210de85 RSI: ffff880110813be8 RDI: ffffed002210de58
[  341.227509] RBP: ffff88011086f4d0 R08: 0000000000000000 R09: 0000000000000000
[  341.227509] R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff1002210de81
[  341.227509] R13: ffff880110625a48 R14: ffff880114cec8c8 R15: 0000000000000014
[  341.227509] FS:  0000000000000000(0000) GS:ffff880116600000(0000) knlGS:0000000000000000
[  341.227509] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  341.227509] CR2: 00007f11fd38e000 CR3: 000000013ca16000 CR4: 00000000001006e0
[  341.227509] Call Trace:
[  341.227509]  ? __clusterip_config_find+0x460/0x460 [ipt_CLUSTERIP]
[  341.227509]  ? default_device_exit+0x1ca/0x270
[  341.227509]  ? remove_proc_entry+0x1cd/0x390
[  341.227509]  ? dev_change_net_namespace+0xd00/0xd00
[  341.227509]  ? __init_waitqueue_head+0x130/0x130
[  341.227509]  ops_exit_list.isra.10+0x94/0x140
[  341.227509]  cleanup_net+0x45b/0x900
[ ... ]

Fixes: 613d0776d3 ("netfilter: exit_net cleanup check added")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-26 09:32:40 +01:00
Taehee Yoo
744383c88e netfilter: ipt_CLUSTERIP: check MAC address when duplicate config is set
[ Upstream commit 06aa151ad1fc74a49b45336672515774a678d78d ]

If same destination IP address config is already existing, that config is
just used. MAC address also should be same.
However, there is no MAC address checking routine.
So that MAC address checking routine is added.

test commands:
   %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
	   -j CLUSTERIP --new --hashmode sourceip \
	   --clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
   %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
	   -j CLUSTERIP --new --hashmode sourceip \
	   --clustermac 01:00:5e:00:00:21 --total-nodes 2 --local-node 1

After this patch, above commands are disallowed.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-26 09:32:40 +01:00
Willem de Bruijn
eb02c17fcc ip: on queued skb use skb_header_pointer instead of pskb_may_pull
[ Upstream commit 4a06fa67c4da20148803525151845276cdb995c1 ]

Commit 2efd4fca70 ("ip: in cmsg IP(V6)_ORIGDSTADDR call
pskb_may_pull") avoided a read beyond the end of the skb linear
segment by calling pskb_may_pull.

That function can trigger a BUG_ON in pskb_expand_head if the skb is
shared, which it is when when peeking. It can also return ENOMEM.

Avoid both by switching to safer skb_header_pointer.

Fixes: 2efd4fca70 ("ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull")
Reported-by: syzbot <syzkaller@googlegroups.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-22 21:40:31 +01:00
Yuchung Cheng
d7fe54c17a tcp: change txhash on SYN-data timeout
[ Upstream commit c5715b8fabfca0ef85903f8bad2189940ed41cc8 ]

Previously upon SYN timeouts the sender recomputes the txhash to
try a different path. However this does not apply on the initial
timeout of SYN-data (active Fast Open). Therefore an active IPv6
Fast Open connection may incur one second RTO penalty to take on
a new path after the second SYN retransmission uses a new flow label.

This patch removes this undesirable behavior so Fast Open changes
the flow label just like the regular connections. This also helps
avoid falsely disabling Fast Open on the sender which triggers
after two consecutive SYN timeouts on Fast Open.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-22 21:40:30 +01:00
Eric Dumazet
e521703487 tcp: fix a race in inet_diag_dump_icsk()
[ Upstream commit f0c928d878e7d01b613c9ae5c971a6b1e473a938 ]

Alexei reported use after frees in inet_diag_dump_icsk() [1]

Because we use refcount_set() when various sockets are setup and
inserted into ehash, we also need to make sure inet_diag_dump_icsk()
wont race with the refcount_set() operations.

Jonathan Lemon sent a patch changing net_twsk_hashdance() but
other spots would need risky changes.

Instead, fix inet_diag_dump_icsk() as this bug came with
linux-4.10 only.

[1] Quoting Alexei :

First something iterating over sockets finds already freed tw socket:

refcount_t: increment on 0; use-after-free.
WARNING: CPU: 2 PID: 2738 at lib/refcount.c:153 refcount_inc+0x26/0x30
RIP: 0010:refcount_inc+0x26/0x30
RSP: 0018:ffffc90004c8fbc0 EFLAGS: 00010282
RAX: 000000000000002b RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff88085ee9d680 RSI: ffff88085ee954c8 RDI: ffff88085ee954c8
RBP: ffff88010ecbd2c0 R08: 0000000000000000 R09: 000000000000174c
R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: 0000000000000000
R13: ffff8806ba9bf210 R14: ffffffff82304600 R15: ffff88010ecbd328
FS:  00007f81f5a7d700(0000) GS:ffff88085ee80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f81e2a95000 CR3: 000000069b2eb006 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 inet_diag_dump_icsk+0x2b3/0x4e0 [inet_diag]  // sock_hold(sk); in net/ipv4/inet_diag.c:1002
 ? kmalloc_large_node+0x37/0x70
 ? __kmalloc_node_track_caller+0x1cb/0x260
 ? __alloc_skb+0x72/0x1b0
 ? __kmalloc_reserve.isra.40+0x2e/0x80
 __inet_diag_dump+0x3b/0x80 [inet_diag]
 netlink_dump+0x116/0x2a0
 netlink_recvmsg+0x205/0x3c0
 sock_read_iter+0x89/0xd0
 __vfs_read+0xf7/0x140
 vfs_read+0x8a/0x140
 SyS_read+0x3f/0xa0
 do_syscall_64+0x5a/0x100

then a minute later twsk timer fires and hits two bad refcnts
for this freed socket:

refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 31 PID: 0 at lib/refcount.c:228 refcount_dec+0x2e/0x40
Modules linked in:
RIP: 0010:refcount_dec+0x2e/0x40
RSP: 0018:ffff88085f5c3ea8 EFLAGS: 00010296
RAX: 000000000000002c RBX: ffff88010ecbd2c0 RCX: 000000000000083f
RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f
RBP: ffffc90003c77280 R08: 0000000000000000 R09: 00000000000017d3
R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: ffffffff82ad2d80
R13: ffffffff8182de00 R14: ffff88085f5c3ef8 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fbe42685250 CR3: 0000000002209001 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>
 inet_twsk_kill+0x9d/0xc0  // inet_twsk_bind_unhash(tw, hashinfo);
 call_timer_fn+0x29/0x110
 run_timer_softirq+0x36b/0x3a0

refcount_t: underflow; use-after-free.
WARNING: CPU: 31 PID: 0 at lib/refcount.c:187 refcount_sub_and_test+0x46/0x50
RIP: 0010:refcount_sub_and_test+0x46/0x50
RSP: 0018:ffff88085f5c3eb8 EFLAGS: 00010296
RAX: 0000000000000026 RBX: ffff88010ecbd2c0 RCX: 000000000000083f
RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f
RBP: ffff88010ecbd358 R08: 0000000000000000 R09: 000000000000185b
R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: ffff88010ecbd358
R13: ffffffff8182de00 R14: ffff88085f5c3ef8 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fbe42685250 CR3: 0000000002209001 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>
 inet_twsk_put+0x12/0x20  // inet_twsk_put(tw);
 call_timer_fn+0x29/0x110
 run_timer_softirq+0x36b/0x3a0

Fixes: 67db3e4bfb ("tcp: no longer hold ehash lock while calling tcp_get_info()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexei Starovoitov <ast@kernel.org>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09 17:38:33 +01:00
Michal Kubecek
d5f9565c8d net: ipv4: do not handle duplicate fragments as overlapping
[ Upstream commit ade446403bfb79d3528d56071a84b15351a139ad ]

Since commit 7969e5c40d ("ip: discard IPv4 datagrams with overlapping
segments.") IPv4 reassembly code drops the whole queue whenever an
overlapping fragment is received. However, the test is written in a way
which detects duplicate fragments as overlapping so that in environments
with many duplicate packets, fragmented packets may be undeliverable.

Add an extra test and for (potentially) duplicate fragment, only drop the
new fragment rather than the whole queue. Only starting offset and length
are checked, not the contents of the fragments as that would be too
expensive. For similar reason, linear list ("run") of a rbtree node is not
iterated, we only check if the new fragment is a subset of the interval
covered by existing consecutive fragments.

v2: instead of an exact check iterating through linear list of an rbtree
node, only check if the new fragment is subset of the "run" (suggested
by Eric Dumazet)

Fixes: 7969e5c40d ("ip: discard IPv4 datagrams with overlapping segments.")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09 17:38:32 +01:00
Eric Dumazet
281731c817 net: clear skb->tstamp in forwarding paths
[ Upstream commit 8203e2d844d34af247a151d8ebd68553a6e91785 ]

Sergey reported that forwarding was no longer working
if fq packet scheduler was used.

This is caused by the recent switch to EDT model, since incoming
packets might have been timestamped by __net_timestamp()

__net_timestamp() uses ktime_get_real(), while fq expects packets
using CLOCK_MONOTONIC base.

The fix is to clear skb->tstamp in forwarding paths.

Fixes: 80b14dee2b ("net: Add a new socket option for a future transmit time.")
Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Sergey Matyukevich <geomatsi@gmail.com>
Tested-by: Sergey Matyukevich <geomatsi@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09 17:38:31 +01:00
Willem de Bruijn
cde81154f8 ip: validate header length on virtual device xmit
[ Upstream commit cb9f1b783850b14cbd7f87d061d784a666dfba1f ]

KMSAN detected read beyond end of buffer in vti and sit devices when
passing truncated packets with PF_PACKET. The issue affects additional
ip tunnel devices.

Extend commit 76c0ddd8c3 ("ip6_tunnel: be careful when accessing the
inner header") and commit ccfec9e5cb ("ip_tunnel: be careful when
accessing the inner header").

Move the check to a separate helper and call at the start of each
ndo_start_xmit function in net/ipv4 and net/ipv6.

Minor changes:
- convert dev_kfree_skb to kfree_skb on error path,
  as dev_kfree_skb calls consume_skb which is not for error paths.
- use pskb_network_may_pull even though that is pedantic here,
  as the same as pskb_may_pull for devices without llheaders.
- do not cache ipv6 hdrs if used only once
  (unsafe across pskb_may_pull, was more relevant to earlier patch)

Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09 17:38:31 +01:00
Gustavo A. R. Silva
360fb1db92 ipv4: Fix potential Spectre v1 vulnerability
[ Upstream commit 5648451e30a0d13d11796574919a359025d52cce ]

vr.vifi is indirectly controlled by user-space, hence leading to
a potential exploitation of the Spectre variant 1 vulnerability.

This issue was detected with the help of Smatch:

net/ipv4/ipmr.c:1616 ipmr_ioctl() warn: potential spectre issue 'mrt->vif_table' [r] (local cap)
net/ipv4/ipmr.c:1690 ipmr_compat_ioctl() warn: potential spectre issue 'mrt->vif_table' [r] (local cap)

Fix this by sanitizing vr.vifi before using it to index mrt->vif_table'

Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].

[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09 17:38:31 +01:00
Eric Dumazet
d265655ae4 tcp: lack of available data can also cause TSO defer
commit f9bfe4e6a9d08d405fe7b081ee9a13e649c97ecf upstream.

tcp_tso_should_defer() can return true in three different cases :

 1) We are cwnd-limited
 2) We are rwnd-limited
 3) We are application limited.

Neal pointed out that my recent fix went too far, since
it assumed that if we were not in 1) case, we must be rwnd-limited

Fix this by properly populating the is_cwnd_limited and
is_rwnd_limited booleans.

After this change, we can finally move the silly check for FIN
flag only for the application-limited case.

The same move for EOR bit will be handled in net-next,
since commit 1c09f7d073b1 ("tcp: do not try to defer skbs
with eor mark (MSG_EOR)") is scheduled for linux-4.21

Tested by running 200 concurrent netperf -t TCP_RR -- -r 60000,100
and checking none of them was rwnd_limited in the chrono_stat
output from "ss -ti" command.

Fixes: 41727549de3e ("tcp: Do not underestimate rwnd_limited")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17 09:24:42 +01:00
Taehee Yoo
5517d4c6dc netfilter: nat: fix double register in masquerade modules
[ Upstream commit 095faf45e64be00bff4da2d6182dface3d69c9b7 ]

There is a reference counter to ensure that masquerade modules register
notifiers only once. However, the existing reference counter approach is
not safe, test commands are:

   while :
   do
   	   modprobe ip6t_MASQUERADE &
	   modprobe nft_masq_ipv6 &
	   modprobe -rv ip6t_MASQUERADE &
	   modprobe -rv nft_masq_ipv6 &
   done

numbers below represent the reference counter.
--------------------------------------------------------
CPU0        CPU1        CPU2        CPU3        CPU4
[insmod]    [insmod]    [rmmod]     [rmmod]     [insmod]
--------------------------------------------------------
0->1
register    1->2
            returns     2->1
			returns     1->0
                                                0->1
                                                register <--
                                    unregister
--------------------------------------------------------

The unregistation of CPU3 should be processed before the
registration of CPU4.

In order to fix this, use a mutex instead of reference counter.

splat looks like:
[  323.869557] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [modprobe:1381]
[  323.869574] Modules linked in: nf_tables(+) nf_nat_ipv6(-) nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 n]
[  323.869574] irq event stamp: 194074
[  323.898930] hardirqs last  enabled at (194073): [<ffffffff90004a0d>] trace_hardirqs_on_thunk+0x1a/0x1c
[  323.898930] hardirqs last disabled at (194074): [<ffffffff90004a29>] trace_hardirqs_off_thunk+0x1a/0x1c
[  323.898930] softirqs last  enabled at (182132): [<ffffffff922006ec>] __do_softirq+0x6ec/0xa3b
[  323.898930] softirqs last disabled at (182109): [<ffffffff90193426>] irq_exit+0x1a6/0x1e0
[  323.898930] CPU: 0 PID: 1381 Comm: modprobe Not tainted 4.20.0-rc2+ #27
[  323.898930] RIP: 0010:raw_notifier_chain_register+0xea/0x240
[  323.898930] Code: 3c 03 0f 8e f2 00 00 00 44 3b 6b 10 7f 4d 49 bc 00 00 00 00 00 fc ff df eb 22 48 8d 7b 10 488
[  323.898930] RSP: 0018:ffff888101597218 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
[  323.898930] RAX: 0000000000000000 RBX: ffffffffc04361c0 RCX: 0000000000000000
[  323.898930] RDX: 1ffffffff26132ae RSI: ffffffffc04aa3c0 RDI: ffffffffc04361d0
[  323.898930] RBP: ffffffffc04361c8 R08: 0000000000000000 R09: 0000000000000001
[  323.898930] R10: ffff8881015972b0 R11: fffffbfff26132c4 R12: dffffc0000000000
[  323.898930] R13: 0000000000000000 R14: 1ffff110202b2e44 R15: ffffffffc04aa3c0
[  323.898930] FS:  00007f813ed41540(0000) GS:ffff88811ae00000(0000) knlGS:0000000000000000
[  323.898930] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  323.898930] CR2: 0000559bf2c9f120 CR3: 000000010bc80000 CR4: 00000000001006f0
[  323.898930] Call Trace:
[  323.898930]  ? atomic_notifier_chain_register+0x2d0/0x2d0
[  323.898930]  ? down_read+0x150/0x150
[  323.898930]  ? sched_clock_cpu+0x126/0x170
[  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
[  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
[  323.898930]  register_netdevice_notifier+0xbb/0x790
[  323.898930]  ? __dev_close_many+0x2d0/0x2d0
[  323.898930]  ? __mutex_unlock_slowpath+0x17f/0x740
[  323.898930]  ? wait_for_completion+0x710/0x710
[  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
[  323.898930]  ? up_write+0x6c/0x210
[  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
[  324.127073]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
[  324.127073]  nft_chain_filter_init+0x1e/0xe8a [nf_tables]
[  324.127073]  nf_tables_module_init+0x37/0x92 [nf_tables]
[ ... ]

Fixes: 8dd33cc93e ("netfilter: nf_nat: generalize IPv4 masquerading support for nf_tables")
Fixes: be6b635cd6 ("netfilter: nf_nat: generalize IPv6 masquerading support for nf_tables")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-17 09:24:37 +01:00
Taehee Yoo
18218f827e netfilter: add missing error handling code for register functions
[ Upstream commit 584eab291c67894cb17cc87544b9d086228ea70f ]

register_{netdevice/inetaddr/inet6addr}_notifier may return an error
value, this patch adds the code to handle these error paths.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-17 09:24:36 +01:00
Yuchung Cheng
bbc83e8d08 tcp: fix NULL ref in tail loss probe
[ Upstream commit b2b7af861122a0c0f6260155c29a1b2e594cd5b5 ]

TCP loss probe timer may fire when the retranmission queue is empty but
has a non-zero tp->packets_out counter. tcp_send_loss_probe will call
tcp_rearm_rto which triggers NULL pointer reference by fetching the
retranmission queue head in its sub-routines.

Add a more detailed warning to help catch the root cause of the inflight
accounting inconsistency.

Reported-by: Rafael Tinoco <rafael.tinoco@linaro.org>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17 09:24:28 +01:00
Eric Dumazet
03b271cb91 tcp: Do not underestimate rwnd_limited
[ Upstream commit 41727549de3e7281feb174d568c6e46823db8684 ]

If available rwnd is too small, tcp_tso_should_defer()
can decide it is worth waiting before splitting a TSO packet.

This really means we are rwnd limited.

Fixes: 5615f88614 ("tcp: instrument how long TCP is limited by receive window")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17 09:24:28 +01:00
Edward Cree
7fafda16bb net: use skb_list_del_init() to remove from RX sublists
[ Upstream commit 22f6bbb7bcfcef0b373b0502a7ff390275c575dd ]

list_del() leaves the skb->next pointer poisoned, which can then lead to
 a crash in e.g. OVS forwarding.  For example, setting up an OVS VXLAN
 forwarding bridge on sfc as per:

========
$ ovs-vsctl show
5dfd9c47-f04b-4aaa-aa96-4fbb0a522a30
    Bridge "br0"
        Port "br0"
            Interface "br0"
                type: internal
        Port "enp6s0f0"
            Interface "enp6s0f0"
        Port "vxlan0"
            Interface "vxlan0"
                type: vxlan
                options: {key="1", local_ip="10.0.0.5", remote_ip="10.0.0.4"}
    ovs_version: "2.5.0"
========
(where 10.0.0.5 is an address on enp6s0f1)
and sending traffic across it will lead to the following panic:
========
general protection fault: 0000 [#1] SMP PTI
CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.20.0-rc3-ehc+ #701
Hardware name: Dell Inc. PowerEdge R710/0M233H, BIOS 6.4.0 07/23/2013
RIP: 0010:dev_hard_start_xmit+0x38/0x200
Code: 53 48 89 fb 48 83 ec 20 48 85 ff 48 89 54 24 08 48 89 4c 24 18 0f 84 ab 01 00 00 48 8d 86 90 00 00 00 48 89 f5 48 89 44 24 10 <4c> 8b 33 48 c7 03 00 00 00 00 48 8b 05 c7 d1 b3 00 4d 85 f6 0f 95
RSP: 0018:ffff888627b437e0 EFLAGS: 00010202
RAX: 0000000000000000 RBX: dead000000000100 RCX: ffff88862279c000
RDX: ffff888614a342c0 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff888618a88000 R08: 0000000000000001 R09: 00000000000003e8
R10: 0000000000000000 R11: ffff888614a34140 R12: 0000000000000000
R13: 0000000000000062 R14: dead000000000100 R15: ffff888616430000
FS:  0000000000000000(0000) GS:ffff888627b40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6d2bc6d000 CR3: 000000000200a000 CR4: 00000000000006e0
Call Trace:
 <IRQ>
 __dev_queue_xmit+0x623/0x870
 ? masked_flow_lookup+0xf7/0x220 [openvswitch]
 ? ep_poll_callback+0x101/0x310
 do_execute_actions+0xaba/0xaf0 [openvswitch]
 ? __wake_up_common+0x8a/0x150
 ? __wake_up_common_lock+0x87/0xc0
 ? queue_userspace_packet+0x31c/0x5b0 [openvswitch]
 ovs_execute_actions+0x47/0x120 [openvswitch]
 ovs_dp_process_packet+0x7d/0x110 [openvswitch]
 ovs_vport_receive+0x6e/0xd0 [openvswitch]
 ? dst_alloc+0x64/0x90
 ? rt_dst_alloc+0x50/0xd0
 ? ip_route_input_slow+0x19a/0x9a0
 ? __udp_enqueue_schedule_skb+0x198/0x1b0
 ? __udp4_lib_rcv+0x856/0xa30
 ? __udp4_lib_rcv+0x856/0xa30
 ? cpumask_next_and+0x19/0x20
 ? find_busiest_group+0x12d/0xcd0
 netdev_frame_hook+0xce/0x150 [openvswitch]
 __netif_receive_skb_core+0x205/0xae0
 __netif_receive_skb_list_core+0x11e/0x220
 netif_receive_skb_list+0x203/0x460
 ? __efx_rx_packet+0x335/0x5e0 [sfc]
 efx_poll+0x182/0x320 [sfc]
 net_rx_action+0x294/0x3c0
 __do_softirq+0xca/0x297
 irq_exit+0xa6/0xb0
 do_IRQ+0x54/0xd0
 common_interrupt+0xf/0xf
 </IRQ>
========
So, in all listified-receive handling, instead pull skbs off the lists with
 skb_list_del_init().

Fixes: 9af86f9338 ("net: core: fix use-after-free in __netif_receive_skb_list_core")
Fixes: 7da517a3bc ("net: core: Another step of skb receive list processing")
Fixes: a4ca8b7df7 ("net: ipv4: fix drop handling in ip_list_rcv() and ip_list_rcv_finish()")
Fixes: d8269e2cbf ("net: ipv6: listify ipv6_rcv() and ip6_rcv_finish()")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17 09:24:27 +01:00
Jiri Wiesner
ffe5754d28 ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
[ Upstream commit ebaf39e6032faf77218220707fc3fa22487784e0 ]

The *_frag_reasm() functions are susceptible to miscalculating the byte
count of packet fragments in case the truesize of a head buffer changes.
The truesize member may be changed by the call to skb_unclone(), leaving
the fragment memory limit counter unbalanced even if all fragments are
processed. This miscalculation goes unnoticed as long as the network
namespace which holds the counter is not destroyed.

Should an attempt be made to destroy a network namespace that holds an
unbalanced fragment memory limit counter the cleanup of the namespace
never finishes. The thread handling the cleanup gets stuck in
inet_frags_exit_net() waiting for the percpu counter to reach zero. The
thread is usually in running state with a stacktrace similar to:

 PID: 1073   TASK: ffff880626711440  CPU: 1   COMMAND: "kworker/u48:4"
  #5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480
  #6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b
  #7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c
  #8 [ffff880621563db0] ops_exit_list at ffffffff814f5856
  #9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0
 #10 [ffff880621563e38] process_one_work at ffffffff81096f14

It is not possible to create new network namespaces, and processes
that call unshare() end up being stuck in uninterruptible sleep state
waiting to acquire the net_mutex.

The bug was observed in the IPv6 netfilter code by Per Sundstrom.
I thank him for his analysis of the problem. The parts of this patch
that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.

Signed-off-by: Jiri Wiesner <jwiesner@suse.com>
Reported-by: Per Sundstrom <per.sundstrom@redqube.se>
Acked-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17 09:24:26 +01:00
Eric Dumazet
aaa7e45c00 tcp: defer SACK compression after DupThresh
[ Upstream commit 86de5921a3d5dd246df661e09bdd0a6131b39ae3 ]

Jean-Louis reported a TCP regression and bisected to recent SACK
compression.

After a loss episode (receiver not able to keep up and dropping
packets because its backlog is full), linux TCP stack is sending
a single SACK (DUPACK).

Sender waits a full RTO timer before recovering losses.

While RFC 6675 says in section 5, "Algorithm Details",

   (2) If DupAcks < DupThresh but IsLost (HighACK + 1) returns true --
       indicating at least three segments have arrived above the current
       cumulative acknowledgment point, which is taken to indicate loss
       -- go to step (4).
...
   (4) Invoke fast retransmit and enter loss recovery as follows:

there are old TCP stacks not implementing this strategy, and
still counting the dupacks before starting fast retransmit.

While these stacks probably perform poorly when receivers implement
LRO/GRO, we should be a little more gentle to them.

This patch makes sure we do not enable SACK compression unless
3 dupacks have been sent since last rcv_nxt update.

Ideally we should even rearm the timer to send one or two
more DUPACK if no more packets are coming, but that will
be work aiming for linux-4.21.

Many thanks to Jean-Louis for bisecting the issue, providing
packet captures and testing this patch.

Fixes: 5d9f4262b7 ("tcp: add SACK compression")
Reported-by: Jean-Louis Dupond <jean-louis@dupond.be>
Tested-by: Jean-Louis Dupond <jean-louis@dupond.be>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-05 19:32:00 +01:00
Eric Dumazet
cddcc9959a tcp: do not release socket ownership in tcp_close()
commit 8873c064d1de579ea23412a6d3eee972593f142b upstream.

syzkaller was able to hit the WARN_ON(sock_owned_by_user(sk));
in tcp_close()

While a socket is being closed, it is very possible other
threads find it in rtnetlink dump.

tcp_get_info() will acquire the socket lock for a short amount
of time (slow = lock_sock_fast(sk)/unlock_sock_fast(sk, slow);),
enough to trigger the warning.

Fixes: 67db3e4bfb ("tcp: no longer hold ehash lock while calling tcp_get_info()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-01 09:37:28 +01:00
Eric Dumazet
115973a608 inet: frags: better deal with smp races
[ Upstream commit 0d5b9311baf27bb545f187f12ecfd558220c607d ]

Multiple cpus might attempt to insert a new fragment in rhashtable,
if for example RPS is buggy, as reported by 배석진 in
https://patchwork.ozlabs.org/patch/994601/

We use rhashtable_lookup_get_insert_key() instead of
rhashtable_insert_fast() to let cpus losing the race
free their own inet_frag_queue and use the one that
was inserted by another cpu.

Fixes: 648700f76b ("inet: frags: use rhashtables for reassembly units")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: 배석진 <soukjin.bae@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-23 08:17:05 +01:00
Stephen Mallon
7e6782271f tcp: Fix SOF_TIMESTAMPING_RX_HARDWARE to use the latest timestamp during TCP coalescing
[ Upstream commit cadf9df27e7cf40e390e060a1c71bb86ecde798b ]

During tcp coalescing ensure that the skb hardware timestamp refers to the
highest sequence number data.
Previously only the software timestamp was updated during coalescing.

Signed-off-by: Stephen Mallon <stephen.mallon@sydney.edu.au>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-23 08:17:03 +01:00
Sabrina Dubroca
60258098e8 ip_tunnel: don't force DF when MTU is locked
[ Upstream commit 16f7eb2b77b55da816c4e207f3f9440a8cafc00a ]

The various types of tunnels running over IPv4 can ask to set the DF
bit to do PMTU discovery. However, PMTU discovery is subject to the
threshold set by the net.ipv4.route.min_pmtu sysctl, and is also
disabled on routes with "mtu lock". In those cases, we shouldn't set
the DF bit.

This patch makes setting the DF bit conditional on the route's MTU
locking state.

This issue seems to be older than git history.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-23 08:17:02 +01:00
Stefan Nuernberger
247a9fa4b4 net/ipv4: defensive cipso option parsing
commit 076ed3da0c9b2f88d9157dbe7044a45641ae369e upstream.

commit 40413955ee ("Cipso: cipso_v4_optptr enter infinite loop") fixed
a possible infinite loop in the IP option parsing of CIPSO. The fix
assumes that ip_options_compile filtered out all zero length options and
that no other one-byte options beside IPOPT_END and IPOPT_NOOP exist.
While this assumption currently holds true, add explicit checks for zero
length and invalid length options to be safe for the future. Even though
ip_options_compile should have validated the options, the introduction of
new one-byte options can still confuse this code without the additional
checks.

Signed-off-by: Stefan Nuernberger <snu@amazon.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Simon Veith <sveith@amazon.de>
Cc: stable@vger.kernel.org
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13 11:08:41 -08:00
Cong Wang
54d6a82d05 net: drop skb on failure in ip_check_defrag()
[ Upstream commit 7de414a9dd91426318df7b63da024b2b07e53df5 ]

Most callers of pskb_trim_rcsum() simply drop the skb when
it fails, however, ip_check_defrag() still continues to pass
the skb up to stack. This is suspicious.

In ip_check_defrag(), after we learn the skb is an IP fragment,
passing the skb to callers makes no sense, because callers expect
fragments are defrag'ed on success. So, dropping the skb when we
can't defrag it is reasonable.

Note, prior to commit 88078d98d1, this is not a big problem as
checksum will be fixed up anyway. After it, the checksum is not
correct on failure.

Found this during code review.

Fixes: 88078d98d1 ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-04 14:50:53 +01:00
Karsten Graul
fd54c188b3 Revert "net: simplify sock_poll_wait"
[ Upstream commit 89ab066d4229acd32e323f1569833302544a4186 ]

This reverts commit dd979b4df8.

This broke tcp_poll for SMC fallback: An AF_SMC socket establishes an
internal TCP socket for the initial handshake with the remote peer.
Whenever the SMC connection can not be established this TCP socket is
used as a fallback. All socket operations on the SMC socket are then
forwarded to the TCP socket. In case of poll, the file->private_data
pointer references the SMC socket because the TCP socket has no file
assigned. This causes tcp_poll to wait on the wrong socket.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-04 14:50:51 +01:00
Sean Tranchetti
4fb0dc97de net: udp: fix handling of CHECKSUM_COMPLETE packets
[ Upstream commit db4f1be3ca9b0ef7330763d07bf4ace83ad6f913 ]

Current handling of CHECKSUM_COMPLETE packets by the UDP stack is
incorrect for any packet that has an incorrect checksum value.

udp4/6_csum_init() will both make a call to
__skb_checksum_validate_complete() to initialize/validate the csum
field when receiving a CHECKSUM_COMPLETE packet. When this packet
fails validation, skb->csum will be overwritten with the pseudoheader
checksum so the packet can be fully validated by software, but the
skb->ip_summed value will be left as CHECKSUM_COMPLETE so that way
the stack can later warn the user about their hardware spewing bad
checksums. Unfortunately, leaving the SKB in this state can cause
problems later on in the checksum calculation.

Since the the packet is still marked as CHECKSUM_COMPLETE,
udp_csum_pull_header() will SUBTRACT the checksum of the UDP header
from skb->csum instead of adding it, leaving us with a garbage value
in that field. Once we try to copy the packet to userspace in the
udp4/6_recvmsg(), we'll make a call to skb_copy_and_csum_datagram_msg()
to checksum the packet data and add it in the garbage skb->csum value
to perform our final validation check.

Since the value we're validating is not the proper checksum, it's possible
that the folded value could come out to 0, causing us not to drop the
packet. Instead, we believe that the packet was checksummed incorrectly
by hardware since skb->ip_summed is still CHECKSUM_COMPLETE, and we attempt
to warn the user with netdev_rx_csum_fault(skb->dev);

Unfortunately, since this is the UDP path, skb->dev has been overwritten
by skb->dev_scratch and is no longer a valid pointer, so we end up
reading invalid memory.

This patch addresses this problem in two ways:
	1) Do not use the dev pointer when calling netdev_rx_csum_fault()
	   from skb_copy_and_csum_datagram_msg(). Since this gets called
	   from the UDP path where skb->dev has been overwritten, we have
	   no way of knowing if the pointer is still valid. Also for the
	   sake of consistency with the other uses of
	   netdev_rx_csum_fault(), don't attempt to call it if the
	   packet was checksummed by software.

	2) Add better CHECKSUM_COMPLETE handling to udp4/6_csum_init().
	   If we receive a packet that's CHECKSUM_COMPLETE that fails
	   verification (i.e. skb->csum_valid == 0), check who performed
	   the calculation. It's possible that the checksum was done in
	   software by the network stack earlier (such as Netfilter's
	   CONNTRACK module), and if that says the checksum is bad,
	   we can drop the packet immediately instead of waiting until
	   we try and copy it to userspace. Otherwise, we need to
	   mark the SKB as CHECKSUM_NONE, since the skb->csum field
	   no longer contains the full packet checksum after the
	   call to __skb_checksum_validate_complete().

Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Fixes: c84d949057 ("udp: copy skb->truesize in the first cache line")
Cc: Sam Kumar <samanthakumar@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-04 14:50:51 +01:00
Nikolay Aleksandrov
eddf016b91 net: ipmr: fix unresolved entry dumps
If the skb space ends in an unresolved entry while dumping we'll miss
some unresolved entries. The reason is due to zeroing the entry counter
between dumping resolved and unresolved mfc entries. We should just
keep counting until the whole table is dumped and zero when we move to
the next as we have a separate table counter.

Reported-by: Colin Ian King <colin.king@canonical.com>
Fixes: 8fb472c09b ("ipmr: improve hash scalability")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-17 22:35:42 -07:00
Sabrina Dubroca
28d35bcdd3 net: ipv4: don't let PMTU updates increase route MTU
When an MTU update with PMTU smaller than net.ipv4.route.min_pmtu is
received, we must clamp its value. However, we can receive a PMTU
exception with PMTU < old_mtu < ip_rt_min_pmtu, which would lead to an
increase in PMTU.

To fix this, take the smallest of the old MTU and ip_rt_min_pmtu.

Before this patch, in case of an update, the exception's MTU would
always change. Now, an exception can have only its lock flag updated,
but not the MTU, so we need to add a check on locking to the following
"is this exception getting updated, or close to expiring?" test.

Fixes: d52e5a7e7c ("ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-10 22:44:46 -07:00
Sabrina Dubroca
af7d6cce53 net: ipv4: update fnhe_pmtu when first hop's MTU changes
Since commit 5aad1de5ea ("ipv4: use separate genid for next hop
exceptions"), exceptions get deprecated separately from cached
routes. In particular, administrative changes don't clear PMTU anymore.

As Stefano described in commit e9fa1495d7 ("ipv6: Reflect MTU changes
on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
the local MTU change can become stale:
 - if the local MTU is now lower than the PMTU, that PMTU is now
   incorrect
 - if the local MTU was the lowest value in the path, and is increased,
   we might discover a higher PMTU

Similarly to what commit e9fa1495d7 did for IPv6, update PMTU in those
cases.

If the exception was locked, the discovered PMTU was smaller than the
minimal accepted PMTU. In that case, if the new local MTU is smaller
than the current PMTU, let PMTU discovery figure out if locking of the
exception is still needed.

To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
notifier. By the time the notifier is called, dev->mtu has been
changed. This patch adds the old MTU as additional information in the
notifier structure, and a new call_netdevice_notifiers_u32() function.

Fixes: 5aad1de5ea ("ipv4: use separate genid for next hop exceptions")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-10 22:44:46 -07:00
Jiri Kosina
7e823644b6 udp: Unbreak modules that rely on external __skb_recv_udp() availability
Commit 2276f58ac5 ("udp: use a separate rx queue for packet reception")
turned static inline __skb_recv_udp() from being a trivial helper around
__skb_recv_datagram() into a UDP specific implementaion, making it
EXPORT_SYMBOL_GPL() at the same time.

There are external modules that got broken by __skb_recv_udp() not being
visible to them. Let's unbreak them by making __skb_recv_udp EXPORT_SYMBOL().

Rationale (one of those) why this is actually "technically correct" thing
to do: __skb_recv_udp() used to be an inline wrapper around
__skb_recv_datagram(), which itself (still, and correctly so, I believe)
is EXPORT_SYMBOL().

Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Fixes: 2276f58ac5 ("udp: use a separate rx queue for packet reception")
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-07 20:33:10 -07:00
Eric Dumazet
64199fc0a4 ipv4: fix use-after-free in ip_cmsg_recv_dstaddr()
Caching ip_hdr(skb) before a call to pskb_may_pull() is buggy,
do not do it.

Fixes: 2efd4fca70 ("ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-02 22:32:05 -07:00
Eric Dumazet
2ab2ddd301 inet: make sure to grab rcu_read_lock before using ireq->ireq_opt
Timer handlers do not imply rcu_read_lock(), so my recent fix
triggered a LOCKDEP warning when SYNACK is retransmit.

Lets add rcu_read_lock()/rcu_read_unlock() pairs around ireq->ireq_opt
usages instead of guessing what is done by callers, since it is
not worth the pain.

Get rid of ireq_opt_deref() helper since it hides the logic
without real benefit, since it is now a standard rcu_dereference().

Fixes: 1ad98e9d1b ("tcp/dccp: fix lockdep issue when SYN is backlogged")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-02 15:52:12 -07:00
David S. Miller
ee0b6f4834 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2018-10-01

1) Validate address prefix lengths in the xfrm selector,
   otherwise we may hit undefined behaviour in the
   address matching functions if the prefix is too
   big for the given address family.

2) Fix skb leak on local message size errors.
   From Thadeu Lima de Souza Cascardo.

3) We currently reset the transport header back to the network
   header after a transport mode transformation is applied. This
   leads to an incorrect transport header when multiple transport
   mode transformations are applied. Reset the transport header
   only after all transformations are already applied to fix this.
   From Sowmini Varadhan.

4) We only support one offloaded xfrm, so reset crypto_done after
   the first transformation in xfrm_input(). Otherwise we may call
   the wrong input method for subsequent transformations.
   From Sowmini Varadhan.

5) Fix NULL pointer dereference when skb_dst_force clears the dst_entry.
   skb_dst_force does not really force a dst refcount anymore, it might
   clear it instead. xfrm code did not expect this, add a check to not
   dereference skb_dst() if it was cleared by skb_dst_force.

6) Validate xfrm template mode, otherwise we can get a stack-out-of-bounds
   read in xfrm_state_find. From Sean Tranchetti.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01 22:29:25 -07:00
Eric Dumazet
1ad98e9d1b tcp/dccp: fix lockdep issue when SYN is backlogged
In normal SYN processing, packets are handled without listener
lock and in RCU protected ingress path.

But syzkaller is known to be able to trick us and SYN
packets might be processed in process context, after being
queued into socket backlog.

In commit 06f877d613 ("tcp/dccp: fix other lockdep splats
accessing ireq_opt") I made a very stupid fix, that happened
to work mostly because of the regular path being RCU protected.

Really the thing protecting ireq->ireq_opt is RCU read lock,
and the pseudo request refcnt is not relevant.

This patch extends what I did in commit 449809a66c ("tcp/dccp:
block BH for SYN processing") by adding an extra rcu_read_{lock|unlock}
pair in the paths that might be taken when processing SYN from
socket backlog (thus possibly in process context)

Fixes: 06f877d613 ("tcp/dccp: fix other lockdep splats accessing ireq_opt")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01 15:42:13 -07:00
Maciej Żenczykowski
d4ce58082f net-tcp: /proc/sys/net/ipv4/tcp_probe_interval is a u32 not int
(fix documentation and sysctl access to treat it as such)

Tested:
  # zcat /proc/config.gz | egrep ^CONFIG_HZ
  CONFIG_HZ_1000=y
  CONFIG_HZ=1000
  # echo $[(1<<32)/1000 + 1] | tee /proc/sys/net/ipv4/tcp_probe_interval
  4294968
  tee: /proc/sys/net/ipv4/tcp_probe_interval: Invalid argument
  # echo $[(1<<32)/1000] | tee /proc/sys/net/ipv4/tcp_probe_interval
  4294967
  # echo 0 | tee /proc/sys/net/ipv4/tcp_probe_interval
  # echo -1 | tee /proc/sys/net/ipv4/tcp_probe_interval
  -1
  tee: /proc/sys/net/ipv4/tcp_probe_interval: Invalid argument

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-26 20:33:21 -07:00
Paolo Abeni
ccfec9e5cb ip_tunnel: be careful when accessing the inner header
Cong noted that we need the same checks introduced by commit 76c0ddd8c3
("ip6_tunnel: be careful when accessing the inner header")
even for ipv4 tunnels.

Fixes: c544193214 ("GRE: Refactor GRE tunneling code.")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-24 12:27:04 -07:00
Paolo Abeni
2b5a921740 udp4: fix IP_CMSG_CHECKSUM for connected sockets
commit 2abb7cdc0d ("udp: Add support for doing checksum
unnecessary conversion") left out the early demux path for
connected sockets. As a result IP_CMSG_CHECKSUM gives wrong
values for such socket when GRO is not enabled/available.

This change addresses the issue by moving the csum conversion to a
common helper and using such helper in both the default and the
early demux rx path.

Fixes: 2abb7cdc0d ("udp: Add support for doing checksum unnecessary conversion")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-16 15:27:44 -07:00
Toke Høiland-Jørgensen
c56cae23c6 gso_segment: Reset skb->mac_len after modifying network header
When splitting a GSO segment that consists of encapsulated packets, the
skb->mac_len of the segments can end up being set wrong, causing packet
drops in particular when using act_mirred and ifb interfaces in
combination with a qdisc that splits GSO packets.

This happens because at the time skb_segment() is called, network_header
will point to the inner header, throwing off the calculation in
skb_reset_mac_len(). The network_header is subsequently adjust by the
outer IP gso_segment handlers, but they don't set the mac_len.

Fix this by adding skb_reset_mac_len() calls to both the IPv4 and IPv6
gso_segment handlers, after they modify the network_header.

Many thanks to Eric Dumazet for his help in identifying the cause of
the bug.

Acked-by: Dave Taht <dave.taht@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 12:09:32 -07:00
Haishuang Yan
51dc63e391 erspan: fix error handling for erspan tunnel
When processing icmp unreachable message for erspan tunnel, tunnel id
should be erspan_net_id instead of ipgre_net_id.

Fixes: 84e54fe0a5 ("gre: introduce native tunnel support for ERSPAN")
Cc: William Tu <u9012063@gmail.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-11 23:50:54 -07:00
Haishuang Yan
5a64506b5c erspan: return PACKET_REJECT when the appropriate tunnel is not found
If erspan tunnel hasn't been established, we'd better send icmp port
unreachable message after receive erspan packets.

Fixes: 84e54fe0a5 ("gre: introduce native tunnel support for ERSPAN")
Cc: William Tu <u9012063@gmail.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-11 23:50:53 -07:00
Willem de Bruijn
0297c1c2ea tcp: rate limit synflood warnings further
Convert pr_info to net_info_ratelimited to limit the total number of
synflood warnings.

Commit 946cedccbd ("tcp: Change possible SYN flooding messages")
rate limits synflood warnings to one per listener.

Workloads that open many listener sockets can still see a high rate of
log messages. Syzkaller is one frequent example.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-11 23:34:20 -07:00
David S. Miller
4ecdf77091 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for you net tree:

1) Remove duplicated include at the end of UDP conntrack, from Yue Haibing.

2) Restore conntrack dependency on xt_cluster, from Martin Willi.

3) Fix splat with GSO skbs from the checksum target, from Florian Westphal.

4) Rework ct timeout support, the template strategy to attach custom timeouts
   is not correct since it will not work in conjunction with conntrack zones
   and we have a possible free after use when removing the rule due to missing
   refcounting. To fix these problems, do not use conntrack template at all
   and set custom timeout on the already valid conntrack object. This
   fix comes with a preparation patch to simplify timeout adjustment by
   initializating the first position of the timeout array for all of the
   existing trackers. Patchset from Florian Westphal.

5) Fix missing dependency on from IPv4 chain NAT type, from Florian.

6) Release chain reference counter from the flush path, from Taehee Yoo.

7) After flushing an iptables ruleset, conntrack hooks are unregistered
   and entries are left stale to be cleaned up by the timeout garbage
   collector. No TCP tracking is done on established flows by this time.
   If ruleset is reloaded, then hooks are registered again and TCP
   tracking is restored, which considers packets to be invalid. Clear
   window tracking to exercise TCP flow pickup from the middle given that
   history is lost for us. Again from Florian.

8) Fix crash from netlink interface with CONFIG_NF_CONNTRACK_TIMEOUT=y
   and CONFIG_NF_CT_NETLINK_TIMEOUT=n.

9) Broken CT target due to returning incorrect type from
   ctnl_timeout_find_get().

10) Solve conntrack clash on NF_REPEAT verdicts too, from Michal Vaner.

11) Missing conversion of hashlimit sysctl interface to new API, from
    Cong Wang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-11 21:17:30 -07:00
Taehee Yoo
5d407b071d ip: frags: fix crash in ip_do_fragment()
A kernel crash occurrs when defragmented packet is fragmented
in ip_do_fragment().
In defragment routine, skb_orphan() is called and
skb->ip_defrag_offset is set. but skb->sk and
skb->ip_defrag_offset are same union member. so that
frag->sk is not NULL.
Hence crash occurrs in skb->sk check routine in ip_do_fragment() when
defragmented packet is fragmented.

test commands:
   %iptables -t nat -I POSTROUTING -j MASQUERADE
   %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000

splat looks like:
[  261.069429] kernel BUG at net/ipv4/ip_output.c:636!
[  261.075753] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3
[  261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600
[  261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff <0f> 0b 0f 0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c
[  261.127015] RSP: 0018:ffff8801031cf2c0 EFLAGS: 00010202
[  261.134156] RAX: 1ffff1002297537b RBX: ffffed0020639e6e RCX: 0000000000000004
[  261.142156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880114ba9bd8
[  261.150157] RBP: ffff880114ba8a40 R08: ffffed0022975395 R09: ffffed0022975395
[  261.158157] R10: 0000000000000001 R11: ffffed0022975394 R12: ffff880114ba9ca4
[  261.166159] R13: 0000000000000010 R14: ffff880114ba9bc0 R15: dffffc0000000000
[  261.174169] FS:  00007fbae2199700(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000
[  261.183012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  261.189013] CR2: 00005579244fe000 CR3: 0000000119bf4000 CR4: 00000000001006e0
[  261.198158] Call Trace:
[  261.199018]  ? dst_output+0x180/0x180
[  261.205011]  ? save_trace+0x300/0x300
[  261.209018]  ? ip_copy_metadata+0xb00/0xb00
[  261.213034]  ? sched_clock_local+0xd4/0x140
[  261.218158]  ? kill_l4proto+0x120/0x120 [nf_conntrack]
[  261.223014]  ? rt_cpu_seq_stop+0x10/0x10
[  261.227014]  ? find_held_lock+0x39/0x1c0
[  261.233008]  ip_finish_output+0x51d/0xb50
[  261.237006]  ? ip_fragment.constprop.56+0x220/0x220
[  261.243011]  ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack]
[  261.250152]  ? rcu_is_watching+0x77/0x120
[  261.255010]  ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4]
[  261.261033]  ? nf_hook_slow+0xb1/0x160
[  261.265007]  ip_output+0x1c7/0x710
[  261.269005]  ? ip_mc_output+0x13f0/0x13f0
[  261.273002]  ? __local_bh_enable_ip+0xe9/0x1b0
[  261.278152]  ? ip_fragment.constprop.56+0x220/0x220
[  261.282996]  ? nf_hook_slow+0xb1/0x160
[  261.287007]  raw_sendmsg+0x21f9/0x4420
[  261.291008]  ? dst_output+0x180/0x180
[  261.297003]  ? sched_clock_cpu+0x126/0x170
[  261.301003]  ? find_held_lock+0x39/0x1c0
[  261.306155]  ? stop_critical_timings+0x420/0x420
[  261.311004]  ? check_flags.part.36+0x450/0x450
[  261.315005]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.320995]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.326142]  ? cyc2ns_read_end+0x10/0x10
[  261.330139]  ? raw_bind+0x280/0x280
[  261.334138]  ? sched_clock_cpu+0x126/0x170
[  261.338995]  ? check_flags.part.36+0x450/0x450
[  261.342991]  ? __lock_acquire+0x4500/0x4500
[  261.348994]  ? inet_sendmsg+0x11c/0x500
[  261.352989]  ? dst_output+0x180/0x180
[  261.357012]  inet_sendmsg+0x11c/0x500
[ ... ]

v2:
 - clear skb->sk at reassembly routine.(Eric Dumarzet)

Fixes: fa0f527358 ("ip: use rb trees for IP frag queue.")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-09 14:50:56 -07:00
Vincent Whitchurch
5cf4a8532c tcp: really ignore MSG_ZEROCOPY if no SO_ZEROCOPY
According to the documentation in msg_zerocopy.rst, the SO_ZEROCOPY
flag was introduced because send(2) ignores unknown message flags and
any legacy application which was accidentally passing the equivalent of
MSG_ZEROCOPY earlier should not see any new behaviour.

Before commit f214f915e7 ("tcp: enable MSG_ZEROCOPY"), a send(2) call
which passed the equivalent of MSG_ZEROCOPY without setting SO_ZEROCOPY
would succeed.  However, after that commit, it fails with -ENOBUFS.  So
it appears that the SO_ZEROCOPY flag fails to fulfill its intended
purpose.  Fix it.

Fixes: f214f915e7 ("tcp: enable MSG_ZEROCOPY")
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-07 23:11:06 -07:00
Sowmini Varadhan
bfc0698beb xfrm: reset transport header back to network header after all input transforms ahave been applied
A policy may have been set up with multiple transforms (e.g., ESP
and ipcomp). In this situation, the ingress IPsec processing
iterates in xfrm_input() and applies each transform in turn,
processing the nexthdr to find any additional xfrm that may apply.

This patch resets the transport header back to network header
only after the last transformation so that subsequent xfrms
can find the correct transport header.

Fixes: 7785bba299 ("esp: Add a software GRO codepath")
Suggested-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-09-04 10:26:30 +02:00
Hangbin Liu
ff06525fcb igmp: fix incorrect unsolicit report count after link down and up
After link down and up, i.e. when call ip_mc_up(), we doesn't init
im->unsolicit_count. So after igmp_timer_expire(), we will not start
timer again and only send one unsolicit report at last.

Fix it by initializing im->unsolicit_count in igmp_group_added(), so
we can respect igmp robustness value.

Fixes: 24803f38a5 ("igmp: do not remove igmp souce list info when set link down")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-02 13:39:37 -07:00
Hangbin Liu
4fb7253e4f igmp: fix incorrect unsolicit report count when join group
We should not start timer if im->unsolicit_count equal to 0 after decrease.
Or we will send one more unsolicit report message. i.e. 3 instead of 2 by
default.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-02 13:39:37 -07:00
Florian Westphal
63cc357f7b tcp: do not restart timewait timer on rst reception
RFC 1337 says:
 ''Ignore RST segments in TIME-WAIT state.
   If the 2 minute MSL is enforced, this fix avoids all three hazards.''

So with net.ipv4.tcp_rfc1337=1, expected behaviour is to have TIME-WAIT sk
expire rather than removing it instantly when a reset is received.

However, Linux will also re-start the TIME-WAIT timer.

This causes connect to fail when tying to re-use ports or very long
delays (until syn retry interval exceeds MSL).

packetdrill test case:
// Demonstrate bogus rearming of TIME-WAIT timer in rfc1337 mode.
`sysctl net.ipv4.tcp_rfc1337=1`

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 0:0(0) win 29200 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4

// Receive first segment
0.310 < P. 1:1001(1000) ack 1 win 46

// Send one ACK
0.310 > . 1:1(0) ack 1001

// read 1000 byte
0.310 read(4, ..., 1000) = 1000

// Application writes 100 bytes
0.350 write(4, ..., 100) = 100
0.350 > P. 1:101(100) ack 1001

// ACK
0.500 < . 1001:1001(0) ack 101 win 257

// close the connection
0.600 close(4) = 0
0.600 > F. 101:101(0) ack 1001 win 244

// Our side is in FIN_WAIT_1 & waits for ack to fin
0.7 < . 1001:1001(0) ack 102 win 244

// Our side is in FIN_WAIT_2 with no outstanding data.
0.8 < F. 1001:1001(0) ack 102 win 244
0.8 > . 102:102(0) ack 1002 win 244

// Our side is now in TIME_WAIT state, send ack for fin.
0.9 < F. 1002:1002(0) ack 102 win 244
0.9 > . 102:102(0) ack 1002 win 244

// Peer reopens with in-window SYN:
1.000 < S 1000:1000(0) win 9200 <mss 1460,nop,nop,sackOK,nop,wscale 7>

// Therefore, reply with ACK.
1.000 > . 102:102(0) ack 1002 win 244

// Peer sends RST for this ACK.  Normally this RST results
// in tw socket removal, but rfc1337=1 setting prevents this.
1.100 < R 1002:1002(0) win 244

// second syn. Due to rfc1337=1 expect another pure ACK.
31.0 < S 1000:1000(0) win 9200 <mss 1460,nop,nop,sackOK,nop,wscale 7>
31.0 > . 102:102(0) ack 1002 win 244

// .. and another RST from peer.
31.1 < R 1002:1002(0) win 244
31.2 `echo no timer restart;ss -m -e -a -i -n -t -o state TIME-WAIT`

// third syn after one minute.  Time-Wait socket should have expired by now.
63.0 < S 1000:1000(0) win 9200 <mss 1460,nop,nop,sackOK,nop,wscale 7>

// so we expect a syn-ack & 3whs to proceed from here on.
63.0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>

Without this patch, 'ss' shows restarts of tw timer and last packet is
thus just another pure ack, more than one minute later.

This restores the original code from commit 283fd6cf0be690a83
("Merge in ANK networking jumbo patch") in netdev-vger-cvs.git .

For some reason the else branch was removed/lost in 1f28b683339f7
("Merge in TCP/UDP optimizations and [..]") and timer restart became
unconditional.

Reported-by: Michal Tesar <mtesar@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-31 23:10:35 -07:00
Florian Westphal
e075841220 netfilter: kconfig: nat related expression depend on nftables core
NF_TABLES_IPV4 is now boolean so it is possible to set

NF_TABLES=m
NF_TABLES_IPV4=y
NFT_CHAIN_NAT_IPV4=y

which causes:
nft_chain_nat_ipv4.c:(.text+0x6d): undefined reference to `nft_do_chain'

Wrap NFT_CHAIN_NAT_IPV4 and related nat expressions with NF_TABLES to
restore the dependency.

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Fixes: 02c7b25e5f ("netfilter: nf_tables: build-in filter chain type")
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-31 19:07:10 +02:00
Xin Long
84581bdae9 erspan: set erspan_ver to 1 by default when adding an erspan dev
After erspan_ver is introudced, if erspan_ver is not set in iproute, its
value will be left 0 by default. Since Commit 02f99df187 ("erspan: fix
invalid erspan version."), it has broken the traffic due to the version
check in erspan_xmit if users are not aware of 'erspan_ver' param, like
using an old version of iproute.

To fix this compatibility problem, it sets erspan_ver to 1 by default
when adding an erspan dev in erspan_setup. Note that we can't do it in
ipgre_netlink_parms, as this function is also used by ipgre_changelink.

Fixes: 02f99df187 ("erspan: fix invalid erspan version.")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-27 15:13:17 -07:00
Kevin Yang
8e995bf14f tcp_bbr: apply PROBE_RTT cwnd cap even if acked==0
This commit fixes a corner case where TCP BBR would enter PROBE_RTT
mode but not reduce its cwnd. If a TCP receiver ACKed less than one
full segment, the number of delivered/acked packets was 0, so that
bbr_set_cwnd() would short-circuit and exit early, without cutting
cwnd to the value we want for PROBE_RTT.

The fix is to instead make sure that even when 0 full packets are
ACKed, we do apply all the appropriate caps, including the cap that
applies in PROBE_RTT mode.

Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Kevin Yang <yyd@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-22 21:45:32 -07:00
Kevin Yang
5490b32dce tcp_bbr: in restart from idle, see if we should exit PROBE_RTT
This patch fix the case where BBR does not exit PROBE_RTT mode when
it restarts from idle. When BBR restarts from idle and if BBR is in
PROBE_RTT mode, BBR should check if it's time to exit PROBE_RTT. If
yes, then BBR should exit PROBE_RTT mode and restore the cwnd to its
full value.

Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Kevin Yang <yyd@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-22 21:45:32 -07:00
Kevin Yang
fb99886224 tcp_bbr: add bbr_check_probe_rtt_done() helper
This patch add a helper function bbr_check_probe_rtt_done() to
  1. check the condition to see if bbr should exit probe_rtt mode;
  2. process the logic of exiting probe_rtt mode.

Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Kevin Yang <yyd@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-22 21:45:32 -07:00
Eric Dumazet
431280eebe ipv4: tcp: send zero IPID for RST and ACK sent in SYN-RECV and TIME-WAIT state
tcp uses per-cpu (and per namespace) sockets (net->ipv4.tcp_sk) internally
to send some control packets.

1) RST packets, through tcp_v4_send_reset()
2) ACK packets in SYN-RECV and TIME-WAIT state, through tcp_v4_send_ack()

These packets assert IP_DF, and also use the hashed IP ident generator
to provide an IPv4 ID number.

Geoff Alexander reported this could be used to build off-path attacks.

These packets should not be fragmented, since their size is smaller than
IPV4_MIN_MTU. Only some tunneled paths could eventually have to fragment,
regardless of inner IPID.

We really can use zero IPID, to address the flaw, and as a bonus,
avoid a couple of atomic operations in ip_idents_reserve()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Geoff Alexander <alexandg@cs.unm.edu>
Tested-by: Geoff Alexander <alexandg@cs.unm.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-22 21:42:58 -07:00
Haishuang Yan
cd1aa9c2c6 ip_vti: fix a null pointer deferrence when create vti fallback tunnel
After set fb_tunnels_only_for_init_net to 1, the itn->fb_tunnel_dev will
be NULL and will cause following crash:

[ 2742.849298] BUG: unable to handle kernel NULL pointer dereference at 0000000000000941
[ 2742.851380] PGD 800000042c21a067 P4D 800000042c21a067 PUD 42aaed067 PMD 0
[ 2742.852818] Oops: 0002 [#1] SMP PTI
[ 2742.853570] CPU: 7 PID: 2484 Comm: unshare Kdump: loaded Not tainted 4.18.0-rc8+ #2
[ 2742.855163] Hardware name: Fedora Project OpenStack Nova, BIOS seabios-1.7.5-11.el7 04/01/2014
[ 2742.856970] RIP: 0010:vti_init_net+0x3a/0x50 [ip_vti]
[ 2742.858034] Code: 90 83 c0 48 c7 c2 20 a1 83 c0 48 89 fb e8 6e 3b f6 ff 85 c0 75 22 8b 0d f4 19 00 00 48 8b 93 00 14 00 00 48 8b 14 ca 48 8b 12 <c6> 82 41 09 00 00 04 c6 82 38 09 00 00 45 5b c3 66 0f 1f 44 00 00
[ 2742.861940] RSP: 0018:ffff9be28207fde0 EFLAGS: 00010246
[ 2742.863044] RAX: 0000000000000000 RBX: ffff8a71ebed4980 RCX: 0000000000000013
[ 2742.864540] RDX: 0000000000000000 RSI: 0000000000000013 RDI: ffff8a71ebed4980
[ 2742.866020] RBP: ffff8a71ea717000 R08: ffffffffc083903c R09: ffff8a71ea717000
[ 2742.867505] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a71ebed4980
[ 2742.868987] R13: 0000000000000013 R14: ffff8a71ea5b49c0 R15: 0000000000000000
[ 2742.870473] FS:  00007f02266c9740(0000) GS:ffff8a71ffdc0000(0000) knlGS:0000000000000000
[ 2742.872143] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2742.873340] CR2: 0000000000000941 CR3: 000000042bc20006 CR4: 00000000001606e0
[ 2742.874821] Call Trace:
[ 2742.875358]  ops_init+0x38/0xf0
[ 2742.876078]  setup_net+0xd9/0x1f0
[ 2742.876789]  copy_net_ns+0xb7/0x130
[ 2742.877538]  create_new_namespaces+0x11a/0x1d0
[ 2742.878525]  unshare_nsproxy_namespaces+0x55/0xa0
[ 2742.879526]  ksys_unshare+0x1a7/0x330
[ 2742.880313]  __x64_sys_unshare+0xe/0x20
[ 2742.881131]  do_syscall_64+0x5b/0x180
[ 2742.881933]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reproduce:
echo 1 > /proc/sys/net/core/fb_tunnels_only_for_init_net
modprobe ip_vti
unshare -n

Fixes: 79134e6ce2 ("net: do not create fallback tunnels for non-default namespaces")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-19 11:26:39 -07:00
Daniel Borkmann
90545cdc3f tcp, ulp: fix leftover icsk_ulp_ops preventing sock from reattach
I found that in BPF sockmap programs once we either delete a socket
from the map or we updated a map slot and the old socket was purged
from the map that these socket can never get reattached into a map
even though their related psock has been dropped entirely at that
point.

Reason is that tcp_cleanup_ulp() leaves the old icsk->icsk_ulp_ops
intact, so that on the next tcp_set_ulp_id() the kernel returns an
-EEXIST thinking there is still some active ULP attached.

BPF sockmap is the only one that has this issue as the other user,
kTLS, only calls tcp_cleanup_ulp() from tcp_v4_destroy_sock() whereas
sockmap semantics allow dropping the socket from the map with all
related psock state being cleaned up.

Fixes: 1aa12bdf1b ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-16 14:58:08 -07:00
Daniel Borkmann
037b0b86ec tcp, ulp: add alias for all ulp modules
Lets not turn the TCP ULP lookup into an arbitrary module loader as
we only intend to load ULP modules through this mechanism, not other
unrelated kernel modules:

  [root@bar]# cat foo.c
  #include <sys/types.h>
  #include <sys/socket.h>
  #include <linux/tcp.h>
  #include <linux/in.h>

  int main(void)
  {
      int sock = socket(PF_INET, SOCK_STREAM, 0);
      setsockopt(sock, IPPROTO_TCP, TCP_ULP, "sctp", sizeof("sctp"));
      return 0;
  }

  [root@bar]# gcc foo.c -O2 -Wall
  [root@bar]# lsmod | grep sctp
  [root@bar]# ./a.out
  [root@bar]# lsmod | grep sctp
  sctp                 1077248  4
  libcrc32c              16384  3 nf_conntrack,nf_nat,sctp
  [root@bar]#

Fix it by adding module alias to TCP ULP modules, so probing module
via request_module() will be limited to tcp-ulp-[name]. The existing
modules like kTLS will load fine given tcp-ulp-tls alias, but others
will fail to load:

  [root@bar]# lsmod | grep sctp
  [root@bar]# ./a.out
  [root@bar]# lsmod | grep sctp
  [root@bar]#

Sockmap is not affected from this since it's either built-in or not.

Fixes: 734942cc4e ("tcp: ULP infrastructure")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-16 14:58:07 -07:00
David S. Miller
c1617fb4c5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-08-13

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add driver XDP support for veth. This can be used in conjunction with
   redirect of another XDP program e.g. sitting on NIC so the xdp_frame
   can be forwarded to the peer veth directly without modification,
   from Toshiaki.

2) Add a new BPF map type REUSEPORT_SOCKARRAY and prog type SK_REUSEPORT
   in order to provide more control and visibility on where a SO_REUSEPORT
   sk should be located, and the latter enables to directly select a sk
   from the bpf map. This also enables map-in-map for application migration
   use cases, from Martin.

3) Add a new BPF helper bpf_skb_ancestor_cgroup_id() that returns the id
   of cgroup v2 that is the ancestor of the cgroup associated with the
   skb at the ancestor_level, from Andrey.

4) Implement BPF fs map pretty-print support based on BTF data for regular
   hash table and LRU map, from Yonghong.

5) Decouple the ability to attach BTF for a map from the key and value
   pretty-printer in BPF fs, and enable further support of BTF for maps for
   percpu and LPM trie, from Daniel.

6) Implement a better BPF sample of using XDP's CPU redirect feature for
   load balancing SKB processing to remote CPU. The sample implements the
   same XDP load balancing as Suricata does which is symmetric hash based
   on IP and L4 protocol, from Jesper.

7) Revert adding NULL pointer check with WARN_ON_ONCE() in __xdp_return()'s
   critical path as it is ensured that the allocator is present, from Björn.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 10:07:23 -07:00
Peter Oskolkov
a4fd284a1f ip: process in-order fragments efficiently
This patch changes the runtime behavior of IP defrag queue:
incoming in-order fragments are added to the end of the current
list/"run" of in-order fragments at the tail.

On some workloads, UDP stream performance is substantially improved:

RX: ./udp_stream -F 10 -T 2 -l 60
TX: ./udp_stream -c -H <host> -F 10 -T 5 -l 60

with this patchset applied on a 10Gbps receiver:

  throughput=9524.18
  throughput_units=Mbit/s

upstream (net-next):

  throughput=4608.93
  throughput_units=Mbit/s

Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 17:54:18 -07:00
Peter Oskolkov
353c9cb360 ip: add helpers to process in-order fragments faster.
This patch introduces several helper functions/macros that will be
used in the follow-up patch. No runtime changes yet.

The new logic (fully implemented in the second patch) is as follows:

* Nodes in the rb-tree will now contain not single fragments, but lists
  of consecutive fragments ("runs").

* At each point in time, the current "active" run at the tail is
  maintained/tracked. Fragments that arrive in-order, adjacent
  to the previous tail fragment, are added to this tail run without
  triggering the re-balancing of the rb-tree.

* If a fragment arrives out of order with the offset _before_ the tail run,
  it is inserted into the rb-tree as a single fragment.

* If a fragment arrives after the current tail fragment (with a gap),
  it starts a new "tail" run, as is inserted into the rb-tree
  at the end as the head of the new run.

skb->cb is used to store additional information
needed here (suggested by Eric Dumazet).

Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 17:54:18 -07:00
Yuchung Cheng
fd2123a3d7 tcp: avoid resetting ACK timer upon receiving packet with ECN CWR flag
Previously commit 9aee400061 ("tcp: ack immediately when a cwr
packet arrives") calls tcp_enter_quickack_mode to force sending
two immediate ACKs upon receiving a packet w/ CWR flag. The side
effect is it'll also reset the delayed ACK timer and interactive
session tracking. This patch removes that side effect by using the
new ACK_NOW flag to force an immmediate ACK.

Packetdrill to demonstrate:

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
   +0 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
  +.1 < [ect0] . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4

   +0 < [ect0] . 1:1001(1000) ack 1 win 257
   +0 > [ect01] . 1:1(0) ack 1001

   +0 write(4, ..., 1) = 1
   +0 > [ect01] P. 1:2(1) ack 1001

   +0 < [ect0] . 1001:2001(1000) ack 2 win 257
   +0 write(4, ..., 1) = 1
   +0 > [ect01] P. 2:3(1) ack 2001

   +0 < [ect0] . 2001:3001(1000) ack 3 win 257
   +0 < [ect0] . 3001:4001(1000) ack 3 win 257
   // Ack delayed ...

   +.01 < [ce] P. 4001:4501(500) ack 3 win 257
   +0 > [ect01] . 3:3(0) ack 4001
   +0 > [ect01] E. 3:3(0) ack 4501

+.001 read(4, ..., 4500) = 4500
   +0 write(4, ..., 1) = 1
   +0 > [ect01] PE. 3:4(1) ack 4501 win 100

 +.01 < [ect0] W. 4501:5501(1000) ack 4 win 257
   // No delayed ACK on CWR flag
   +0 > [ect01] . 4:4(0) ack 5501

 +.31 < [ect0] . 5501:6501(1000) ack 4 win 257
   +0 > [ect01] . 4:4(0) ack 6501

Fixes: 9aee400061 ("tcp: ack immediately when a cwr packet arrives")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 11:31:35 -07:00
Yuchung Cheng
15bdd5686c tcp: always ACK immediately on hole repairs
RFC 5681 sec 4.2:
  To provide feedback to senders recovering from losses, the receiver
  SHOULD send an immediate ACK when it receives a data segment that
  fills in all or part of a gap in the sequence space.

When a gap is partially filled, __tcp_ack_snd_check already checks
the out-of-order queue and correctly send an immediate ACK. However
when a gap is fully filled, the previous implementation only resets
pingpong mode which does not guarantee an immediate ACK because the
quick ACK counter may be zero. This patch addresses this issue by
marking the one-time immediate ACK flag instead.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 11:31:35 -07:00