Commit graph

566 commits

Author SHA1 Message Date
Minchan Kim
c27ea19ea4 ANDROID: GKI: attribute page lock and waitqueue functions as sched
trace_sched_blocked_trace in CFS is really useful for debugging via
trace because it tell where the process was stuck on callstack.

For example,
           <...>-6143  ( 6136) [005] d..2    50.278987: sched_blocked_reason: pid=6136 iowait=0 caller=SyS_mprotect+0x88/0x208
           <...>-6136  ( 6136) [005] d..2    50.278990: sched_blocked_reason: pid=6142 iowait=0 caller=do_page_fault+0x1f4/0x3b0
           <...>-6142  ( 6136) [006] d..2    50.278996: sched_blocked_reason: pid=6144 iowait=0 caller=SyS_prctl+0x52c/0xb58
           <...>-6144  ( 6136) [006] d..2    50.279007: sched_blocked_reason: pid=6136 iowait=0 caller=vm_mmap_pgoff+0x74/0x104

However, sometime it gives pointless information like this.
    RenderThread-2322  ( 1805) [006] d.s3    50.319046: sched_blocked_reason: pid=6136 iowait=1 caller=__lock_page_killable+0x17c/0x220
     logd.writer-594   (  587) [002] d.s3    50.334011: sched_blocked_reason: pid=6126 iowait=1 caller=wait_on_page_bit+0x194/0x208
  kworker/u16:13-333   (  333) [007] d.s4    50.343161: sched_blocked_reason: pid=6136 iowait=1 caller=__lock_page_killable+0x17c/0x220

Such wait_on_page_bit, __lock_page_killable are pointless because it doesn't
carry on higher information to identify the callstack.

The reason is page_lock and waitqueue are special synchronization method unlike
other normal locks(mutex, spinlock).
Let's mark them as "__sched" so get_wchan which used in trace_sched_blocked_trace
could detect it and skip them. It will produce more meaningful callstack
function like this.

           <...>-2867  ( 1068) [002] d.h4   124.209701: sched_blocked_reason: pid=329 iowait=0 caller=worker_thread+0x378/0x470
           <...>-2867  ( 1068) [002] d.s3   124.209763: sched_blocked_reason: pid=8454 iowait=1 caller=__filemap_fdatawait_range+0xa0/0x104
           <...>-2867  ( 1068) [002] d.s4   124.209803: sched_blocked_reason: pid=869 iowait=0 caller=worker_thread+0x378/0x470
 ScreenDecoratio-2364  ( 1867) [002] d.s3   124.209973: sched_blocked_reason: pid=8454 iowait=1 caller=f2fs_wait_on_page_writeback+0x84/0xcc
 ScreenDecoratio-2364  ( 1867) [002] d.s4   124.209986: sched_blocked_reason: pid=869 iowait=0 caller=worker_thread+0x378/0x470
           <...>-329   (  329) [000] d..3   124.210435: sched_blocked_reason: pid=538 iowait=0 caller=worker_thread+0x378/0x470
  kworker/u16:13-538   (  538) [007] d..3   124.210450: sched_blocked_reason: pid=6 iowait=0 caller=worker_thread+0x378/0x470

Bug: 144961676
Bug: 144713689
Change-Id: I30397400c5d056946bdfbc86c9ef5f4d7e6c98fe
Signed-off-by: Minchan Kim <minchan@google.com>
Signed-off-by: Jimmy Shiu <jimmyshiu@google.com>
Bug: 152417756
(cherry picked from commit 8a780c0eb6800cecbfce21362c2d2a3bcab14e1c)
Signed-off-by: Saravana Kannan <saravanak@google.com>
2020-04-17 22:51:44 -07:00
Vinayak Menon
34e3bd1772 ANDROID: GKI: mm: vmstat: add pageoutclean
vmstat events currently count pgpgout, but that includes
only the writebacks, and not the reclaim of clean
pages. Add an event to count clean page evictions. This is
helpful to evaluate page thrashing cases.

Bug: 151963988
Test: build
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
(cherry picked from commit f15713432c)
[surenb: cherry-picked to resolve ABI diffs caused by vm_event_item enum]
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I113cfce495b5b70f0cef88c8425c7252e01291fe
2020-03-24 13:25:06 -07:00
Jaegeuk Kim
435c9a613f Merge remote-tracking branch 'aosp/upstream-f2fs-stable-linux-4.19.y/v5.5-rc1' into android-4.19
* aosp/upstream-f2fs-stable-linux-4.19.y:
  f2fs: stop GC when the victim becomes fully valid
  f2fs: expose main_blkaddr in sysfs
  f2fs: choose hardlimit when softlimit is larger than hardlimit in f2fs_statfs_project()
  f2fs: Fix deadlock in f2fs_gc() context during atomic files handling
  f2fs: show f2fs instance in printk_ratelimited
  f2fs: fix potential overflow
  f2fs: fix to update dir's i_pino during cross_rename
  f2fs: support aligned pinned file
  f2fs: avoid kernel panic on corruption test
  f2fs: fix wrong description in document
  f2fs: cache global IPU bio
  f2fs: fix to avoid memory leakage in f2fs_listxattr
  f2fs: check total_segments from devices in raw_super
  f2fs: update multi-dev metadata in resize_fs
  f2fs: mark recovery flag correctly in read_raw_super_block()
  f2fs: fix to update time in lazytime mode
  vfs: don't allow writes to swap files
  mm: set S_SWAPFILE on blockdev swap devices

Bug: 146023540
Change-Id: Ia24ce5f48f245dd7ba4fd94aa00a7d84615a8b22
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
2019-12-18 10:26:14 -08:00
Darrick J. Wong
8cfd90e159 vfs: don't allow writes to swap files
Don't let userspace write to an active swap file because the kernel
effectively has a long term lease on the storage and things could get
seriously corrupted if we let this happen.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-12-02 15:03:46 -08:00
Greg Kroah-Hartman
b777b6f211 This is the 4.19.84 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl3K+EEACgkQONu9yGCS
 aT5ReA/+IrdgTosfTtLizJhq+rR7UClBC4D7rOEBv6BVpKMNRdldcaVsFy6/MO6I
 eq3keoicBpGxY8j6mp559GT7ACvXC3L6SALlwvfmGX6BHGVmwTE1Q6UmYbMfCoVY
 kkZ6YUGc9T20hfvi1aG/bXerv90phIj3mSOttpLQsiNmlqkP4o2ayA9c/ohhINZ9
 N3tEqh81vuZ9DQKdR6hY3iiqq8j3x5if5PppfdtQJMq8U7EUePHeiOrl6+j/PWXV
 UHF+BGXVJBXF7Qn3NGzLAZsI5HI+PT9ec9R5/6kFwbKnD88CR80AJpwzmPXmLjTu
 584oSQHzSQzSl5+r53SrK3EgOUdsduicGHJfO9RU1ttMAkkz+t+7LSSMDMX0qMFX
 OGJdaiQc5arKNEcgckpnNDScRAlxK7IwUWfgcyFsjgMhLKxEfAuz8+JGJEp3QF6T
 bNXupekzCnhE70j1/1MObLrTshW+BwIfYgSWwjH01Fq5h6/7IcuGWtfpuanhFj0k
 sa4J02kdCvUPNAAlXAt07E/QS/4pLEV/Lh4K86MGR2ltYQjPEyYoY9LwIJAFEyQE
 /DYUTS9GvlOitpbdUT48v/msZpOHWvhza8He/ZChN17RqFrIY9ykkPkA22moCqWB
 w7d4HTP3dkYlt00e2DAclbCMsek9rNn+oTnY1jxGKa3pXxZAKyo=
 =T5EF
 -----END PGP SIGNATURE-----

Merge 4.19.84 into android-4.19

Changes in 4.19.84
	bonding: fix state transition issue in link monitoring
	CDC-NCM: handle incomplete transfer of MTU
	ipv4: Fix table id reference in fib_sync_down_addr
	net: ethernet: octeon_mgmt: Account for second possible VLAN header
	net: fix data-race in neigh_event_send()
	net: qualcomm: rmnet: Fix potential UAF when unregistering
	net: usb: qmi_wwan: add support for DW5821e with eSIM support
	NFC: fdp: fix incorrect free object
	nfc: netlink: fix double device reference drop
	NFC: st21nfca: fix double free
	qede: fix NULL pointer deref in __qede_remove()
	net: mscc: ocelot: don't handle netdev events for other netdevs
	net: mscc: ocelot: fix NULL pointer on LAG slave removal
	ipv6: fixes rt6_probe() and fib6_nh->last_probe init
	net: hns: Fix the stray netpoll locks causing deadlock in NAPI path
	ALSA: timer: Fix incorrectly assigned timer instance
	ALSA: bebob: fix to detect configured source of sampling clock for Focusrite Saffire Pro i/o series
	ALSA: hda/ca0132 - Fix possible workqueue stall
	mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges
	mm, meminit: recalculate pcpu batch and high limits after init completes
	mm: thp: handle page cache THP correctly in PageTransCompoundMap
	mm, vmstat: hide /proc/pagetypeinfo from normal users
	dump_stack: avoid the livelock of the dump_lock
	tools: gpio: Use !building_out_of_srctree to determine srctree
	perf tools: Fix time sorting
	drm/radeon: fix si_enable_smc_cac() failed issue
	HID: wacom: generic: Treat serial number and related fields as unsigned
	soundwire: depend on ACPI
	soundwire: bus: set initial value to port_status
	arm64: Do not mask out PTE_RDONLY in pte_same()
	ceph: fix use-after-free in __ceph_remove_cap()
	ceph: add missing check in d_revalidate snapdir handling
	iio: adc: stm32-adc: fix stopping dma
	iio: imu: adis16480: make sure provided frequency is positive
	iio: srf04: fix wrong limitation in distance measuring
	ARM: sunxi: Fix CPU powerdown on A83T
	netfilter: nf_tables: Align nft_expr private data to 64-bit
	netfilter: ipset: Fix an error code in ip_set_sockfn_get()
	intel_th: pci: Add Comet Lake PCH support
	intel_th: pci: Add Jasper Lake PCH support
	x86/apic/32: Avoid bogus LDR warnings
	SMB3: Fix persistent handles reconnect
	can: usb_8dev: fix use-after-free on disconnect
	can: flexcan: disable completely the ECC mechanism
	can: c_can: c_can_poll(): only read status register after status IRQ
	can: peak_usb: fix a potential out-of-sync while decoding packets
	can: rx-offload: can_rx_offload_queue_sorted(): fix error handling, avoid skb mem leak
	can: gs_usb: gs_can_open(): prevent memory leak
	can: dev: add missing of_node_put() after calling of_get_child_by_name()
	can: mcba_usb: fix use-after-free on disconnect
	can: peak_usb: fix slab info leak
	configfs: stash the data we need into configfs_buffer at open time
	configfs_register_group() shouldn't be (and isn't) called in rmdirable parts
	configfs: new object reprsenting tree fragments
	configfs: provide exclusion between IO and removals
	configfs: fix a deadlock in configfs_symlink()
	ALSA: usb-audio: More validations of descriptor units
	ALSA: usb-audio: Simplify parse_audio_unit()
	ALSA: usb-audio: Unify the release of usb_mixer_elem_info objects
	ALSA: usb-audio: Remove superfluous bLength checks
	ALSA: usb-audio: Clean up check_input_term()
	ALSA: usb-audio: Fix possible NULL dereference at create_yamaha_midi_quirk()
	ALSA: usb-audio: remove some dead code
	ALSA: usb-audio: Fix copy&paste error in the validator
	sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
	sched/fair: Fix -Wunused-but-set-variable warnings
	usbip: Fix vhci_urb_enqueue() URB null transfer buffer error path
	usbip: Implement SG support to vhci-hcd and stub driver
	PCI: tegra: Enable Relaxed Ordering only for Tegra20 & Tegra30
	HID: google: add magnemite/masterball USB ids
	dmaengine: xilinx_dma: Fix control reg update in vdma_channel_set_config
	dmaengine: sprd: Fix the possible memory leak issue
	HID: intel-ish-hid: fix wrong error handling in ishtp_cl_alloc_tx_ring()
	RDMA/mlx5: Clear old rate limit when closing QP
	iw_cxgb4: fix ECN check on the passive accept
	RDMA/qedr: Fix reported firmware version
	net/mlx5e: TX, Fix consumer index of error cqe dump
	net/mlx5: prevent memory leak in mlx5_fpga_conn_create_cq
	scsi: qla2xxx: fixup incorrect usage of host_byte
	RDMA/uverbs: Prevent potential underflow
	net: openvswitch: free vport unless register_netdevice() succeeds
	scsi: lpfc: Honor module parameter lpfc_use_adisc
	scsi: qla2xxx: Initialized mailbox to prevent driver load failure
	netfilter: nf_flow_table: set timeout before insertion into hashes
	ipvs: don't ignore errors in case refcounting ip_vs module fails
	ipvs: move old_secure_tcp into struct netns_ipvs
	bonding: fix unexpected IFF_BONDING bit unset
	macsec: fix refcnt leak in module exit routine
	usb: fsl: Check memory resource before releasing it
	usb: gadget: udc: atmel: Fix interrupt storm in FIFO mode.
	usb: gadget: composite: Fix possible double free memory bug
	usb: dwc3: pci: prevent memory leak in dwc3_pci_probe
	usb: gadget: configfs: fix concurrent issue between composite APIs
	usb: dwc3: remove the call trace of USBx_GFLADJ
	perf/x86/amd/ibs: Fix reading of the IBS OpData register and thus precise RIP validity
	perf/x86/amd/ibs: Handle erratum #420 only on the affected CPU family (10h)
	perf/x86/uncore: Fix event group support
	USB: Skip endpoints with 0 maxpacket length
	USB: ldusb: use unsigned size format specifiers
	usbip: tools: Fix read_usb_vudc_device() error path handling
	RDMA/iw_cxgb4: Avoid freeing skb twice in arp failure case
	RDMA/hns: Prevent memory leaks of eq->buf_list
	scsi: qla2xxx: stop timer in shutdown path
	nvme-multipath: fix possible io hang after ctrl reconnect
	fjes: Handle workqueue allocation failure
	net: hisilicon: Fix "Trying to free already-free IRQ"
	net: mscc: ocelot: fix vlan_filtering when enslaving to bridge before link is up
	net: mscc: ocelot: refuse to overwrite the port's native vlan
	iommu/amd: Apply the same IVRS IOAPIC workaround to Acer Aspire A315-41
	drm/amdgpu: If amdgpu_ib_schedule fails return back the error.
	drm/amd/display: Passive DP->HDMI dongle detection fix
	hv_netvsc: Fix error handling in netvsc_attach()
	usb: dwc3: gadget: fix race when disabling ep with cancelled xfers
	NFSv4: Don't allow a cached open with a revoked delegation
	net: ethernet: arc: add the missed clk_disable_unprepare
	igb: Fix constant media auto sense switching when no cable is connected
	e1000: fix memory leaks
	pinctrl: intel: Avoid potential glitches if pin is in GPIO mode
	ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()
	pinctrl: cherryview: Fix irq_valid_mask calculation
	blkcg: make blkcg_print_stat() print stats only for online blkgs
	iio: imu: mpu6050: Add support for the ICM 20602 IMU
	iio: imu: inv_mpu6050: fix no data on MPU6050
	mm/filemap.c: don't initiate writeback if mapping has no dirty pages
	cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead
	usbip: Fix free of unallocated memory in vhci tx
	netfilter: ipset: Copy the right MAC address in hash:ip,mac IPv6 sets
	net: prevent load/store tearing on sk->sk_stamp
	iio: imu: mpu6050: Fix FIFO layout for ICM20602
	vsock/virtio: fix sock refcnt holding during the shutdown
	drm/i915: Rename gen7 cmdparser tables
	drm/i915: Disable Secure Batches for gen6+
	drm/i915: Remove Master tables from cmdparser
	drm/i915: Add support for mandatory cmdparsing
	drm/i915: Support ro ppgtt mapped cmdparser shadow buffers
	drm/i915: Allow parsing of unsized batches
	drm/i915: Add gen9 BCS cmdparsing
	drm/i915/cmdparser: Use explicit goto for error paths
	drm/i915/cmdparser: Add support for backward jumps
	drm/i915/cmdparser: Ignore Length operands during command matching
	drm/i915: Lower RM timeout to avoid DSI hard hangs
	drm/i915/gen8+: Add RC6 CTX corruption WA
	drm/i915/cmdparser: Fix jump whitelist clearing
	KVM: x86: use Intel speculation bugs and features as derived in generic x86 code
	x86/msr: Add the IA32_TSX_CTRL MSR
	x86/cpu: Add a helper function x86_read_arch_cap_msr()
	x86/cpu: Add a "tsx=" cmdline option with TSX disabled by default
	x86/speculation/taa: Add mitigation for TSX Async Abort
	x86/speculation/taa: Add sysfs reporting for TSX Async Abort
	kvm/x86: Export MDS_NO=0 to guests when TSX is enabled
	x86/tsx: Add "auto" option to the tsx= cmdline parameter
	x86/speculation/taa: Add documentation for TSX Async Abort
	x86/tsx: Add config options to set tsx=on|off|auto
	x86/speculation/taa: Fix printing of TAA_MSG_SMT on IBRS_ALL CPUs
	x86/bugs: Add ITLB_MULTIHIT bug infrastructure
	x86/cpu: Add Tremont to the cpu vulnerability whitelist
	cpu/speculation: Uninline and export CPU mitigations helpers
	Documentation: Add ITLB_MULTIHIT documentation
	kvm: x86, powerpc: do not allow clearing largepages debugfs entry
	kvm: Convert kvm_lock to a mutex
	kvm: mmu: Do not release the page inside mmu_set_spte()
	KVM: x86: make FNAME(fetch) and __direct_map more similar
	KVM: x86: remove now unneeded hugepage gfn adjustment
	KVM: x86: change kvm_mmu_page_get_gfn BUG_ON to WARN_ON
	KVM: x86: add tracepoints around __direct_map and FNAME(fetch)
	KVM: vmx, svm: always run with EFER.NXE=1 when shadow paging is active
	kvm: mmu: ITLB_MULTIHIT mitigation
	kvm: Add helper function for creating VM worker threads
	kvm: x86: mmu: Recovery of shattered NX large pages
	Linux 4.19.84

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I7a820f00c4b868ed677bb49613f835b7e67a3a06
2019-11-14 14:15:21 +08:00
Konstantin Khlebnikov
d3b3c0a146 mm/filemap.c: don't initiate writeback if mapping has no dirty pages
commit c3aab9a0bd91b696a852169479b7db1ece6cbf8c upstream.

Functions like filemap_write_and_wait_range() should do nothing if inode
has no dirty pages or pages currently under writeback.  But they anyway
construct struct writeback_control and this does some atomic operations if
CONFIG_CGROUP_WRITEBACK=y - on fast path it locks inode->i_lock and
updates state of writeback ownership, on slow path might be more work.
Current this path is safely avoided only when inode mapping has no pages.

For example generic_file_read_iter() calls filemap_write_and_wait_range()
at each O_DIRECT read - pretty hot path.

This patch skips starting new writeback if mapping has no dirty tags set.
If writeback is already in progress filemap_write_and_wait_range() will
wait for it.

Link: http://lkml.kernel.org/r/156378816804.1087.8607636317907921438.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12 19:21:20 +01:00
Greg Kroah-Hartman
f232ce65ba This is the 4.19.62 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl09QMoACgkQONu9yGCS
 aT7bBQ//dApu9DLQ3tyfc+tucIc3ViOcif3zlfLWRqClcLXw1HF63/Q/u40kYuf1
 TOnMkVXFWmpeGPgz2/tWw9TN+ibHaPBfyMoPGw/Ehs5dB8qn/F4AnLqkePFimIYQ
 +YOZTbvPqW/xSh8I79i6y2NxrZLGvO9FDY9YUblBKz4lKWc+HULXn9RmjtxlBOLE
 s6epoM0tWKPPeM5MyJIwT+9MZ0o3Cj89YNv+PqYLQSzcvbUbHDUiMt+DO4rqSyyf
 6fQe4mSX4tk+tbviyEruvKH3kJg0rWS1Maw1/h+AOJ8Gmr2WZ+vYBorPEdwdYu5K
 eootc3Jkj4wtcPsvKtNzWVPqJxdd5j651vS8/bxA0DmOVQM7A8dUBWKtMJGGG7nM
 nIRvxoMo+kv9QE/DJpeQ/apE9WnBflSqw6/DYdqs11gTX4E+beR75aRoVD1ue0lS
 if5CfnM0BTiOO1rMdMo42rzcBfh2DLt5a18WXsMmYOaSMGEJuF0KjgaGc5E4N00A
 QlqIJ7PKk+Kb53jz5oLV2vB/SXNvAQRLAvMTMH2Mst4veDonYyiVTfp6C+r8DJYc
 hvyaF1FoCMTWV5XsINmM2mlJ2+/G0nV5kLDmVPUHiFxE0JXnPQ8iCx9a96Ub+Zim
 F//muwohNIUa6EEN6AwrEUsoyVZ7cP/aR5wHQ9PAY3RV5nis474=
 =zWcX
 -----END PGP SIGNATURE-----

Merge 4.19.62 into android-4.19

Changes in 4.19.62
	bnx2x: Prevent load reordering in tx completion processing
	caif-hsi: fix possible deadlock in cfhsi_exit_module()
	hv_netvsc: Fix extra rcu_read_unlock in netvsc_recv_callback()
	igmp: fix memory leak in igmpv3_del_delrec()
	ipv4: don't set IPv6 only flags to IPv4 addresses
	ipv6: rt6_check should return NULL if 'from' is NULL
	ipv6: Unlink sibling route in case of failure
	net: bcmgenet: use promisc for unsupported filters
	net: dsa: mv88e6xxx: wait after reset deactivation
	net: make skb_dst_force return true when dst is refcounted
	net: neigh: fix multiple neigh timer scheduling
	net: openvswitch: fix csum updates for MPLS actions
	net: phy: sfp: hwmon: Fix scaling of RX power
	net: stmmac: Re-work the queue selection for TSO packets
	nfc: fix potential illegal memory access
	r8169: fix issue with confused RX unit after PHY power-down on RTL8411b
	rxrpc: Fix send on a connected, but unbound socket
	sctp: fix error handling on stream scheduler initialization
	sky2: Disable MSI on ASUS P6T
	tcp: be more careful in tcp_fragment()
	tcp: fix tcp_set_congestion_control() use from bpf hook
	tcp: Reset bytes_acked and bytes_received when disconnecting
	vrf: make sure skb->data contains ip header to make routing
	net/mlx5e: IPoIB, Add error path in mlx5_rdma_setup_rn
	macsec: fix use-after-free of skb during RX
	macsec: fix checksumming after decryption
	netrom: fix a memory leak in nr_rx_frame()
	netrom: hold sock when setting skb->destructor
	net_sched: unset TCQ_F_CAN_BYPASS when adding filters
	net/tls: make sure offload also gets the keys wiped
	sctp: not bind the socket in sctp_connect
	net: bridge: mcast: fix stale nsrcs pointer in igmp3/mld2 report handling
	net: bridge: mcast: fix stale ipv6 hdr pointer when handling v6 query
	net: bridge: don't cache ether dest pointer on input
	net: bridge: stp: don't cache eth dest pointer before skb pull
	dma-buf: balance refcount inbalance
	dma-buf: Discard old fence_excl on retrying get_fences_rcu for realloc
	gpio: davinci: silence error prints in case of EPROBE_DEFER
	MIPS: lb60: Fix pin mappings
	perf/core: Fix exclusive events' grouping
	perf/core: Fix race between close() and fork()
	ext4: don't allow any modifications to an immutable file
	ext4: enforce the immutable flag on open files
	mm: add filemap_fdatawait_range_keep_errors()
	jbd2: introduce jbd2_inode dirty range scoping
	ext4: use jbd2_inode dirty range scoping
	ext4: allow directory holes
	KVM: nVMX: do not use dangling shadow VMCS after guest reset
	KVM: nVMX: Clear pending KVM_REQ_GET_VMCS12_PAGES when leaving nested
	mm: vmscan: scan anonymous pages on file refaults
	net: sched: verify that q!=NULL before setting q->flags
	Linux 4.19.62

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I2eb23bda9d5294a5c874fe4f403934fd99e84661
2019-07-28 08:43:04 +02:00
Ross Zwisler
4becd6c11e mm: add filemap_fdatawait_range_keep_errors()
commit aa0bfcd939c30617385ffa28682c062d78050eba upstream.

In the spirit of filemap_fdatawait_range() and
filemap_fdatawait_keep_errors(), introduce
filemap_fdatawait_range_keep_errors() which both takes a range upon
which to wait and does not clear errors from the address space.

Signed-off-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-28 08:29:29 +02:00
Christoph Hellwig
2bf50b5a8e FROMLIST: mm: don't cast ->readpage to filler_t for do_read_cache_page
We can just pass a NULL filler and do the right thing inside of
do_read_cache_page based on the NULL parameter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Bug: 133186739
Change-Id: I720afb2e0c28d5649b62ed3852b48ac63dc0d7a8
Link: https://lkml.org/lkml/2019/5/1/294
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2019-05-20 17:46:42 -07:00
Linus Torvalds
1d1d2f0762 UPSTREAM: filemap: add a comment about FAULT_FLAG_RETRY_NOWAIT behavior
I thought Josef Bacik's patch to drop the mmap_sem was buggy, because
when looking at the error cases, there was one case where we returned
VM_FAULT_RETRY without actually dropping the mmap_sem.

Josef had to explain to me (using small words) that yes, that's actually
what we're supposed to do, and his patch was correct.  Which not only
convinced me he knew what he was doing and I should stop arguing with
him, but also that I should add a comment to the case I was confused
about.

Bug: 124328118
Change-Id: I28390f4924110ee259c20304753bd64a8f6121f8
Patiently-pointed-out-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8b0f9fa2e02dc95216577c3387b0707c5f60fbaf)
Signed-off-by: Minchan Kim <minchan@google.com>
2019-03-29 16:01:12 +09:00
Josef Bacik
9f5fa1fd7a BACKPORT: filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock.  The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.

The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.

Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.

Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions.  Consider some large company that does a
lot of web traffic.  This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons.  Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.

Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time.  This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur.  Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.

This patch also adds a new helper for locking the page with the mmap_sem
dropped.  This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked.  However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page.  This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.

Test: fio mmap ioengine with 8-thread on 4cpu under memory pressure for
stability check
Test: locking holding latency measure with lockstat for performance
check
Bug: 124328118
Change-Id: I0af79cbe6f2ef9fe3927961e042347490b892684
[josef@toxicpanda.com: v6]
  Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
  Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 6b4c9f4469819a0c1a38a0a4541337e0f9bf6c11)
Signed-off-by: Minchan Kim <minchan@google.com>
2019-03-29 16:01:06 +09:00
Josef Bacik
76b6ee8f38 BACKPORT: filemap: kill page_cache_read usage in filemap_fault
Patch series "drop the mmap_sem when doing IO in the fault path", v6.

Now that we have proper isolation in place with cgroups2 we have started
going through and fixing the various priority inversions.  Most are all
gone now, but this one is sort of weird since it's not necessarily a
priority inversion that happens within the kernel, but rather because of
something userspace does.

We have giant applications that we want to protect, and parts of these
giant applications do things like watch the system state to determine how
healthy the box is for load balancing and such.  This involves running
'ps' or other such utilities.  These utilities will often walk
/proc/<pid>/whatever, and these files can sometimes need to
down_read(&task->mmap_sem).  Not usually a big deal, but we noticed when
we are stress testing that sometimes our protected application has latency
spikes trying to get the mmap_sem for tasks that are in lower priority
cgroups.

This is because any down_write() on a semaphore essentially turns it into
a mutex, so even if we currently have it held for reading, any new readers
will not be allowed on to keep from starving the writer.  This is fine,
except a lower priority task could be stuck doing IO because it has been
throttled to the point that its IO is taking much longer than normal.  But
because a higher priority group depends on this completing it is now stuck
behind lower priority work.

In order to avoid this particular priority inversion we want to use the
existing retry mechanism to stop from holding the mmap_sem at all if we
are going to do IO.  This already exists in the read case sort of, but
needed to be extended for more than just grabbing the page lock.  With
io.latency we throttle at submit_bio() time, so the readahead stuff can
block and even page_cache_read can block, so all these paths need to have
the mmap_sem dropped.

The other big thing is ->page_mkwrite.  btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation.  We use the same retry method as the read path, and
simply cache the page and verify the page is still setup properly the next
pass through ->page_mkwrite().

I've tested these patches with xfstests and there are no regressions.

This patch (of 3):

If we do not have a page at filemap_fault time we'll do this weird forced
page_cache_read thing to populate the page, and then drop it again and
loop around and find it.  This makes for 2 ways we can read a page in
filemap_fault, and it's not really needed.  Instead add a FGP_FOR_MMAP
flag so that pagecache_get_page() will return a unlocked page that's in
pagecache.  Then use the normal page locking and readpage logic already in
filemap_fault.  This simplifies the no page in page cache case
significantly.

Bug: 124328118
Change-Id: I6e04df24ba52bb96c05c9386248983f477ab78e5
[akpm@linux-foundation.org: fix comment text]
[josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
  Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit a75d4c33377277b6034dd1e2663bce444f952c14)
Signed-off-by: Minchan Kim <minchan@google.com>
2019-03-29 16:01:02 +09:00
Josef Bacik
0cb18b7b49 UPSTREAM: filemap: pass vm_fault to the mmap ra helpers
All of the arguments to these functions come from the vmf.

Cut down on the amount of arguments passed by simply passing in the vmf
to these two helpers.

Bug: 124328118
Change-Id: I874f8320d8393f2e39aab030fcf43e51c882e90a
Link: http://lkml.kernel.org/r/20181211173801.29535-3-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 2a1180f1bd389e9d47693e5eb384b95f482d8d19)
Signed-off-by: Minchan Kim <minchan@google.com>
2019-03-29 16:00:56 +09:00
Johannes Weiner
e550f94252 UPSTREAM: psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills.  In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively.  Stall states are aggregate versions of the per-task delay
accounting delays:

       cpu: some tasks are runnable but not executing on a CPU
       memory: tasks are reclaiming, or waiting for swapin or thrashing cache
       io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit.  They can also indicate when the system
is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states.  Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime.  A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).

[hannes@cmpxchg.org: doc fixlet, per Randy]
  Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
  Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
  Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
  Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit eb414681d5a07d28d2ff90dc05f69ec6b232ebd2)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Id00d23c977169b0c4636d92016fc1fee0274be05
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
580f26b93e UPSTREAM: delayacct: track delays from thrashing cache pages
Delay accounting already measures the time a task spends in direct reclaim
and waiting for swapin, but in low memory situations tasks spend can spend
a significant amount of their time waiting on thrashing page cache.  This
isn't tracked right now.

To know the full impact of memory contention on an individual task,
measure the delay when waiting for a recently evicted active cache page to
read back into memory.

Also update tools/accounting/getdelays.c:

     [hannes@computer accounting]$ sudo ./getdelays -d -p 1
     print delayacct stats ON
     PID     1

     CPU             count     real total  virtual total    delay total  delay average
                     50318      745000000      847346785      400533713          0.008ms
     IO              count    delay total  delay average
                       435      122601218              0ms
     SWAP            count    delay total  delay average
                         0              0              0ms
     RECLAIM         count    delay total  delay average
                         0              0              0ms
     THRASHING       count    delay total  delay average
                        19       12621439              0ms

Link: http://lkml.kernel.org/r/20180828172258.3185-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit b1d29ba82cf2bc784f4c963ddd6a2cf29e229b33)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I259f693987cf04e6a52ee7e8accf55a17e0de005
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:26 -07:00
Johannes Weiner
b4a56abdf7 UPSTREAM: mm: workingset: tell cache transitions from workingset thrashing
Refaults happen during transitions between workingsets as well as in-place
thrashing.  Knowing the difference between the two has a range of
applications, including measuring the impact of memory shortage on the
system performance, as well as the ability to smarter balance pressure
between the filesystem cache and the swap-backed workingset.

During workingset transitions, inactive cache refaults and pushes out
established active cache.  When that active cache isn't stale, however,
and also ends up refaulting, that's bonafide thrashing.

Introduce a new page flag that tells on eviction whether the page has been
active or not in its lifetime.  This bit is then stored in the shadow
entry, to classify refaults as transitioning or thrashing.

How many page->flags does this leave us with on 32-bit?

	20 bits are always page flags

	21 if you have an MMU

	23 with the zone bits for DMA, Normal, HighMem, Movable

	29 with the sparsemem section bits

	30 if PAE is enabled

	31 with this patch.

So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes.  If
that's not enough, the system can switch to discontigmem and re-gain the 6
or 7 sparsemem section bits.

Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 8508cf3ffad4defa202b303e5b6379efc4cd9054)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I71df060dce5590a3c654f9a0e8e54deeb74b64c2
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:26 -07:00
Souptick Joarder
2bcd6454ba mm: use new return type vm_fault_t
Use new return type vm_fault_t for fault handler in struct
vm_operations_struct.  For now, this is just documenting that the
function returns a VM_FAULT value rather than an errno.  Once all
instances are converted, vm_fault_t will become a distinct type.

Link: http://lkml.kernel.org/r/20180511190542.GA2412@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:36 -07:00
Matthew Wilcox
abc1be13fd mm/filemap.c: fix NULL pointer in page_cache_tree_insert()
f2fs specifies the __GFP_ZERO flag for allocating some of its pages.
Unfortunately, the page cache also uses the mapping's GFP flags for
allocating radix tree nodes.  It always masked off the __GFP_HIGHMEM
flag, and masks off __GFP_ZERO in some paths, but not all.  That causes
radix tree nodes to be allocated with a NULL list_head, which causes
backtraces like:

  __list_del_entry+0x30/0xd0
  list_lru_del+0xac/0x1ac
  page_cache_tree_insert+0xd8/0x110

The __GFP_DMA and __GFP_DMA32 flags would also be able to sneak through
if they are ever used.  Fix them all by using GFP_RECLAIM_MASK at the
innermost location, and remove it from earlier in the callchain.

Link: http://lkml.kernel.org/r/20180411060320.14458-2-willy@infradead.org
Fixes: 449dd6984d ("mm: keep page cache radix tree nodes in check")
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reported-by: Chris Fries <cfries@google.com>
Debugged-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20 17:18:36 -07:00
Arnd Bergmann
453972283d mm/filemap.c: provide dummy filemap_page_mkwrite() for NOMMU
Building orangefs on MMU-less machines now results in a link error
because of the newly introduced use of the filemap_page_mkwrite()
function:

  ERROR: "filemap_page_mkwrite" [fs/orangefs/orangefs.ko] undefined!

This adds a dummy version for it, similar to the existing
generic_file_mmap and generic_file_readonly_mmap stubs in the same file,
to avoid the link error without adding #ifdefs in each file system that
uses these.

Link: http://lkml.kernel.org/r/20180409105555.2439976-1-arnd@arndb.de
Fixes: a5135eeab2 ("orangefs: implement vm_ops->fault")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Martin Brandenburg <martin@omnibond.com>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-13 17:10:27 -07:00
Matthew Wilcox
b93b016313 page cache: use xa_lock
Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root.  Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.

[willy@infradead.org: fix nds32, fs/dax.c]
  Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:39 -07:00
Yang Shi
2b9fceb3b4 mm/filemap.c: remove include of hardirq.h
in_atomic() has been moved to include/linux/preempt.h, and the filemap.c
doesn't use in_atomic() directly at all, so it sounds unnecessary to
include hardirq.h.

Link: http://lkml.kernel.org/r/1509985319-38633-1-git-send-email-yang.s@alibaba-inc.com
Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:36 -08:00
Linus Torvalds
487e2c9f44 AFS development
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAWgm9V/Sw1s6N8H32AQK5mQ//QGUDZLXsUPCtq0XJq0V+r4MUjNp9tCZR
 htiuNrEkHSyPpYgCcQ2Aqdl9kndwVXcE7lWT99mp/a0zwNAsp9GOGVhCXUd5R86G
 XlrBuUYVvBJk18tDsUNWdjRQ0gMHgQSlEnEbsaGiU1bVrpXatI9hL8qoeO78Iy7+
 eaJUQLCuCVJq7qMQGhC0hg338vmHVeYhnViXIxq+HFjsMmR9IVanuK+sQr6NSJxS
 F6RkPxBUPWkRVMHmxTLWj/XSHZwtwu+Mnc/UFYsAPLKEbY0cIohsI8EgfE8U7geU
 yRVnu3MIOXUXUrZizj9SwVYWdJfneRlINqMbHIO8QXMKR38tnQ0C2/7bgBsXiNPv
 YdiAyeqL4nM+JthV/rgA3hWgupwBlSb4ubclTphDNxMs5MBIUIK3XUt9GOXDDUZz
 2FT/FdrphM2UORaI2AEOi4Q0/nHdin+3rld8fjV0Ree/TPNXwcrOmvy8yGnxFCEp
 5b7YLwKrffZGnnS965dhZlnFR6hjndmzFgHdyRrJwc80hXi1Q/+W4F19MoYkkoVK
 G/gLvD3FbmygmFnjCik9TjUrro6vQxo56H/TuWgHTvYriNGH+D/D7EGUwg4GiXZZ
 +7vrNw660uXmZiu9i0YacCRyD8lvm7QpmWLb+uHwzfsBE1+C8UetyQ+egSWVdWJO
 KwPspygWXD4=
 =3vy0
 -----END PGP SIGNATURE-----

Merge tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull AFS updates from David Howells:
 "kAFS filesystem driver overhaul.

  The major points of the overhaul are:

   (1) Preliminary groundwork is laid for supporting network-namespacing
       of kAFS. The remainder of the namespacing work requires some way
       to pass namespace information to submounts triggered by an
       automount. This requires something like the mount overhaul that's
       in progress.

   (2) sockaddr_rxrpc is used in preference to in_addr for holding
       addresses internally and add support for talking to the YFS VL
       server. With this, kAFS can do everything over IPv6 as well as
       IPv4 if it's talking to servers that support it.

   (3) Callback handling is overhauled to be generally passive rather
       than active. 'Callbacks' are promises by the server to tell us
       about data and metadata changes. Callbacks are now checked when
       we next touch an inode rather than actively going and looking for
       it where possible.

   (4) File access permit caching is overhauled to store the caching
       information per-inode rather than per-directory, shared over
       subordinate files. Whilst older AFS servers only allow ACLs on
       directories (shared to the files in that directory), newer AFS
       servers break that restriction.

       To improve memory usage and to make it easier to do mass-key
       removal, permit combinations are cached and shared.

   (5) Cell database management is overhauled to allow lighter locks to
       be used and to make cell records autonomous state machines that
       look after getting their own DNS records and cleaning themselves
       up, in particular preventing races in acquiring and relinquishing
       the fscache token for the cell.

   (6) Volume caching is overhauled. The afs_vlocation record is got rid
       of to simplify things and the superblock is now keyed on the cell
       and the numeric volume ID only. The volume record is tied to a
       superblock and normal superblock management is used to mediate
       the lifetime of the volume fscache token.

   (7) File server record caching is overhauled to make server records
       independent of cells and volumes. A server can be in multiple
       cells (in such a case, the administrator must make sure that the
       VL services for all cells correctly reflect the volumes shared
       between those cells).

       Server records are now indexed using the UUID of the server
       rather than the address since a server can have multiple
       addresses.

   (8) File server rotation is overhauled to handle VMOVED, VBUSY (and
       similar), VOFFLINE and VNOVOL indications and to handle rotation
       both of servers and addresses of those servers. The rotation will
       also wait and retry if the server says it is busy.

   (9) Data writeback is overhauled. Each inode no longer stores a list
       of modified sections tagged with the key that authorised it in
       favour of noting the modified region of a page in page->private
       and storing a list of keys that made modifications in the inode.

       This simplifies things and allows other keys to be used to
       actually write to the server if a key that made a modification
       becomes useless.

  (10) Writable mmap() is implemented. This allows a kernel to be build
       entirely on AFS.

  Note that Pre AFS-3.4 servers are no longer supported, though this can
  be added back if necessary (AFS-3.4 was released in 1998)"

* tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (35 commits)
  afs: Protect call->state changes against signals
  afs: Trace page dirty/clean
  afs: Implement shared-writeable mmap
  afs: Get rid of the afs_writeback record
  afs: Introduce a file-private data record
  afs: Use a dynamic port if 7001 is in use
  afs: Fix directory read/modify race
  afs: Trace the sending of pages
  afs: Trace the initiation and completion of client calls
  afs: Fix documentation on # vs % prefix in mount source specification
  afs: Fix total-length calculation for multiple-page send
  afs: Only progress call state at end of Tx phase from rxrpc callback
  afs: Make use of the YFS service upgrade to fully support IPv6
  afs: Overhaul volume and server record caching and fileserver rotation
  afs: Move server rotation code into its own file
  afs: Add an address list concept
  afs: Overhaul cell database management
  afs: Overhaul permit caching
  afs: Overhaul the callback handling
  afs: Rename struct afs_call server member to cm_server
  ...
2017-11-16 11:41:22 -08:00
Mel Gorman
453f85d43f mm: remove __GFP_COLD
As the page free path makes no distinction between cache hot and cold
pages, there is no real useful ordering of pages in the free list that
allocation requests can take advantage of.  Juding from the users of
__GFP_COLD, it is likely that a number of them are the result of copying
other sites instead of actually measuring the impact.  Remove the
__GFP_COLD parameter which simplifies a number of paths in the page
allocator.

This is potentially controversial but bear in mind that the size of the
per-cpu pagelists versus modern cache sizes means that the whole per-cpu
list can often fit in the L3 cache.  Hence, there is only a potential
benefit for microbenchmarks that alloc/free pages in a tight loop.  It's
even worse when THP is taken into account which has little or no chance
of getting a cache-hot page as the per-cpu list is bypassed and the
zeroing of multiple pages will thrash the cache anyway.

The truncate microbenchmarks are not shown as this patch affects the
allocation path and not the free path.  A page fault microbenchmark was
tested but it showed no sigificant difference which is not surprising
given that the __GFP_COLD branches are a miniscule percentage of the
fault path.

Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Mel Gorman
8667982014 mm, pagevec: remove cold parameter for pagevecs
Every pagevec_init user claims the pages being released are hot even in
cases where it is unlikely the pages are hot.  As no one cares about the
hotness of pages being released to the allocator, just ditch the
parameter.

No performance impact is expected as the overhead is marginal.  The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.

Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Mel Gorman
c7df8ad291 mm, truncate: do not check mapping for every page being truncated
During truncation, the mapping has already been checked for shmem and
dax so it's known that workingset_update_node is required.

This patch avoids the checks on mapping for each page being truncated.
In all other cases, a lookup helper is used to determine if
workingset_update_node() needs to be called.  The one danger is that the
API is slightly harder to use as calling workingset_update_node directly
without checking for dax or shmem mappings could lead to surprises.
However, the API rarely needs to be used and hopefully the comment is
enough to give people the hint.

sparsetruncate (tiny)
                              4.14.0-rc4             4.14.0-rc4
                             oneirq-v1r1        pickhelper-v1r1
Min          Time      141.00 (   0.00%)      140.00 (   0.71%)
1st-qrtle    Time      142.00 (   0.00%)      141.00 (   0.70%)
2nd-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
3rd-qrtle    Time      143.00 (   0.00%)      143.00 (   0.00%)
Max-90%      Time      144.00 (   0.00%)      144.00 (   0.00%)
Max-95%      Time      147.00 (   0.00%)      145.00 (   1.36%)
Max-99%      Time      195.00 (   0.00%)      191.00 (   2.05%)
Max          Time      230.00 (   0.00%)      205.00 (  10.87%)
Amean        Time      144.37 (   0.00%)      143.82 (   0.38%)
Stddev       Time       10.44 (   0.00%)        9.00 (  13.74%)
Coeff        Time        7.23 (   0.00%)        6.26 (  13.41%)
Best99%Amean Time      143.72 (   0.00%)      143.34 (   0.26%)
Best95%Amean Time      142.37 (   0.00%)      142.00 (   0.26%)
Best90%Amean Time      142.19 (   0.00%)      141.85 (   0.24%)
Best75%Amean Time      141.92 (   0.00%)      141.58 (   0.24%)
Best50%Amean Time      141.69 (   0.00%)      141.31 (   0.27%)
Best25%Amean Time      141.38 (   0.00%)      140.97 (   0.29%)

As you'd expect, the gain is marginal but it can be detected.  The
differences in bonnie are all within the noise which is not surprising
given the impact on the microbenchmark.

radix_tree_update_node_t is a callback for some radix operations that
optionally passes in a private field.  The only user of the callback is
workingset_update_node and as it no longer requires a mapping, the
private field is removed.

Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Jan Kara
aa65c29ce1 mm: batch radix tree operations when truncating pages
Currently we remove pages from the radix tree one by one.  To speed up
page cache truncation, lock several pages at once and free them in one
go.  This allows us to batch radix tree operations in a more efficient
way and also save round-trips on mapping->tree_lock.  As a result we
gain about 20% speed improvement in page cache truncation.

Data from a simple benchmark timing 10000 truncates of 1024 pages (on
ext4 on ramdisk but the filesystem is barely visible in the profiles).
The range shows 1% and 95% percentiles of the measured times:

  4.14-rc2	4.14-rc2 + batched truncation
  248-256	209-219
  249-258	209-217
  248-255	211-239
  248-255	209-217
  247-256	210-218

[jack@suse.cz: convert delete_from_page_cache_batch() to pagevec]
  Link: http://lkml.kernel.org/r/20171018111648.13714-1-jack@suse.cz
[akpm@linux-foundation.org: move struct pagevec forward declaration to top-of-file]
Link: http://lkml.kernel.org/r/20171010151937.26984-8-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Jan Kara
5ecc4d852c mm: factor out checks and accounting from __delete_from_page_cache()
Move checks and accounting updates from __delete_from_page_cache() into
a separate function.  We will reuse it when batching page cache
truncation operations.

Link: http://lkml.kernel.org/r/20171010151937.26984-7-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Jan Kara
2300638b12 mm: move clearing of page->mapping to page_cache_tree_delete()
Clearing of page->mapping makes sense in page_cache_tree_delete() as
well and it will help us with batching things this way.

Link: http://lkml.kernel.org/r/20171010151937.26984-6-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Jan Kara
76253fbc8f mm: move accounting updates before page_cache_tree_delete()
Move updates of various counters before page_cache_tree_delete() call.
It will be easier to batch things this way and there is no difference
whether the counters get updated before or after removal from the radix
tree.

Link: http://lkml.kernel.org/r/20171010151937.26984-5-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Jan Kara
59c66c5f8c mm: factor out page cache page freeing into a separate function
Factor out page freeing from delete_from_page_cache() into a separate
function.  We will need to call the same when batching pagecache
deletion operations.

invalidate_complete_page2() and replace_page_cache_page() might want to
call this function as well however they currently don't seem to handle
THPs so it's unnecessary for them to take the hit of checking whether a
page is THP or not.

Link: http://lkml.kernel.org/r/20171010151937.26984-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Jan Kara
67fd707f46 mm: remove nr_pages argument from pagevec_lookup_{,range}_tag()
All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages.  Just drop the argument.

Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Jan Kara
312e9d2f70 mm: use pagevec_lookup_range_tag() in __filemap_fdatawait_range()
Use pagevec_lookup_range_tag() in __filemap_fdatawait_range() as it is
interested only in pages from given range.  Remove unnecessary code
resulting from this.

Link: http://lkml.kernel.org/r/20171009151359.31984-11-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Jan Kara
72b045aecd mm: implement find_get_pages_range_tag()
Patch series "Ranged pagevec tagged lookup", v3.

In this series I provide a ranged variant of pagevec_lookup_tag() and
use it in places where it makes sense.  This series removes some common
code and it also has a potential for speeding up some operations
similarly as for pagevec_lookup_range() (but for now I can think of only
artificial cases where this happens).

This patch (of 16):

Implement a variant of find_get_pages_tag() that stops iterating at
given index.  Lots of users of this function (through pagevec_lookup())
actually want a range lookup and all of them are currently open-coding
this.

Also create corresponding pagevec_lookup_range_tag() function.

Link: http://lkml.kernel.org/r/20171009151359.31984-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: David Howells <dhowells@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Steve French <sfrench@samba.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:03 -08:00
David Howells
4343d00872 afs: Get rid of the afs_writeback record
Get rid of the afs_writeback record that kAFS is using to match keys with
writes made by that key.

Instead, keep a list of keys that have a file open for writing and/or
sync'ing and iterate through those.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13 15:38:20 +00:00
Jeff Layton
f4e222c56c mm: have filemap_check_and_advance_wb_err clear AS_EIO/AS_ENOSPC
Eryu noticed that he could sometimes get a leftover error reported when
it shouldn't be on fsync with ext2 and non-journalled ext4.

The problem is that writeback_single_inode still uses filemap_fdatawait.
That picks up a previously set AS_EIO flag, which would ordinarily have
been cleared before.

Since we're mostly using this function as a replacement for
filemap_check_errors, have filemap_check_and_advance_wb_err clear AS_EIO
and AS_ENOSPC when reporting an error.  That should allow the new
function to better emulate the behavior of the old with respect to these
flags.

Link: http://lkml.kernel.org/r/20170922133331.28812-1-jlayton@kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reported-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:24 -07:00
Lukas Czerner
332391a993 fs: Fix page cache inconsistency when mixing buffered and AIO DIO
Currently when mixing buffered reads and asynchronous direct writes it
is possible to end up with the situation where we have stale data in the
page cache while the new data is already written to disk. This is
permanent until the affected pages are flushed away. Despite the fact
that mixing buffered and direct IO is ill-advised it does pose a thread
for a data integrity, is unexpected and should be fixed.

Fix this by deferring completion of asynchronous direct writes to a
process context in the case that there are mapped pages to be found in
the inode. Later before the completion in dio_complete() invalidate
the pages in question. This ensures that after the completion the pages
in the written area are either unmapped, or populated with up-to-date
data. Also do the same for the iomap case which uses
iomap_dio_complete() instead.

This has a side effect of deferring the completion to a process context
for every AIO DIO that happens on inode that has pages mapped. However
since the consensus is that this is ill-advised practice the performance
implication should not be a problem.

This was based on proposal from Jeff Moyer, thanks!

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
Linus Torvalds
e253d98f5b Merge branch 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull nowait read support from Al Viro:
 "Support IOCB_NOWAIT for buffered reads and block devices"

* 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  block_dev: support RFW_NOWAIT on block device nodes
  fs: support RWF_NOWAIT for buffered reads
  fs: support IOCB_NOWAIT in generic_file_buffered_read
  fs: pass iocb to do_generic_file_read
2017-09-14 19:29:55 -07:00
Tim Chen
11a19c7b09 sched/wait: Introduce wakeup boomark in wake_up_page_bit
Now that we have added breaks in the wait queue scan and allow bookmark
on scan position, we put this logic in the wake_up_page_bit function.

We can have very long page wait list in large system where multiple
pages share the same wait list. We break the wake up walk here to allow
other cpus a chance to access the list, and not to disable the interrupts
when traversing the list for too long.  This reduces the interrupt and
rescheduling latency, and excessive page wait queue lock hold time.

[ v2: Remove bookmark_wake_function ]

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-14 09:56:18 -07:00
Linus Torvalds
d34fc1adf0 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - various misc bits

 - DAX updates

 - OCFS2

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (119 commits)
  mm,fork: introduce MADV_WIPEONFORK
  x86,mpx: make mpx depend on x86-64 to free up VMA flag
  mm: add /proc/pid/smaps_rollup
  mm: hugetlb: clear target sub-page last when clearing huge page
  mm: oom: let oom_reap_task and exit_mmap run concurrently
  swap: choose swap device according to numa node
  mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
  mm, oom: do not rely on TIF_MEMDIE for memory reserves access
  z3fold: use per-cpu unbuddied lists
  mm, swap: don't use VMA based swap readahead if HDD is used as swap
  mm, swap: add sysfs interface for VMA based swap readahead
  mm, swap: VMA based swap readahead
  mm, swap: fix swap readahead marking
  mm, swap: add swap readahead hit statistics
  mm/vmalloc.c: don't reinvent the wheel but use existing llist API
  mm/vmstat.c: fix wrong comment
  selftests/memfd: add memfd_create hugetlbfs selftest
  mm/shmem: add hugetlbfs support to memfd_create()
  mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
  mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
  ...
2017-09-06 20:49:49 -07:00
Jan Kara
f7b6804687 mm: use find_get_pages_range() in filemap_range_has_page()
We want only pages from given range in filemap_range_has_page(),
furthermore we want at most a single page.

So use find_get_pages_range() instead of pagevec_lookup() and remove
unnecessary code.

Link: http://lkml.kernel.org/r/20170726114704.7626-10-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:27 -07:00
Jan Kara
b947cee4b9 mm: implement find_get_pages_range()
Implement a variant of find_get_pages() that stops iterating at given
index.  This may be substantial performance gain if the mapping is
sparse.  See following commit for details.  Furthermore lots of users of
this function (through pagevec_lookup()) actually want a range lookup
and all of them are currently open-coding this.

Also create corresponding pagevec_lookup_range() function.

Link: http://lkml.kernel.org/r/20170726114704.7626-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:26 -07:00
Jan Kara
d72dc8a25a mm: make pagevec_lookup() update index
Make pagevec_lookup() (and underlying find_get_pages()) update index to
the next page where iteration should continue.  Most callers want this
and also pagevec_lookup_tag() already does this.

Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:26 -07:00
Ross Zwisler
d01ad197ac dax: remove DAX code from page_cache_tree_insert()
Now that we no longer insert struct page pointers in DAX radix trees we
can remove the special casing for DAX in page_cache_tree_insert().

This also allows us to make dax_wake_mapping_entry_waiter() local to
fs/dax.c, removing it from dax.h.

Link: http://lkml.kernel.org/r/20170724170616.25810-5-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
Linus Torvalds
ec3604c7a5 Writeback error handling fixes for v4.14
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZrTy3AAoJEAAOaEEZVoIVaucP/ApBAj2S5wzvlV1u6l8E6ae7
 ZeEEZfcWwzRYlKjZAkTWqj9XvGpDGO5gLq4wsZK2edFAq++/MJF8ZVtN4tdZ1kUZ
 DUvRodtVOrT08Kp9wZXGT7JOFrf6U/6gMcR6p0MuWnHndeKYvlpcFi9NPT4EC9/z
 Zm9V7gtlPdSOha7eaSjUS0+vLERkxqXLBW3Av9QUOBP/lbI3lqIroGKeHDYnVdya
 2P/k5EcRRJMyJP6TqyYxmmJl+UWjJFMLvnlUDBslHnD/u3mIUhw3JLHYBjn5dZRE
 Xjq56IDPoXDUvzlBhtn/Uqyx+/wtwsNsylpmKv6K5G1JfdeuSsPVsCey+A1cqV64
 LpE5896wf9TmnmI9LNyh6vDn925xPSGBiF45UEp5f9aO7jXeY0MaEZ8g+ENqFIDK
 v4gtZdS9FhYHV+/l4qEwYMKrqSbwKEs1r1FT+f4wnABby1ojfdA57ZPlp5PV2Vjp
 szTp88Zkb7cMvZwEnWwxWofcJNmgS7uNahvnQF3IJ4ITsioEkuyYR3K4ZQMaaaV9
 wCp6G0FhXZaK3OI7o9WiDwaO2elp9Hxc8bnqKpiBbHZkY0NLh7/++5VxpeNbTHFy
 AGijQiiKGNNyYqNj93wq9jpVdMNjB0pXrHRxfav8v7MtQ+WfbEoAENF4T7hN7iXn
 UuF6eSWEC5O1UCRUk1A+
 =LLY3
 -----END PGP SIGNATURE-----

Merge tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull writeback error handling updates from Jeff Layton:
 "This pile continues the work from last cycle on better tracking
  writeback errors. In v4.13 we added some basic errseq_t infrastructure
  and converted a few filesystems to use it.

  This set continues refining that infrastructure, adds documentation,
  and converts most of the other filesystems to use it. The main
  exception at this point is the NFS client"

* tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  ecryptfs: convert to file_write_and_wait in ->fsync
  mm: remove optimizations based on i_size in mapping writeback waits
  fs: convert a pile of fsync routines to errseq_t based reporting
  gfs2: convert to errseq_t based writeback error reporting for fsync
  fs: convert sync_file_range to use errseq_t based error-tracking
  mm: add file_fdatawait_range and file_write_and_wait
  fuse: convert to errseq_t based error tracking for fsync
  mm: consolidate dax / non-dax checks for writeback
  Documentation: add some docs for errseq_t
  errseq: rename __errseq_set to errseq_set
2017-09-06 14:11:03 -07:00
Milosz Tanski
3239d83484 fs: support IOCB_NOWAIT in generic_file_buffered_read
Allow generic_file_buffered_read to bail out early instead of waiting for
the page lock or reading a page if IOCB_NOWAIT is specified.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Sage Weil <sage@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:04:22 -04:00
Christoph Hellwig
47c27bc469 fs: pass iocb to do_generic_file_read
And rename it to the more descriptive generic_file_buffered_read while
at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:04:22 -04:00
Linus Torvalds
9c3a815f47 page waitqueue: always add new entries at the end
Commit 3510ca20ec ("Minor page waitqueue cleanups") made the page
queue code always add new waiters to the back of the queue, which helps
upcoming patches to batch the wakeups for some horrid loads where the
wait queues grow to thousands of entries.

However, I forgot about the nasrt add_page_wait_queue() special case
code that is only used by the cachefiles code.  That one still continued
to add the new wait queue entries at the beginning of the list.

Fix it, because any sane batched wakeup will require that we don't
suddenly start getting new entries at the beginning of the list that we
already handled in a previous batch.

[ The current code always does the whole list while holding the lock, so
  wait queue ordering doesn't matter for correctness, but even then it's
  better to add later entries at the end from a fairness standpoint ]

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-28 16:45:40 -07:00
Linus Torvalds
a8b169afbf Avoid page waitqueue race leaving possible page locker waiting
The "lock_page_killable()" function waits for exclusive access to the
page lock bit using the WQ_FLAG_EXCLUSIVE bit in the waitqueue entry
set.

That means that if it gets woken up, other waiters may have been
skipped.

That, in turn, means that if it sees the page being unlocked, it *must*
take that lock and return success, even if a lethal signal is also
pending.

So instead of checking for lethal signals first, we need to check for
them after we've checked the actual bit that we were waiting for.  Even
if that might then delay the killing of the process.

This matches the order of the old "wait_on_bit_lock()" infrastructure
that the page locking used to use (and is still used in a few other
areas).

Note that if we still return an error after having unsuccessfully tried
to acquire the page lock, that is ok: that means that some other thread
was able to get ahead of us and lock the page, and when that other
thread then unlocks the page, the wakeup event will be repeated.  So any
other pending waiters will now get properly woken up.

Fixes: 6290602709 ("mm: add PageWaiters indicating tasks are waiting for a page bit")
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jan Kara <jack@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-27 16:25:09 -07:00
Linus Torvalds
3510ca20ec Minor page waitqueue cleanups
Tim Chen and Kan Liang have been battling a customer load that shows
extremely long page wakeup lists.  The cause seems to be constant NUMA
migration of a hot page that is shared across a lot of threads, but the
actual root cause for the exact behavior has not been found.

Tim has a patch that batches the wait list traversal at wakeup time, so
that we at least don't get long uninterruptible cases where we traverse
and wake up thousands of processes and get nasty latency spikes.  That
is likely 4.14 material, but we're still discussing the page waitqueue
specific parts of it.

In the meantime, I've tried to look at making the page wait queues less
expensive, and failing miserably.  If you have thousands of threads
waiting for the same page, it will be painful.  We'll need to try to
figure out the NUMA balancing issue some day, in addition to avoiding
the excessive spinlock hold times.

That said, having tried to rewrite the page wait queues, I can at least
fix up some of the braindamage in the current situation. In particular:

 (a) we don't want to continue walking the page wait list if the bit
     we're waiting for already got set again (which seems to be one of
     the patterns of the bad load).  That makes no progress and just
     causes pointless cache pollution chasing the pointers.

 (b) we don't want to put the non-locking waiters always on the front of
     the queue, and the locking waiters always on the back.  Not only is
     that unfair, it means that we wake up thousands of reading threads
     that will just end up being blocked by the writer later anyway.

Also add a comment about the layout of 'struct wait_page_key' - there is
an external user of it in the cachefiles code that means that it has to
match the layout of 'struct wait_bit_key' in the two first members.  It
so happens to match, because 'struct page *' and 'unsigned long *' end
up having the same values simply because the page flags are the first
member in struct page.

Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Christopher Lameter <cl@linux.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-27 13:55:12 -07:00
Jeff Layton
ffb959bbdf mm: remove optimizations based on i_size in mapping writeback waits
Marcelo added this i_size based optimization with a patch in 2004
(commitid is from the linux-history tree):

    commit 765dad09b4ac101a32d87af2bb793c3060497d3c
    Author: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
    Date:   Tue Sep 7 17:51:17 2004 -0700

	small wait_on_page_writeback_range() optimization

	filemap_fdatawait() calls wait_on_page_writeback_range() with -1
	as "end" parameter.  This is not needed since we know the EOF
	from the inode.  Use that instead.

There may be races here, particularly with clustered or network
filesystems. It also seems like a bit of a layering violation since
we're operating on an address_space here, not an inode.

Finally, it's also questionable whether this optimization really helps
on workloads that we care about. Should we be optimizing for writeback
vs. truncate races in a codepath where we expect to wait anyway? It
doesn't seem worth the risk.

Remove this optimization from the filemap_fdatawait codepaths. This
means that filemap_fdatawait becomes a trivial wrapper around
filemap_fdatawait_range.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-08-01 08:39:29 -04:00