Commit graph

45208 commits

Author SHA1 Message Date
Mel Gorman
11fb998986 mm: move most file-based accounting to the node
There are now a number of accounting oddities such as mapped file pages
being accounted for on the node while the total number of file pages are
accounted on the zone.  This can be coped with to some extent but it's
confusing so this patch moves the relevant file-based accounted.  Due to
throttling logic in the page allocator for reliable OOM detection, it is
still necessary to track dirty and writeback pages on a per-zone basis.

[mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
  Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman
4b9d0fab71 mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
NR_FILE_PAGES  is the number of        file pages.
NR_FILE_MAPPED is the number of mapped file pages.
NR_ANON_PAGES  is the number of mapped anon pages.

This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
NR_ANON_PAGES for mapped pages.  This patch renames NR_ANON_PAGES so we
have

NR_FILE_PAGES  is the number of        file pages.
NR_FILE_MAPPED is the number of mapped file pages.
NR_ANON_MAPPED is the number of mapped anon pages.

Link: http://lkml.kernel.org/r/1467970510-21195-19-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman
50658e2e04 mm: move page mapped accounting to the node
Reclaim makes decisions based on the number of pages that are mapped but
it's mixing node and zone information.  Account NR_FILE_MAPPED and
NR_ANON_PAGES pages on the node.

Link: http://lkml.kernel.org/r/1467970510-21195-18-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Michal Hocko
44a70adec9 mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
oom_score_adj is shared for the thread groups (via struct signal) but this
is not sufficient to cover processes sharing mm (CLONE_VM without
CLONE_SIGHAND) and so we can easily end up in a situation when some
processes update their oom_score_adj and confuse the oom killer.  In the
worst case some of those processes might hide from the oom killer
altogether via OOM_SCORE_ADJ_MIN while others are eligible.  OOM killer
would then pick up those eligible but won't be allowed to kill others
sharing the same mm so the mm wouldn't release the mm and so the memory.

It would be ideal to have the oom_score_adj per mm_struct because that is
the natural entity OOM killer considers.  But this will not work because
some programs are doing

	vfork()
	set_oom_adj()
	exec()

We can achieve the same though.  oom_score_adj write handler can set the
oom_score_adj for all processes sharing the same mm if the task is not in
the middle of vfork.  As a result all the processes will share the same
oom_score_adj.  The current implementation is rather pessimistic and
checks all the existing processes by default if there is more than 1
holder of the mm but we do not have any reliable way to check for external
users yet.

Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Michal Hocko
1d5f0acbc6 proc, oom_adj: extract oom_score_adj setting into a helper
Currently we have two proc interfaces to set oom_score_adj.  The legacy
/proc/<pid>/oom_adj and /proc/<pid>/oom_score_adj which both have their
specific handlers.  Big part of the logic is duplicated so extract the
common code into __set_oom_adj helper.  Legacy knob still expects some
details slightly different so make sure those are handled same way - e.g.
the legacy mode ignores oom_score_adj_min and it warns about the usage.

This patch shouldn't introduce any functional changes.

Link: http://lkml.kernel.org/r/1466426628-15074-4-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Michal Hocko
f913da596a proc, oom: drop bogus sighand lock
Oleg has pointed out that can simplify both oom_adj_{read,write} and
oom_score_adj_{read,write} even further and drop the sighand lock.  The
main purpose of the lock was to protect p->signal from going away but this
will not happen since ea6d290ca3 ("signals: make task_struct->signal
immutable/refcountable").

The other role of the lock was to synchronize different writers,
especially those with CAP_SYS_RESOURCE.  Introduce a mutex for this
purpose.  Later patches will need this lock anyway.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/1466426628-15074-3-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Michal Hocko
d49fbf766d proc, oom: drop bogus task_lock and mm check
Series "Handle oom bypass more gracefully", V5

The following 10 patches should put some order to very rare cases of mm
shared between processes and make the paths which bypass the oom killer
oom reapable and therefore much more reliable finally.  Even though mm
shared outside of thread group is rare (either vforked tasks for a short
period, use_mm by kernel threads or exotic thread model of
clone(CLONE_VM) without CLONE_SIGHAND) it is better to cover them.  Not
only it makes the current oom killer logic quite hard to follow and
reason about it can lead to weird corner cases.  E.g.  it is possible to
select an oom victim which shares the mm with unkillable process or
bypass the oom killer even when other processes sharing the mm are still
alive and other weird cases.

Patch 1 drops bogus task_lock and mm check from oom_{score_}adj_write.
This can be considered a bug fix with a low impact as nobody has noticed
for years.

Patch 2 drops sighand lock because it is not needed anymore as pointed
by Oleg.

Patch 3 is a clean up of oom_score_adj handling and a preparatory work
for later patches.

Patch 4 enforces oom_adj_score to be consistent between processes
sharing the mm to behave consistently with the regular thread groups.
This can be considered a user visible behavior change because one thread
group updating oom_score_adj will affect others which share the same mm
via clone(CLONE_VM).  I argue that this should be acceptable because we
already have the same behavior for threads in the same thread group and
sharing the mm without signal struct is just a different model of
threading.  This is probably the most controversial part of the series,
I would like to find some consensus here.  There were some suggestions
to hook some counter/oom_score_adj into the mm_struct but I feel that
this is not necessary right now and we can rely on proc handler +
oom_kill_process to DTRT.  I can be convinced otherwise but I strongly
think that whatever we do the userspace has to have a way to see the
current oom priority as consistently as possible.

Patch 5 makes sure that no vforked task is selected if it is sharing the
mm with oom unkillable task.

Patch 6 ensures that all user tasks sharing the mm are killed which in
turn makes sure that all oom victims are oom reapable.

Patch 7 guarantees that task_will_free_mem will always imply reapable
bypass of the oom killer.

Patch 8 is new in this version and it addresses an issue pointed out by
0-day OOM report where an oom victim was reaped several times.

Patch 9 puts an upper bound on how many times oom_reaper tries to reap a
task and hides it from the oom killer to move on when no progress can be
made.  This will give an upper bound to how long an oom_reapable task
can block the oom killer from selecting another victim if the oom_reaper
is not able to reap the victim.

Patch 10 tries to plug the (hopefully) last hole when we can still lock
up when the oom victim is shared with oom unkillable tasks (kthreads and
global init).  We just try to be best effort in that case and rather
fallback to kill something else than risk a lockup.

This patch (of 10):

Both oom_adj_write and oom_score_adj_write are using task_lock, check for
task->mm and fail if it is NULL.  This is not needed because the
oom_score_adj is per signal struct so we do not need mm at all.  The code
has been introduced by 3d5992d2ac ("oom: add per-mm oom disable count")
but we do not do per-mm oom disable since c9f01245b6 ("oom: remove
oom_disable_count").

The task->mm check is even not correct because the current thread might
have exited but the thread group might be still alive - e.g.  thread group
leader would lead that echo $VAL > /proc/pid/oom_score_adj would always
fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj would
succeed.  This is unexpected at best.

Remove the lock along with the check to fix the unexpected behavior and
also because there is not real need for the lock in the first place.

Link: http://lkml.kernel.org/r/1466426628-15074-2-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Linus Torvalds
468fc7ed55 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Unified UDP encapsulation offload methods for drivers, from
    Alexander Duyck.

 2) Make DSA binding more sane, from Andrew Lunn.

 3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.

 4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.

 5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
    packets as soon as the device sees them, with the option to mirror
    the packet on TX via the same interface.  From Brenden Blanco and
    others.

 6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.

 7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.

 8) Simplify netlink conntrack entry layout, from Florian Westphal.

 9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
    Schimmel, Yotam Gigi, and Jiri Pirko.

10) Add SKB array infrastructure and convert tun and macvtap over to it.
    From Michael S Tsirkin and Jason Wang.

11) Support qdisc packet injection in pktgen, from John Fastabend.

12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.

13) Add NV congestion control support to TCP, from Lawrence Brakmo.

14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.

15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.

16) Support MPLS over IPV4, from Simon Horman.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
  xgene: Fix build warning with ACPI disabled.
  be2net: perform temperature query in adapter regardless of its interface state
  l2tp: Correctly return -EBADF from pppol2tp_getname.
  net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
  net: ipmr/ip6mr: update lastuse on entry change
  macsec: ensure rx_sa is set when validation is disabled
  tipc: dump monitor attributes
  tipc: add a function to get the bearer name
  tipc: get monitor threshold for the cluster
  tipc: make cluster size threshold for monitoring configurable
  tipc: introduce constants for tipc address validation
  net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
  MAINTAINERS: xgene: Add driver and documentation path
  Documentation: dtb: xgene: Add MDIO node
  dtb: xgene: Add MDIO node
  drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
  drivers: net: xgene: Use exported functions
  drivers: net: xgene: Enable MDIO driver
  drivers: net: xgene: Add backward compatibility
  drivers: net: phy: xgene: Add MDIO driver
  ...
2016-07-27 12:03:20 -07:00
Linus Torvalds
ba4f67899f dlm for 4.8
This set includes two trivial changes, one to
 use kmemdup and another to control the log level
 of recovery messages.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXmM4vAAoJEDgbc8f8gGmqZiwP/jHjVeSBqt3OML2iEuL5CN7E
 0GGjRAsRaUTx8GgeAJfC/HlOlTCD4FnQfQmZ0SZ6bPluhxGJGhxX8ujMsdhSB7KS
 1mxfH9tYYhm/6WyTsLbmhdnt9zxU7uqHi1K2Zd6zIxf541TFXGd2CHXu4gOdQCAd
 LIHory3yhn8QTmHs2zWObuNcTfHRHl1Nk6cZ4PCwfNhdFCxwILycwcTRr/8bd2XJ
 AlqueCsEoMVrYST7HB99ih/CE6rqU/DFkN81mMa2RQWy9PiicWic7uggZrTr4i1+
 0oyc4C+sXBKRYUdtbKneEB4/jobUSR5YRkkEpHWOv8wimgY4xAVHsBJGhG9c3nPq
 cgaSblDwI5Mbz3Bz0tUMwzgrX7CmgCaOLKUlep5CMEkdFH0ROEwBiBibGXeQGloI
 UW2WmCgnLMw1PVAcC5oZr9FvYq0OochK14xwb8ksa7E/ry1bcRh0mXD7prgeOS3B
 VyJxu5e1cAm8tUtEk0ZIp8sAmLMUheBpl+YLl+bU5yG2VvfNtMdsFuZxZdtcmsgn
 5rXI42RjtmX8i1SBm15DQQ7/28xzDWfX4xF6qYhzmFUiOmfqyIQZ2/ShJ/wi7tA4
 zrYm1YPh+LkuBn7kbdyerSOMI9WYeGhSMDXIuZJ+j79ucQhErLyLsNnBRiv8A5SJ
 Nc4e+nJxsZT8AOCTsk35
 =qqak
 -----END PGP SIGNATURE-----

Merge tag 'dlm-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:
 "This set includes two trivial changes, one to use kmemdup and another
  to control the log level of recovery messages"

* tag 'dlm-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: Use kmemdup instead of kmalloc and memcpy
  dlm: add log_info config option
2016-07-27 10:47:24 -07:00
Linus Torvalds
4fc29c1aa3 The major change in this version is mitigating cpu overheads on write paths by
replacing redundant inode page updates with mark_inode_dirty calls. And we tried
 to reduce lock contentions as well to improve filesystem scalability.
 Other feature is setting F2FS automatically when detecting host-managed SMR.
 
 = Enhancement =
  - ioctl to move a range of data between files
  - inject orphan inode errors
  - avoid flush commands congestion
  - support lazytime
 
 = Bug fixes =
  - return proper results for some dentry operations
  - fix deadlock in add_link failure
  - disable extent_cache for fcollapse/finsert
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXmDJFAAoJEEAUqH6CSFDSJeYP/0ru8+5/ui5VTCdNPQB9KxYD
 DIUaDGpeoLvmn3ZdrMEdyNr6kWbgjCE9JjOGPQ7l1/apErOGVPyaBwflKcCDwloU
 pAlEqVM1Q9j4qH4i9SWTlvPtsHBHB7G7YSe3vDB9fJGSTqumubIlnaBm+Wfjx31U
 p53WcPn9LpOyzfmvZf2tOHmvZ7bWLkE/a07x9kPC6XHUFb9C17jLRFFGeuhZQHv1
 Yo7HgokBnPExa8TnEILYyX/x+eecFS/1Cp/cN0STsebSu8pStTHTcAP7qEpKQB88
 Cc51Lf+d5gFeydxKDFxwdH3VWOGIr9Ppako+lHW83gJcHP0zw8zdxULab+HJMa4n
 MOByRRiafwu1sL0dl7TCfsYNIHdEnXhWbhcRhMVZbb5C2Q6+Htuac8ZrKSOWExNN
 DUqRkzeTib9u+cHxUTFFPgOGdUjDLmg3XHU7mvb+2hViluVjIImC4tqD5XPpv7vt
 WnaDJxLCGD/6DF2yhiVY9NysuxInLTNFFCF06LworZ4L24hlg5TvN0UeUNRO9954
 ux6f+lSORCzV3TmrsHP5vwjSAW26FviPXV1q1HHJeTpWKMlhsZtHmOAJOtZKKmxP
 WFnHT0aiWF+sQf4qfxVQL+lLqtgRKJAI9zqGRyfDJWJp5aXdRuVsZs9pWNQF7lCo
 5gVnCYk3ULjXG3b23j2S
 =tKTR
 -----END PGP SIGNATURE-----

Merge tag 'for-f2fs-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "The major change in this version is mitigating cpu overheads on write
  paths by replacing redundant inode page updates with mark_inode_dirty
  calls.  And we tried to reduce lock contentions as well to improve
  filesystem scalability.  Other feature is setting F2FS automatically
  when detecting host-managed SMR.

  Enhancements:
   - ioctl to move a range of data between files
   - inject orphan inode errors
   - avoid flush commands congestion
   - support lazytime

  Bug fixes:
   - return proper results for some dentry operations
   - fix deadlock in add_link failure
   - disable extent_cache for fcollapse/finsert"

* tag 'for-f2fs-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (68 commits)
  f2fs: clean up coding style and redundancy
  f2fs: get victim segment again after new cp
  f2fs: handle error case with f2fs_bug_on
  f2fs: avoid data race when deciding checkpoin in f2fs_sync_file
  f2fs: support an ioctl to move a range of data blocks
  f2fs: fix to report error number of f2fs_find_entry
  f2fs: avoid memory allocation failure due to a long length
  f2fs: reset default idle interval value
  f2fs: use blk_plug in all the possible paths
  f2fs: fix to avoid data update racing between GC and DIO
  f2fs: add maximum prefree segments
  f2fs: disable extent_cache for fcollapse/finsert inodes
  f2fs: refactor __exchange_data_block for speed up
  f2fs: fix ERR_PTR returned by bio
  f2fs: avoid mark_inode_dirty
  f2fs: move i_size_write in f2fs_write_end
  f2fs: fix to avoid redundant discard during fstrim
  f2fs: avoid mismatching block range for discard
  f2fs: fix incorrect f_bfree calculation in ->statfs
  f2fs: use percpu_rw_semaphore
  ...
2016-07-27 10:36:31 -07:00
Linus Torvalds
0e6acf0204 xfs: update for 4.8-rc1
Changes in this update:
 o generic iomap based IO path infrastructure
 o generic iomap based fiemap implementation
 o xfs iomap based Io path implementation
 o buffer error handling fixes
 o tracking of in flight buffer IO for unmount serialisation
 o direct IO and DAX io path separation and simplification
 o shortform directory format definition changes for wider platform compatibility
 o various buffer cache fixes
 o cleanups in preparation for rmap merge
 o error injection cleanups and fixes
 o log item format buffer memory allocation restructuring to prevent rare OOM
   reclaim deadlocks
 o sparse inode chunks are now fully supported.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXmA5XAAoJEK3oKUf0dfodCc0QAKY5Jlfw5HwLria+Ad87HCcM
 Zi/LGMMC3CPh+vkbqsmDnLKHYjXRwi3HamBoXdufiE8E3UtOjp/sV98/fCw+zwhe
 tHDLmdAx23RLTn7gUhcsIXydKeXh0+HlRxPa4eBAlmnsJ3nGgrKrKQLgDT7Gjlum
 nPfRSTYjzm5gs2dpUTYhMV7MplenDW9GFz2uBMct6N9kYQ9m225I99fd/4nb/L7R
 o/8UocsK7iREUXP6decDoN9uIAzE2mYR720EL+Txy09CTYy+luNyGoNXOsQtxT5O
 plyoPZbzIIDvC44bvp6bZX96Udm7tAeTloieInCZG13I2zJy9gmTmLqkZ3M2at12
 kOyeAMSBOWQYSa3uh++FsEP+JGtBTlZXf+4DAYf+U08s8tMVE/61/RZrtJZF4OjW
 hyumRBD6zqZ9Y6Qtji2HaA3l9IGxOC2k4URw9JZdDDyMoRTQvawN1QWNAeZINXiv
 9ywqTruVsfQnoGDC1Gk1OEfQpubNztTAkEPqVM7ez5dkwOdwuOZXcZPL1Ltvb4Bt
 PLaWKLIYFYZKrM5kqgQlTERspSQA99++z8H9a21wFezfetaBby28fIqwMMfQAiSw
 nCq95WshJPwenogMtWjNfOgs/fqOBKdPdLFw0H6Jpmjwna2KpuFIZiTnwu25vvjz
 dHh4DVSuMTq1pBkXEU7B
 =vcSd
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs updates from Dave Chinner:
 "The major addition is the new iomap based block mapping
  infrastructure.  We've been kicking this about locally for years, but
  there are other filesystems want to use it too (e.g. gfs2).  Now it
  is fully working, reviewed and ready for merge and be used by other
  filesystems.

  There are a lot of other fixes and cleanups in the tree, but those are
  XFS internal things and none are of the scale or visibility of the
  iomap changes.  See below for details.

  I am likely to send another pull request next week - we're just about
  ready to merge some new functionality (on disk block->owner reverse
  mapping infrastructure), but that's a huge chunk of code (74 files
  changed, 7283 insertions(+), 1114 deletions(-)) so I'm keeping that
  separate to all the "normal" pull request changes so they don't get
  lost in the noise.

  Summary of changes in this update:
   - generic iomap based IO path infrastructure
   - generic iomap based fiemap implementation
   - xfs iomap based Io path implementation
   - buffer error handling fixes
   - tracking of in flight buffer IO for unmount serialisation
   - direct IO and DAX io path separation and simplification
   - shortform directory format definition changes for wider platform
     compatibility
   - various buffer cache fixes
   - cleanups in preparation for rmap merge
   - error injection cleanups and fixes
   - log item format buffer memory allocation restructuring to prevent
     rare OOM reclaim deadlocks
   - sparse inode chunks are now fully supported"

* tag 'xfs-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (53 commits)
  xfs: remove EXPERIMENTAL tag from sparse inode feature
  xfs: bufferhead chains are invalid after end_page_writeback
  xfs: allocate log vector buffers outside CIL context lock
  libxfs: directory node splitting does not have an extra block
  xfs: remove dax code from object file when disabled
  xfs: skip dirty pages in ->releasepage()
  xfs: remove __arch_pack
  xfs: kill xfs_dir2_inou_t
  xfs: kill xfs_dir2_sf_off_t
  xfs: split direct I/O and DAX path
  xfs: direct calls in the direct I/O path
  xfs: stop using generic_file_read_iter for direct I/O
  xfs: split xfs_file_read_iter into buffered and direct I/O helpers
  xfs: remove s_maxbytes enforcement in xfs_file_read_iter
  xfs: kill ioflags
  xfs: don't pass ioflags around in the ioctl path
  xfs: track and serialize in-flight async buffers against unmount
  xfs: exclude never-released buffers from buftarg I/O accounting
  xfs: don't reset b_retries to 0 on every failure
  xfs: remove extraneous buffer flag changes
  ...
2016-07-27 09:53:35 -07:00
Linus Torvalds
0e06f5c0de Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc bits

 - ocfs2

 - most(?) of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits)
  thp: fix comments of __pmd_trans_huge_lock()
  cgroup: remove unnecessary 0 check from css_from_id()
  cgroup: fix idr leak for the first cgroup root
  mm: memcontrol: fix documentation for compound parameter
  mm: memcontrol: remove BUG_ON in uncharge_list
  mm: fix build warnings in <linux/compaction.h>
  mm, thp: convert from optimistic swapin collapsing to conservative
  mm, thp: fix comment inconsistency for swapin readahead functions
  thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
  shmem: split huge pages beyond i_size under memory pressure
  thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
  khugepaged: add support of collapse for tmpfs/shmem pages
  shmem: make shmem_inode_info::lock irq-safe
  khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
  thp: extract khugepaged from mm/huge_memory.c
  shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
  shmem: add huge pages support
  shmem: get_unmapped_area align huge page
  shmem: prepare huge= mount option and sysfs knob
  mm, rmap: account shmem thp pages
  ...
2016-07-26 19:55:54 -07:00
Linus Torvalds
9c1958fc32 media updates for v4.8-rc1
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXlfJvAAoJEAhfPr2O5OEVtLUP/RpCQ+W3YVryIdmLkdmYXoY7
 m2rXtUh7GmzBjaBkFzbRCGZtgROF7zl0e1R3nm4tLbCV4Becw8HO7YiMjqFJm9xr
 b6IngIyshsHf60Eii3RpLqUFvYrc/DDIMeYf8miwj/PvFAfI2BV9apraexJlpUuI
 wdyi28cfBHq4WYhubaXKoAyBQ8YRA/t8KNRAkDlifaOaMbSAxWHlmqoSmJWeQx73
 KHkSvbRPu4Hjo3R6q/ab8VhqmXeSnbqnQB9lgnxz7AmAZGhOlMYeAhV/K2ZwbBH8
 swv36RmJVO59Ov+vNR4p7GGGDL3+qk8JLj4LNVVfOcW0A+t7WrPQEmrL6VsyaZAy
 /+r4NEOcQN6Z5nFwbr3E0tYJ2Y5jFHOvsBfKd3EEGwty+hCl634akgb0vqtg06cg
 E2KG+XW983RBadVwEBnEudxJb0fWPWHGhXEqRrwOD+718FNmTqYM6dEvTEyxRup8
 EtCLj+eQQ4LmAyZxWyE8A+keKoMFQlHqk9LN9vQ7t7Wxq9mQ+V2l12T/lN4VhdTq
 4QZ4mrCMCGEvNcNzgSg6R/9lVb6RHDtMXZ3htbB/w+5xET/IKIANYyg1Hr7ahtdh
 rTW/4q6n3jtsu6tp5poteFvPzZKAblbrj2EptVzZYkonQ5BeAUisFTtneUL10Jmj
 EUf/sH0fqoOA0VvV6Tu+
 =mrOW
 -----END PGP SIGNATURE-----

Merge tag 'media/v4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media updates from Mauro Carvalho Chehab:

 - new framework support for HDMI CEC and remote control support

 - new encoding codec driver for Mediatek SoC

 - new frontend driver: helene tuner

 - added support for NetUp almost universal devices, with supports
   DVB-C/S/S2/T/T2 and ISDB-T

 - the mn88472 frontend driver got promoted from staging

 - a new driver for RCar video input

 - some soc_camera legacy drivers got removed: timb, omap1, mx2, mx3

 - lots of driver cleanups, improvements and fixups

* tag 'media/v4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (377 commits)
  [media] cec: always check all_device_types and features
  [media] cec: poll should check if there is room in the tx queue
  [media] vivid: support monitor all mode
  [media] cec: fix test for unconfigured adapter in main message loop
  [media] cec: limit the size of the transmit queue
  [media] cec: zero unused msg part after msg->len
  [media] cec: don't set fh to NULL in CEC_TRANSMIT
  [media] cec: clear all status fields before transmit and always fill in sequence
  [media] cec: CEC_RECEIVE overwrote the timeout field
  [media] cxd2841er: Reading SNR for DVB-C added
  [media] cxd2841er: Reading BER and UCB for DVB-C added
  [media] cxd2841er: fix switch-case for DVB-C
  [media] cxd2841er: fix signal strength scale for ISDB-T
  [media] cxd2841er: adjust the dB scale for DVB-C
  [media] cxd2841er: provide signal strength for DVB-C
  [media] cxd2841er: fix BER report via DVBv5 stats API
  [media] mb86a20s: apply mask to val after checking for read failure
  [media] airspy: fix error logic during device register
  [media] s5p-cec/TODO: add TODO item
  [media] cec/TODO: drop comment about sphinx documentation
  ...
2016-07-26 18:59:59 -07:00
Linus Torvalds
1b3fc0bef8 pstore subsystem updates for v4.8
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJXloQYAAoJEIly9N/cbcAmGC0QAIpyoiqEuDiJq/XpRg1ux+PC
 Vyr15Pub9yQYwcrWffMH1Zr0GhlFmXb1iP9rp36zdtMhjBEfq7wegvblLVMlzl6G
 7nYt8hJjDh/h8iw1lElgDL2kwUbTym43HoczJNvY/lOmFuUMK8AoDIRYjFTLAKfQ
 S4KA9MFJe3kDh4OUoQVfQNrC2VReLD4uvXk4EUF0wDYoqjVKyU3WBHOMgEmggKTR
 cb+fwhg3Lj4cuMMtZqy8wCqZ/hqhaH8giHC9YbIZQyre3ylncH9xUZyfiqS6nQGc
 eLc03qxqDNsmZvcY6cJgXldLQ3tXM4o96Moakzn2n4sQcW9vh/3oZzDPd7gC8Ei1
 GfIXmRBXFhj5JaeHNGJxL6oCywK+JaqxG8nqD7cEcXTzJiHzjn5kKKSFlr3GmI7w
 47htXv9t07SMgQW0IlBws5yApfeB62dQXmhZc1kMtbonhGdAZCCUg2Nrv34VxrjX
 Dp+LCmD5bg/fBrnAt8f+IIQEd3pElngay+SmSEB9XFUejf3pKw8SvcoDbmE3LD7M
 zGh5bEkptHll2GMInVAt4b4tiC44e7u+0H1Rsi/ttA1cktXZ9hOtFewhXNvfOS2I
 hAMGSngOdpzR5v9Mof5hJAWrr1CLkoh757UoYMsb8u2V9aQw7oVZ5JRMzjZBU6iC
 qXqHm4h5P1bpAeit3DLF
 =fiRI
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore subsystem updates from Kees Cook:
 "This expands the supported compressors, fixes some bugs, and finally
  adds DT bindings"

* tag 'pstore-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore/ram: add Device Tree bindings
  efi-pstore: implement efivars_pstore_exit()
  pstore: drop file opened reference count
  pstore: add lzo/lz4 compression support
  pstore: Cleanup pstore_dump()
  pstore: Enable compression on normal path (again)
  ramoops: Only unregister when registered
2016-07-26 18:48:23 -07:00
Linus Torvalds
d31dcd9247 Orangefs cleanups and enablement of O_DIRECT in open.
Cleanups:
  - remove some unused defines, and also some obfuscatory ones.
  - remove a redundant xattr handler.
  - Remove useless xattr prefix arguments.
  - Be more picky about uid and gid handling WRT namespaces.
    Our use of current_user_ns() instead of init_user_ns left
    open the possibility that users could spoof their uids
    or gids when the server was running in a different namespace
    in "default security" mode.
  - Allow open(2) to succeed with O_DIRECT.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJXllqWAAoJEM9EDqnrzg2+Hk0P/3CWdOzUa59zhDn63SD+/SIg
 VMP3xjVLD8FMvIQKQB+wS+WQgeJln7DJET2dxqDLCmcqpC31EjJOSbswALdYH/tC
 Gbm9Sx2hJF07Efr1H6IxwDO38ZW6UTcMpeDBa/I2V1v8Q9quMaViE3wxcK6RqNTe
 sPhGKjnpqG/b2zia7/tFiP0qem2KbjQguNT9vZIo5OYbFUzmh5AzQL/pyqd/5lz6
 +pKxRl6dfEiAmvo0GsPF8ZZgrITs6oW7/Ul2cPu4Zs+YhcTQP7KEotuYdb3c8QLj
 py6NPjCjDJtAKg2yJ0b695sCe4dzOTwaV9hAalxOoOmUUGpGl8tKYCPDSNd3Ugs4
 s13DlEwSsFMtt4FpkKT5m5yjr83pMom+uWkrzsQ+uypgNvgDMtCSmaC9uJ2531jp
 VMpfc2EW8NhuQj1cn36dXKQRyWFC7+cQ3BHG10UVw93y0X18lOUIysKYW/NiQ7C/
 fYUz5TXPCIrN8kvso2PHF3wFL9mf+8pLEEEocg7KKb5lgwhQ/FvBlhKOkQBZPNAB
 Z2y+GukvM8OtTka5/I5wsW1a2xqziy0Z3nW79LtUJ1MOgEFsyXCRaxNIMfg4QXty
 yn524bbY7XBWr8pPqG4jb1FqSQ+qTgILfimEZ0+8rbj2bfoSqfi9yaYHyZ1YJejv
 YCnss1TvZ9Uf7/juLh2e
 =tW7d
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.8-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs updates from Mike Mashall:
 "Orangefs cleanups and enablement of O_DIRECT in open.

  Cleanups:

   - remove some unused defines, and also some obfuscatory ones.

   - remove a redundant xattr handler.

   - Remove useless xattr prefix arguments.

   - Be more picky about uid and gid handling WRT namespaces.

     Our use of current_user_ns() instead of init_user_ns left open the
     possibility that users could spoof their uids or gids when the
     server was running in a different namespace in "default security"
     mode.

   - Allow open(2) to succeed with O_DIRECT"

* tag 'for-linus-4.8-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: fix namespace handling
  Orangefs: allow O_DIRECT in open
  orangefs: Remove useless xattr prefix arguments
  orangefs: Remove redundant "trusted." xattr handler
  orangefs: Remove useless defines
2016-07-26 18:42:18 -07:00
Linus Torvalds
396d10993f The major change this cycle is deleting ext4's copy of the file system
encryption code and switching things over to using the copies in
 fs/crypto.  I've updated the MAINTAINERS file to add an entry for
 fs/crypto listing Jaeguk Kim and myself as the maintainers.
 
 There are also a number of bug fixes, most notably for some problems
 found by American Fuzzy Lop (AFL) courtesy of Vegard Nossum.  Also
 fixed is a writeback deadlock detected by generic/130, and some
 potential races in the metadata checksum code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJXlbP9AAoJEPL5WVaVDYGjGxgIAJ9YIqme//yix63oHYLhDNea
 lY/TLqZrb9/TdDRvGyZa3jYaKaIejL53eEQS9nhEB/JI0sEiDpHmOrDOxdj8Hlsw
 fm7nJyh1u4vFKPyklCbIvLAje1vl8X/6OvqQiwh45gIxbbsFftaBWtccW+UtEkIP
 Fx65Vk7RehJ/sNrM0cRrwB79YAmDS8P6BPyzdMRk+vO/uFqyq7Auc+pkd+bTlw/m
 TDAEIunlk0Ovjx75ru1zaemL1JJx5ffehrJmGCcSUPHVbMObOEKIrlV50gAAKVhO
 qbZAri3mhDvyspSLuS/73L9skeCiWFLhvojCBGu4t2aa3JJolmItO7IpKi4HdRU=
 =bxGK
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "The major change this cycle is deleting ext4's copy of the file system
  encryption code and switching things over to using the copies in
  fs/crypto.  I've updated the MAINTAINERS file to add an entry for
  fs/crypto listing Jaeguk Kim and myself as the maintainers.

  There are also a number of bug fixes, most notably for some problems
  found by American Fuzzy Lop (AFL) courtesy of Vegard Nossum.  Also
  fixed is a writeback deadlock detected by generic/130, and some
  potential races in the metadata checksum code"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits)
  ext4: verify extent header depth
  ext4: short-cut orphan cleanup on error
  ext4: fix reference counting bug on block allocation error
  MAINTAINRES: fs-crypto maintainers update
  ext4 crypto: migrate into vfs's crypto engine
  ext2: fix filesystem deadlock while reading corrupted xattr block
  ext4: fix project quota accounting without quota limits enabled
  ext4: validate s_reserved_gdt_blocks on mount
  ext4: remove unused page_idx
  ext4: don't call ext4_should_journal_data() on the journal inode
  ext4: Fix WARN_ON_ONCE in ext4_commit_super()
  ext4: fix deadlock during page writeback
  ext4: correct error value of function verifying dx checksum
  ext4: avoid modifying checksum fields directly during checksum verification
  ext4: check for extents that wrap around
  jbd2: make journal y2038 safe
  jbd2: track more dependencies on transaction commit
  jbd2: move lockdep tracking to journal_s
  jbd2: move lockdep instrumentation for jbd2 handles
  ext4: respect the nobarrier mount option in nojournal mode
  ...
2016-07-26 18:35:55 -07:00
Kirill A. Shutemov
65c453778a mm, rmap: account shmem thp pages
Let's add ShmemHugePages and ShmemPmdMapped fields into meminfo and
smaps.  It indicates how many times we allocate and map shmem THP.

NR_ANON_TRANSPARENT_HUGEPAGES is renamed to NR_ANON_THPS.

Link: http://lkml.kernel.org/r/1466021202-61880-27-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov
bae473a423 mm: introduce fault_env
The idea borrowed from Peter's patch from patchset on speculative page
faults[1]:

Instead of passing around the endless list of function arguments,
replace the lot with a single structure so we can change context without
endless function signature changes.

The changes are mostly mechanical with exception of faultaround code:
filemap_map_pages() got reworked a bit.

This patch is preparation for the next one.

[1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org

Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Michal Hocko
8a5c743e30 mm, memcg: use consistent gfp flags during readahead
Vladimir has noticed that we might declare memcg oom even during
readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
restriction) while __do_page_cache_readahead uses
page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from
OOMs.  This gfp mask discrepancy is really unfortunate and easily
fixable.  Drop page_cache_alloc_readahead() which only has one user and
outsource the gfp_mask logic into readahead_gfp_mask and propagate this
mask from __do_page_cache_readahead down to read_pages.

This alone would have only very limited impact as most filesystems are
implementing ->readpages and the common implementation mpage_readpages
does GFP_KERNEL (with mapping_gfp restriction) again.  We can tell it to
use readahead_gfp_mask instead as this function is called only during
readahead as well.  The same applies to read_cache_pages.

ext4 has its own ext4_mpage_readpages but the path which has pages !=
NULL can use the same gfp mask.  Btrfs, cifs, f2fs and orangefs are
doing a very similar pattern to mpage_readpages so the same can be
applied to them as well.

[akpm@linux-foundation.org: coding-style fixes]
[mhocko@suse.com: restrict gfp mask in mpage_alloc]
  Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Chris Mason <clm@fb.com>
Cc: Steve French <sfrench@samba.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Changman Lee <cm224.lee@samsung.com>
Cc: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Vladimir Davydov
d86133bd39 pipe: account to kmemcg
Pipes can consume a significant amount of system memory, hence they
should be accounted to kmemcg.

This patch marks pipe_inode_info and anonymous pipe buffer page
allocations as __GFP_ACCOUNT so that they would be charged to kmemcg.
Note, since a pipe buffer page can be "stolen" and get reused for other
purposes, including mapping to userspace, we clear PageKmemcg thus
resetting page->_mapcount and uncharge it in anon_pipe_buf_steal, which
is introduced by this patch.

A note regarding anon_pipe_buf_steal implementation.  We allow to steal
the page if its ref count equals 1.  It looks racy, but it is correct
for anonymous pipe buffer pages, because:

 - We lock out all other pipe users, because ->steal is called with
   pipe_lock held, so the page can't be spliced to another pipe from
   under us.

 - The page is not on LRU and it never was.

 - Thus a parallel thread can access it only by PFN. Although this is
   quite possible (e.g. see page_idle_get_page and balloon_page_isolate)
   this is not dangerous, because all such functions do is increase page
   ref count, check if the page is the one they are looking for, and
   decrease ref count if it isn't. Since our page is clean except for
   PageKmemcg mark, which doesn't conflict with other _mapcount users,
   the worst that can happen is we see page_count > 2 due to a transient
   ref, in which case we false-positively abort ->steal, which is still
   fine, because ->steal is not guaranteed to succeed.

Link: http://lkml.kernel.org/r/20160527150313.GD26059@esperanza
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Brian Foster
9a46b04f16 fs/fs-writeback.c: inode writeback list tracking tracepoints
The per-sb inode writeback list tracks inodes currently under writeback
to facilitate efficient sync processing.  In particular, it ensures that
sync only needs to walk through a list of inodes that were cleaned by
the sync.

Add a couple tracepoints to help identify when inodes are added/removed
to and from the writeback lists.  Piggyback off of the writeback
lazytime tracepoint template as it already tracks the relevant inode
information.

Link: http://lkml.kernel.org/r/1466594593-6757-3-git-send-email-bfoster@redhat.com
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
cc: Josef Bacik <jbacik@fb.com>
Cc: Holger Hoffstätte <holger.hoffstaette@applied-asynchrony.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Dave Chinner
6c60d2b574 fs/fs-writeback.c: add a new writeback list for sync
wait_sb_inodes() currently does a walk of all inodes in the filesystem
to find dirty one to wait on during sync.  This is highly inefficient
and wastes a lot of CPU when there are lots of clean cached inodes that
we don't need to wait on.

To avoid this "all inode" walk, we need to track inodes that are
currently under writeback that we need to wait for.  We do this by
adding inodes to a writeback list on the sb when the mapping is first
tagged as having pages under writeback.  wait_sb_inodes() can then walk
this list of "inodes under IO" and wait specifically just for the inodes
that the current sync(2) needs to wait for.

Define a couple helpers to add/remove an inode from the writeback list
and call them when the overall mapping is tagged for or cleared from
writeback.  Update wait_sb_inodes() to walk only the inodes under
writeback due to the sync.

With this change, filesystem sync times are significantly reduced for
fs' with largely populated inode caches and otherwise no other work to
do.  For example, on a 16xcpu 2GHz x86-64 server, 10TB XFS filesystem
with a ~10m entry inode cache, sync times are reduced from ~7.3s to less
than 0.1s when the filesystem is fully clean.

Link: http://lkml.kernel.org/r/1466594593-6757-2-git-send-email-bfoster@redhat.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Holger Hoffstätte <holger.hoffstaette@applied-asynchrony.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
piaojun
7d65b27448 ocfs2/cluster: clean up unnecessary assignment for 'ret'
Clean up unnecessary assignment for 'ret'.

Link: http://lkml.kernel.org/r/578C61F6.4080403@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Joseph Qi
e81f1c5c4a ocfs2: remove obscure BUG_ON in dlmglue
These BUG_ON(!inode) are obscure because we have already used inode to
get osb.  And actually we can guarantee here inode is valid in the
context.  So we can safely remove them.

Link: http://lkml.kernel.org/r/5776336A.6030104@huawei.com
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Eric Ren <zren@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Joseph Qi
698d44b43a ocfs2: cleanup implemented prototypes
Several prototypes in inode.h are just defined but not actually
implemented and used, so remove them.

Link: http://lkml.kernel.org/r/57763787.4020706@huawei.com
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Joseph Qi
8ec7b17a66 ocfs2/dlm: fix memory leak of dlm_debug_ctxt
dlm_debug_ctxt->debug_refcnt is initialized to 1 and then increased to 2
by dlm_debug_get in dlm_debug_init.  But dlm_debug_put is called only
once in dlm_debug_shutdown during unregister dlm, which leads to
dlm_debug_ctxt leaked.

Link: http://lkml.kernel.org/r/577BB755.4030900@huawei.com
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Jiufei Xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Joseph Qi
a8f24f1b3f ocfs2: cleanup unneeded goto in ocfs2_create_new_inode_locks
The last goto is unneeded, so remove it.

Link: http://lkml.kernel.org/r/576213D3.6080002@huawei.com
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Junxiao Bi
0b492f68bb ocfs2: improve recovery performance
Journal replay will be run when performing recovery for a dead node.  To
avoid the stale cache impact, all blocks of dead node's journal inode
were reloaded from disk.  This hurts the performance.  Check whether one
block is cached before reloading it can improve performance a lot.  In
my test env, the time doing recovery was improved from 120s to 1s.

[akpm@linux-foundation.org: clean up the for loop p_blkno handling]
Link: http://lkml.kernel.org/r/1466155682-24656-1-git-send-email-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: "Gang He" <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Eric Ren
191df2b513 ocfs2: fix a redundant re-initialization
Obviously, memset() has zeroed the whole struct locking_max_version.
So, it's no need to zero its two fields individually.

Link: http://lkml.kernel.org/r/1463970605-18354-1-git-send-email-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Ross Zwisler
6b524995a7 dax: remote unused fault wrappers
Remove the unused wrappers dax_fault() and dax_pmd_fault().  After this
removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and
dax_pmd_fault() respectively, and update all callers.

The dax_fault() and dax_pmd_fault() wrappers were initially intended to
capture some filesystem independent functionality around page faults
(calling sb_start_pagefault() & sb_end_pagefault(), updating file mtime
and ctime).

However, the following commits:

   5726b27b09 ("ext2: Add locking for DAX faults")
   ea3d7209ca ("ext4: fix races between page faults and hole punching")

added locking to the ext2 and ext4 filesystems after these common
operations but before __dax_fault() and __dax_pmd_fault() were called.
This means that these wrappers are no longer used, and are unlikely to
be used in the future.

XFS has had locking analogous to what was recently added to ext2 and
ext4 since DAX support was initially introduced by:

   6b698edeee ("xfs: add DAX file operations support")

Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Linus Torvalds
3fc9d69093 Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "This branch also contains core changes.  I've come to the conclusion
  that from 4.9 and forward, I'll be doing just a single branch.  We
  often have dependencies between core and drivers, and it's hard to
  always split them up appropriately without pulling core into drivers
  when that happens.

  That said, this contains:

   - separate secure erase type for the core block layer, from
     Christoph.

   - set of discard fixes, from Christoph.

   - bio shrinking fixes from Christoph, as a followup up to the
     op/flags change in the core branch.

   - map and append request fixes from Christoph.

   - NVMeF (NVMe over Fabrics) code from Christoph.  This is pretty
     exciting!

   - nvme-loop fixes from Arnd.

   - removal of ->driverfs_dev from Dan, after providing a
     device_add_disk() helper.

   - bcache fixes from Bhaktipriya and Yijing.

   - cdrom subchannel read fix from Vchannaiah.

   - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

   - set of drbd updates and fixes from Fabian, Lars, and Philipp.

   - mg_disk error path fix from Bart.

   - user notification for failed device add for loop, from Minfei.

   - NVMe in general:
        + NVMe delay quirk from Guilherme.
        + SR-IOV support and command retry limits from Keith.
        + fix for memory-less NUMA node from Masayoshi.
        + use UINT_MAX for discard sectors, from Minfei.
        + cancel IO fixes from Ming.
        + don't allocate unused major, from Neil.
        + error code fixup from Dan.
        + use constants for PSDT/FUSE from James.
        + variable init fix from Jay.
        + fabrics fixes from Ming, Sagi, and Wei.
        + various fixes"

* 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
  nvme/pci: Provide SR-IOV support
  nvme: initialize variable before logical OR'ing it
  block: unexport various bio mapping helpers
  scsi/osd: open code blk_make_request
  target: stop using blk_make_request
  block: simplify and export blk_rq_append_bio
  block: ensure bios return from blk_get_request are properly initialized
  virtio_blk: use blk_rq_map_kern
  memstick: don't allow REQ_TYPE_BLOCK_PC requests
  block: shrink bio size again
  block: simplify and cleanup bvec pool handling
  block: get rid of bio_rw and READA
  block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
  block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
  NVMe: don't allocate unused nvme_major
  nvme: avoid crashes when node 0 is memoryless node.
  nvme: Limit command retries
  loop: Make user notify for adding loop device failed
  nvme-loop: fix nvme-loop Kconfig dependencies
  nvmet: fix return value check in nvmet_subsys_alloc()
  ...
2016-07-26 15:37:51 -07:00
Linus Torvalds
d05d7f4079 Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:

   - the big change is the cleanup from Mike Christie, cleaning up our
     uses of command types and modified flags.  This is what will throw
     some merge conflicts

   - regression fix for the above for btrfs, from Vincent

   - following up to the above, better packing of struct request from
     Christoph

   - a 2038 fix for blktrace from Arnd

   - a few trivial/spelling fixes from Bart Van Assche

   - a front merge check fix from Damien, which could cause issues on
     SMR drives

   - Atari partition fix from Gabriel

   - convert cfq to highres timers, since jiffies isn't granular enough
     for some devices these days.  From Jan and Jeff

   - CFQ priority boost fix idle classes, from me

   - cleanup series from Ming, improving our bio/bvec iteration

   - a direct issue fix for blk-mq from Omar

   - fix for plug merging not involving the IO scheduler, like we do for
     other types of merges.  From Tahsin

   - expose DAX type internally and through sysfs.  From Toshi and Yigal

* 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
  block: Fix front merge check
  block: do not merge requests without consulting with io scheduler
  block: Fix spelling in a source code comment
  block: expose QUEUE_FLAG_DAX in sysfs
  block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
  Btrfs: fix comparison in __btrfs_map_block()
  block: atari: Return early for unsupported sector size
  Doc: block: Fix a typo in queue-sysfs.txt
  cfq-iosched: Charge at least 1 jiffie instead of 1 ns
  cfq-iosched: Fix regression in bonnie++ rewrite performance
  cfq-iosched: Convert slice_resid from u64 to s64
  block: Convert fifo_time from ulong to u64
  blktrace: avoid using timespec
  block/blk-cgroup.c: Declare local symbols static
  block/bio-integrity.c: Add #include "blk.h"
  block/partition-generic.c: Remove a set-but-not-used variable
  block: bio: kill BIO_MAX_SIZE
  cfq-iosched: temporarily boost queue priority for idle classes
  block: drbd: avoid to use BIO_MAX_SIZE
  block: bio: remove BIO_MAX_SECTORS
  ...
2016-07-26 15:03:07 -07:00
Linus Torvalds
55392c4c06 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "This update provides the following changes:

   - The rework of the timer wheel which addresses the shortcomings of
     the current wheel (cascading, slow search for next expiring timer,
     etc).  That's the first major change of the wheel in almost 20
     years since Finn implemted it.

   - A large overhaul of the clocksource drivers init functions to
     consolidate the Device Tree initialization

   - Some more Y2038 updates

   - A capability fix for timerfd

   - Yet another clock chip driver

   - The usual pile of updates, comment improvements all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (130 commits)
  tick/nohz: Optimize nohz idle enter
  clockevents: Make clockevents_subsys static
  clocksource/drivers/time-armada-370-xp: Fix return value check
  timers: Implement optimization for same expiry time in mod_timer()
  timers: Split out index calculation
  timers: Only wake softirq if necessary
  timers: Forward the wheel clock whenever possible
  timers/nohz: Remove pointless tick_nohz_kick_tick() function
  timers: Optimize collect_expired_timers() for NOHZ
  timers: Move __run_timers() function
  timers: Remove set_timer_slack() leftovers
  timers: Switch to a non-cascading wheel
  timers: Reduce the CPU index space to 256k
  timers: Give a few structs and members proper names
  hlist: Add hlist_is_singular_node() helper
  signals: Use hrtimer for sigtimedwait()
  timers: Remove the deprecated mod_timer_pinned() API
  timers, net/ipv4/inet: Initialize connection request timers as pinned
  timers, drivers/tty/mips_ejtag: Initialize the poll timer as pinned
  timers, drivers/tty/metag_da: Initialize the poll timer as pinned
  ...
2016-07-25 20:43:12 -07:00
Linus Torvalds
0f657262d5 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
 "Various x86 low level modifications:

   - preparatory work to support virtually mapped kernel stacks (Andy
     Lutomirski)

   - support for 64-bit __get_user() on 32-bit kernels (Benjamin
     LaHaise)

   - (involved) workaround for Knights Landing CPU erratum (Dave Hansen)

   - MPX enhancements (Dave Hansen)

   - mremap() extension to allow remapping of the special VDSO vma, for
     purposes of user level context save/restore (Dmitry Safonov)

   - hweight and entry code cleanups (Borislav Petkov)

   - bitops code generation optimizations and cleanups with modern GCC
     (H. Peter Anvin)

   - syscall entry code optimizations (Paolo Bonzini)"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
  x86/mm/cpa: Add missing comment in populate_pdg()
  x86/mm/cpa: Fix populate_pgd(): Stop trying to deallocate failed PUDs
  x86/syscalls: Add compat_sys_preadv64v2/compat_sys_pwritev64v2
  x86/smp: Remove unnecessary initialization of thread_info::cpu
  x86/smp: Remove stack_smp_processor_id()
  x86/uaccess: Move thread_info::addr_limit to thread_struct
  x86/dumpstack: Rename thread_struct::sig_on_uaccess_error to sig_on_uaccess_err
  x86/uaccess: Move thread_info::uaccess_err and thread_info::sig_on_uaccess_err to thread_struct
  x86/dumpstack: When OOPSing, rewind the stack before do_exit()
  x86/mm/64: In vmalloc_fault(), use CR3 instead of current->active_mm
  x86/dumpstack/64: Handle faults when printing the "Stack: " part of an OOPS
  x86/dumpstack: Try harder to get a call trace on stack overflow
  x86/mm: Remove kernel_unmap_pages_in_pgd() and efi_cleanup_page_tables()
  x86/mm/cpa: In populate_pgd(), don't set the PGD entry until it's populated
  x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
  x86/mm: Use pte_none() to test for empty PTE
  x86/mm: Disallow running with 32-bit PTEs to work around erratum
  x86/mm: Ignore A/D bits in pte/pmd/pud_none()
  x86/mm: Move swap offset/type up in PTE to work around erratum
  x86/entry: Inline enter_from_user_mode()
  ...
2016-07-25 15:34:18 -07:00
Kees Cook
74e630a758 Linux 4.7
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJXlRXSAAoJEHm+PkMAQRiGG/gH/0Z8O4zWOsrwO+X1mRToRDBH
 joFOjAmCVe83T1VpF5LYNB+9+owL/dEDt6+ZIswnhH7AfQPjs4RqwS4PcuMbCDVO
 +mDm0PmfcKaYcQZrB2Z2OwIzRNnfCTVcsDPhIHwuIHk0m4z/xuGZonD8KoAj0+tO
 3yJF6sbE1KubDVjOb+lmZZSP3cXA0pDXrNhkYhE4Tsr8fiihGjeXSNJ8t2zPLjxo
 W3MPqo0rzDvQsOwoF4TWHHagVaFSJlhLBBgqu33fI7uO3jtfQD2G8wG68JCND1j3
 qbMoBfTLFV/yQmSIJUt0Wv1axaCcwnjpweEB35A/GEeZ0mNB1rDdoBeI1eKEQkc=
 =DGFC
 -----END PGP SIGNATURE-----

Merge tag 'v4.7' into for-linus/pstore

Linux 4.7
2016-07-25 13:50:36 -07:00
Jaegeuk Kim
5302fb000d f2fs: clean up coding style and redundancy
This patch includes minor clean-ups.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-25 12:58:12 -07:00
Linus Torvalds
9d0be76f52 Char/Misc driver patches for 4.8-rc1
Here is the big char/misc driver update for 4.8-rc1.
 
 Not a lot of stuff, but it's all over the place, full details are in the
 shortlog below.  All of these have been in linux-next with no reported
 issues for a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iFYEABECABYFAleVPBsPHGdyZWdAa3JvYWguY29tAAoJEDFH1A3bLfspEQgAoJOX
 nSWKA7j4JMGy1v+uNIqsgUmUAJsFyS388N+Faa2K4uyp7CYQ6jaAZw==
 =0Ofd
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver updates from Greg KH:
 "Here is the big char/misc driver update for 4.8-rc1.

  Not a lot of stuff, but it's all over the place, full details are in
  the shortlog.  All of these have been in linux-next with no reported
  issues for a while"

* tag 'char-misc-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (49 commits)
  lkdtm: silence warnings about function declarations
  lkdtm: hide unused functions
  intel_th: pci: Add Kaby Lake PCH-H support
  intel_th: Fix a deadlock in modprobing
  dsp56k: prevent a harmless underflow
  chardev: add missing line break in pr_warn
  lkdtm: use struct arrays instead of enums
  lkdtm: move jprobe entry points to start of source
  lkdtm: reorganize module paramaters
  lkdtm: rename globals for clarity
  lkdtm: rename "count" to "crash_count"
  lkdtm: remove intentional off-by-one array access
  lkdtm: split remaining logic bug tests to separate file
  lkdtm: split heap corruption tests to separate file
  lkdtm: split memory permissions tests to separate file
  lkdtm: split usercopy tests to separate file
  lkdtm: drop "alloc_size" parameter
  lkdtm: add usercopy test for blocking kernel text
  extcon: adc-jack: add suspend/resume support
  extcon: add missing of_node_put after calling of_parse_phandle
  ...
2016-07-24 16:26:26 -07:00
Linus Torvalds
b403f23044 We've got ten patches this time, half of which are related to a plethora
of nasty outcomes when inodes are transitioned from the unlinked state
 to the free state. Small file systems are particularly vulnerable to these
 problems, and it can manifest as mainly hangs, but also file system
 corruption. The patches have been tested for literally many weeks, with a
 very gruelling test, so I have a high level of confidence.
 
 - Andreas Gruenbacher wrote a series of 5 patches for various lockups
   during the transition of inodes from unlinked to free. The main patch
   is titled "Fix gfs2_lookup_by_inum lock inversion" and the other 4 are
   support and cleanup patches related to that.
 - Ben Marzinski contributed 2 patches with regard to a recreatable
   problem when gfs2 tries to write a page to a file that is being
   truncated, resulting in a BUG() in gfs2_remove_from_journal.
   Note that Ben had to export vfs function __block_write_full_page to get
   this to work properly. It's been posted a long time and he talked to
   various VFS people about it, and nobody seemed to mind.
 - I contributed 3 patches. (1) The first one fixes a memory corruptor:
   a race in which one process can overwrite the gl_object pointer set by
   another process, causing kernel panic and other symptoms. (2) The second
   patch fixes another race that resulted in a false-positive BUG_ON. This
   occurred when resource group reservations were freed by one process
   while another process was trying to grab a new reservation in the same
   resource group. (3) The third patch fixes a problem with doing journal
   replay when the journals are not all the same size.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJXklXIAAoJENeLYdPf93o7AbIIAImLEixK+4CaItEArAKG9TXv
 WbO+eDJfo6AOtAteB6+MdX2UxXAHJsCY6RmiEIAi5LzlVFiiCgRo4z/QgDARAw3c
 2RxlndElaESh82S27sLiFbgZeY7GZv04C0t6AzMkc830BLXiKMs6bXfeq1fzW8Sf
 AgAneACVsX0faRWo/XDuQcK81dwZ+qdOnR2+FvtOSFl1KgV0BrtnsW7IHv+5MIot
 SREDN7VvSQwQrLgwMlC0PvhwK3KCVvuO9ZziLEPpYJONESJfEmuCpG265+tUJNTw
 dIcW3p/vvgow8fb56fSnAxaeplPLlF9qJCq1M9fWZrKVbHg2uyCZMx4P52Fnmz4=
 =uUVs
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-4.7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Bob Peterson:
 "We've got ten patches this time, half of which are related to a
  plethora of nasty outcomes when inodes are transitioned from the
  unlinked state to the free state.  Small file systems are particularly
  vulnerable to these problems, and it can manifest as mainly hangs, but
  also file system corruption.  The patches have been tested for
  literally many weeks, with a very gruelling test, so I have a high
  level of confidence.

   - Andreas Gruenbacher wrote a series of five patches for various
     lockups during the transition of inodes from unlinked to free.

     The main patch is titled "Fix gfs2_lookup_by_inum lock inversion"
     and the other four are support and cleanup patches related to that.

   - Ben Marzinski contributed two patches with regard to a recreatable
     problem when gfs2 tries to write a page to a file that is being
     truncated, resulting in a BUG() in gfs2_remove_from_journal.

     Note that Ben had to export vfs function __block_write_full_page to
     get this to work properly.  It's been posted a long time and he
     talked to various VFS people about it, and nobody seemed to mind.

   - I contributed 3 patches:
       o The first one fixes a memory corruptor: a race in which one
         process can overwrite the gl_object pointer set by another
         process, causing kernel panic and other symptoms.
       o The second patch fixes another race that resulted in a
         false-positive BUG_ON.  This occurred when resource group
         reservations were freed by one process while another process
         was trying to grab a new reservation in the same resource
         group.
       o The third patch fixes a problem with doing journal replay when
         the journals are not all the same size"

* tag 'gfs2-4.7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  GFS2: Fix gfs2_replay_incr_blk for multiple journal sizes
  GFS2: Check rs_free with rd_rsspin protection
  gfs2: writeout truncated pages
  fs: export __block_write_full_page
  gfs2: Lock holder cleanup
  gfs2: Large-filesystem fix for 32-bit systems
  gfs2: Get rid of gfs2_ilookup
  gfs2: Fix gfs2_lookup_by_inum lock inversion
  gfs2: Initialize iopen glock holder for new inodes
  GFS2: don't set rgrp gl_object until it's inserted into rgrp tree
2016-07-24 16:07:52 -07:00
David S. Miller
de0ba9a0d8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Just several instances of overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-24 00:53:32 -04:00
Linus Torvalds
88083e9845 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
 "This contains a fix for a potential crash/corruption issue and another
  where the suid/sgid bits weren't cleared on write"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: verify upper dentry in ovl_remove_and_whiteout()
  ovl: Copy up underlying inode's ->i_mode to overlay inode
  ovl: handle ATTR_KILL*
2016-07-23 14:25:02 +09:00
Yunlei He
fe94793e55 f2fs: get victim segment again after new cp
Previous selected segment may become free after write_checkpoint,
if we do garbage collect on this segment, and then new_curseg happen
to reuse it, it may cause f2fs_bug_on as below.

	panic+0x154/0x29c
	do_garbage_collect+0x15c/0xaf4
	f2fs_gc+0x2dc/0x444
	f2fs_balance_fs.part.22+0xcc/0x14c
	f2fs_balance_fs+0x28/0x34
	f2fs_map_blocks+0x5ec/0x790
	f2fs_preallocate_blocks+0xe0/0x100
	f2fs_file_write_iter+0x64/0x11c
	new_sync_write+0xac/0x11c
	vfs_write+0x144/0x1e4
	SyS_write+0x60/0xc0

Here, maybe we check sit and ssa type during reset_curseg. So, we check
segment is stale or not, and select a new victim to avoid this.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-22 11:55:31 -07:00
Maxim Patlasov
cfc9fde0b0 ovl: verify upper dentry in ovl_remove_and_whiteout()
The upper dentry may become stale before we call ovl_lock_rename_workdir.
For example, someone could (mistakenly or maliciously) manually unlink(2)
it directly from upperdir.

To ensure it is not stale, let's lookup it after ovl_lock_rename_workdir
and and check if it matches the upper dentry.

Essentially, it is the same problem and similar solution as in
commit 11f3710417 ("ovl: verify upper dentry before unlink and rename").

Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
2016-07-22 10:54:20 +02:00
Dave Chinner
f2bdfda9a1 Merge branch 'xfs-4.8-misc-fixes-4' into for-next 2016-07-22 14:10:56 +10:00
Dave Chinner
72ccbbe154 xfs: remove EXPERIMENTAL tag from sparse inode feature
Been around for long enough now, hasn't caused any regression test
failures in the past 3 months, so it's time to make it a fully
supported feature.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-22 14:10:18 +10:00
Dave Chinner
28b783e47a xfs: bufferhead chains are invalid after end_page_writeback
In xfs_finish_page_writeback(), we have a loop that looks like this:

        do {
                if (off < bvec->bv_offset)
                        goto next_bh;
                if (off > end)
                        break;
                bh->b_end_io(bh, !error);
next_bh:
                off += bh->b_size;
        } while ((bh = bh->b_this_page) != head);

The b_end_io function is end_buffer_async_write(), which will call
end_page_writeback() once all the buffers have marked as no longer
under IO.  This issue here is that the only thing currently
protecting both the bufferhead chain and the page from being
reclaimed is the PageWriteback state held on the page.

While we attempt to limit the loop to just the buffers covered by
the IO, we still read from the buffer size and follow the next
pointer in the bufferhead chain. There is no guarantee that either
of these are valid after the PageWriteback flag has been cleared.
Hence, loops like this are completely unsafe, and result in
use-after-free issues. One such problem was caught by Calvin Owens
with KASAN:

.....
 INFO: Freed in 0x103fc80ec age=18446651500051355200 cpu=2165122683 pid=-1
  free_buffer_head+0x41/0x90
  __slab_free+0x1ed/0x340
  kmem_cache_free+0x270/0x300
  free_buffer_head+0x41/0x90
  try_to_free_buffers+0x171/0x240
  xfs_vm_releasepage+0xcb/0x3b0
  try_to_release_page+0x106/0x190
  shrink_page_list+0x118e/0x1a10
  shrink_inactive_list+0x42c/0xdf0
  shrink_zone_memcg+0xa09/0xfa0
  shrink_zone+0x2c3/0xbc0
.....
 Call Trace:
  <IRQ>  [<ffffffff81e8b8e4>] dump_stack+0x68/0x94
  [<ffffffff8153a995>] print_trailer+0x115/0x1a0
  [<ffffffff81541174>] object_err+0x34/0x40
  [<ffffffff815436e7>] kasan_report_error+0x217/0x530
  [<ffffffff81543b33>] __asan_report_load8_noabort+0x43/0x50
  [<ffffffff819d651f>] xfs_destroy_ioend+0x3bf/0x4c0
  [<ffffffff819d69d4>] xfs_end_bio+0x154/0x220
  [<ffffffff81de0c58>] bio_endio+0x158/0x1b0
  [<ffffffff81dff61b>] blk_update_request+0x18b/0xb80
  [<ffffffff821baf57>] scsi_end_request+0x97/0x5a0
  [<ffffffff821c5558>] scsi_io_completion+0x438/0x1690
  [<ffffffff821a8d95>] scsi_finish_command+0x375/0x4e0
  [<ffffffff821c3940>] scsi_softirq_done+0x280/0x340


Where the access is occuring during IO completion after the buffer
had been freed from direct memory reclaim.

Prevent use-after-free accidents in this end_io processing loop by
pre-calculating the loop conditionals before calling bh->b_end_io().
The loop is already limited to just the bufferheads covered by the
IO in progress, so the offset checks are sufficient to prevent
accessing buffers in the chain after end_page_writeback() has been
called by the the bh->b_end_io() callout.

Yet another example of why Bufferheads Must Die.

cc: <stable@vger.kernel.org> # 4.7
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reported-and-Tested-by: Calvin Owens <calvinowens@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-22 09:56:38 +10:00
Dave Chinner
b1c5ebb213 xfs: allocate log vector buffers outside CIL context lock
One of the problems we currently have with delayed logging is that
under serious memory pressure we can deadlock memory reclaim. THis
occurs when memory reclaim (such as run by kswapd) is reclaiming XFS
inodes and issues a log force to unpin inodes that are dirty in the
CIL.

The CIL is pushed, but this will only occur once it gets the CIL
context lock to ensure that all committing transactions are complete
and no new transactions start being committed to the CIL while the
push switches to a new context.

The deadlock occurs when the CIL context lock is held by a
committing process that is doing memory allocation for log vector
buffers, and that allocation is then blocked on memory reclaim
making progress. Memory reclaim, however, is blocked waiting for
a log force to make progress, and so we effectively deadlock at this
point.

To solve this problem, we have to move the CIL log vector buffer
allocation outside of the context lock so that memory reclaim can
always make progress when it needs to force the log. The problem
with doing this is that a CIL push can take place while we are
determining if we need to allocate a new log vector buffer for
an item and hence the current log vector may go away without
warning. That means we canot rely on the existing log vector being
present when we finally grab the context lock and so we must have a
replacement buffer ready to go at all times.

To ensure this, introduce a "shadow log vector" buffer that is
always guaranteed to be present when we gain the CIL context lock
and format the item. This shadow buffer may or may not be used
during the formatting, but if the log item does not have an existing
log vector buffer or that buffer is too small for the new
modifications, we swap it for the new shadow buffer and format
the modifications into that new log vector buffer.

The result of this is that for any object we modify more than once
in a given CIL checkpoint, we double the memory required
to track dirty regions in the log. For single modifications then
we consume the shadow log vectorwe allocate on commit, and that gets
consumed by the checkpoint. However, if we make multiple
modifications, then the second transaction commit will allocate a
shadow log vector and hence we will end up with double the memory
usage as only one of the log vectors is consumed by the CIL
checkpoint. The remaining shadow vector will be freed when th elog
item is freed.

This can probably be optimised in future - access to the shadow log
vector is serialised by the object lock (as opposited to the active
log vector, which is controlled by the CIL context lock) and so we
can probably free shadow log vector from some objects when the log
item is marked clean on removal from the AIL.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-22 09:52:35 +10:00
Dave Chinner
160ae76fa1 libxfs: directory node splitting does not have an extra block
xfsprogs source commit 4280e59dcbc4cd8e01585efe788a68eb378048e8

xfs_da3_split() has to handle all three versions of the
directory/attribute btree structure. The attr tree is v1, the dir
tre is v2 or v3. The main difference between the v1 and v2/3 trees
is the way tree nodes are split - in the v1 tree we can require a
double split to occur because the object to be inserted may be
larger than the space made by splitting a leaf. In this case we need
to do a double split - one to split the full leaf, then another to
allocate an empty leaf block in the correct location for the new
entry.  This does not happen with dir (v2/v3) formats as the objects
being inserted are always guaranteed to fit into the new space in
the split blocks.

Indeed, for directories they *may* be an extra block on this buffer
pointer. However, it's guaranteed not to be a leaf block (i.e. a
directory data block) - the directory code only ever places hash
index or free space blocks in this pointer (as a cursor of
sorts), and so to use it as a directory data block will immediately
corrupt the directory.

The problem is that the code assumes that there may be extra blocks
that we need to link into the tree once we've split the root, but
this is not true for either dir or attr trees, because the extra
attr block is always consumed by the last node split before we split
the root. Hence the linking in an extra block is always wrong at the
root split level, and this manifests itself in repair as a directory
corruption in a repaired directory, leaving the directory rebuild
incomplete.

This is a dir v2 zero-day bug - it was in the initial dir v2 commit
that was made back in February 1998.

Fix this by ensuring the linking of the blocks after the root split
never tries to make use of the extra blocks that may be held in the
cursor. They are held there for other purposes and should never be
touched by the root splitting code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-22 09:51:05 +10:00
Arnd Bergmann
f021bd071f xfs: remove dax code from object file when disabled
We check IS_DAX(inode) before calling either xfs_file_dax_read or
xfs_file_dax_write, and this will lead the call being optimized out at
compile time when CONFIG_FS_DAX is disabled.

However, the two functions are marked STATIC, so they become global
symbols when CONFIG_XFS_DEBUG is set, leaving us with two unused global
functions that call into an undefined function and a broken "allmodconfig"
build:

fs/built-in.o: In function `xfs_file_dax_read':
fs/xfs/xfs_file.c:348: undefined reference to `dax_do_io'
fs/built-in.o: In function `xfs_file_dax_write':
fs/xfs/xfs_file.c:758: undefined reference to `dax_do_io'

Marking the two functions 'static noinline' instead of 'STATIC' will let
the compiler drop the symbols when there are no callers but avoid the
implicit inlining.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 16d4d43595 ("xfs: split direct I/O and DAX path")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-22 09:50:55 +10:00
Brian Foster
99579ccec4 xfs: skip dirty pages in ->releasepage()
XFS has had scattered reports of delalloc blocks present at
->releasepage() time. This results in a warning with a stack trace
similar to the following:

 ...
 Call Trace:
  [<ffffffffa23c5b8f>] dump_stack+0x63/0x84
  [<ffffffffa20837a7>] warn_slowpath_common+0x97/0xe0
  [<ffffffffa208380a>] warn_slowpath_null+0x1a/0x20
  [<ffffffffa2326caf>] xfs_vm_releasepage+0x10f/0x140
  [<ffffffffa218c680>] ? page_mkclean_one+0xd0/0xd0
  [<ffffffffa218d3a0>] ? anon_vma_prepare+0x150/0x150
  [<ffffffffa21521c2>] try_to_release_page+0x32/0x50
  [<ffffffffa2166b2e>] shrink_active_list+0x3ce/0x3e0
  [<ffffffffa21671c7>] shrink_lruvec+0x687/0x7d0
  [<ffffffffa21673ec>] shrink_zone+0xdc/0x2c0
  [<ffffffffa2168539>] kswapd+0x4f9/0x970
  [<ffffffffa2168040>] ? mem_cgroup_shrink_node_zone+0x1a0/0x1a0
  [<ffffffffa20a0d99>] kthread+0xc9/0xe0
  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
  [<ffffffffa26b404f>] ret_from_fork+0x3f/0x70
  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100

This occurs because it is possible for shrink_active_list() to send
pages marked dirty to ->releasepage() when certain buffer_head threshold
conditions are met. shrink_active_list() doesn't check the page dirty
state apparently to handle an old ext3 corner case where in some cases
clean pages would not have the dirty bit cleared, thus it is up to the
filesystem to determine how to handle the page.

XFS currently handles the delalloc case properly, but this behavior
makes the warning spurious. Update the XFS ->releasepage() handler to
explicitly skip dirty pages. Retain the existing delalloc/unwritten
checks so we continue to warn if such buffers exist on clean pages when
they shouldn't.

Diagnosed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-22 09:50:38 +10:00
Bob Peterson
e1cb6be9e1 GFS2: Fix gfs2_replay_incr_blk for multiple journal sizes
Before this patch, if you used gfs2_jadd to add new journals of a
size smaller than the existing journals, replaying those new journals
would withdraw. That's because function gfs2_replay_incr_blk was
using the number of journal blocks (jd_block) from the superblock's
journal pointer. In other words, "My journal's max size" rather than
"the journal we're replaying's size." This patch changes the function
to use the size of the pertinent journal rather than always using the
journal we happen to be using.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-07-21 13:02:44 -05:00