Commit graph

43506 commits

Author SHA1 Message Date
Ross Zwisler
30f471fd88 dax: check return value of dax_radix_entry()
dax_pfn_mkwrite() previously wasn't checking the return value of the
call to dax_radix_entry(), which was a mistake.

Instead, capture this return value and return the appropriate VM_FAULT_
value.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-09 15:43:42 -08:00
Jan Kara
566e8dfd88 ocfs2: fix return value from ocfs2_page_mkwrite()
ocfs2_page_mkwrite() could mistakenly return error code instead of
mkwrite status value.  Fix it.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-09 15:43:42 -08:00
Jan Kara
600be30a8b ext4: remove i_ioend_count
Remove counter of pending io ends as it is unused.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08 23:39:21 -05:00
Jan Kara
109811c20f ext4: simplify io_end handling for AIO DIO
When mapping blocks for direct IO, we allocate io_end structure before
mapping blocks and store pointer to it in the inode. This creates a
requirement that any AIO DIO using io_end must be protected by i_mutex.
This created problems in the past with dioread_nolock mode which was
corrupting io_end pointers. Also io_end is allocated unnecessarily in
case where we don't need to convert any extents (which is a common case
for example when overwriting file).

We fix the problem by allocating io_end only once we return unwritten
extent from block mapping function for AIO DIO (so we can save some
pointless io_end allocations) and we pass pointer to it in bh->b_private
which generic DIO code later passes to our end IO callback. That way we
remove any need for global pointer to io_end structure and thus fix the
races.

The downside of this change is that the checking for unwritten IO in
flight in ext4_extents_can_be_merged() is more racy since we now
increment i_unwritten / set EXT4_STATE_DIO_UNWRITTEN only after dropping
i_data_sem. However the check has been racy already before because
ext4_writepages() already increment i_unwritten after dropping
i_data_sem and reserved blocks save us from hitting ENOSPC in the worst
case.

Signed-off-by: Jan Kara <jack@suse.cz>
2016-03-08 23:36:46 -05:00
Jan Kara
efe70c2951 ext4: move trans handling and completion deferal out of _ext4_get_block
There is no need to handle starting of a transaction and deferal of DIO
completion in _ext4_get_block() function. We can move this out to get
block functions for direct IO that need it. That way we can add stricter
checks verifying things work as we expect.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08 23:35:46 -05:00
Jan Kara
705965bd6d ext4: rename and split get blocks functions
Rename ext4_get_blocks_write() to ext4_get_blocks_unwritten() to better
describe what it does. Also split out get blocks functions for direct
IO. Later we move functionality from _ext4_get_blocks() there. There's no
functional change in this patch.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08 23:08:10 -05:00
Jan Kara
e142d05263 ext4: use i_mutex to serialize unaligned AIO DIO
Currently we've used hashed aio_mutex to serialize unaligned AIO DIO.
However the code cleanups that happened after 2011 when the lock was
introduced made aio_mutex acquired at almost the same places where we
already have exclusion using i_mutex. So just use i_mutex for the
exclusion of unaligned AIO DIO.

The change moves waiting for pending unwritten extent conversion under
i_mutex. That makes special handling of O_APPEND writes unnecessary and
also avoids possible livelocking of unaligned AIO DIO with aligned one
(nothing was preventing contiguous stream of aligned AIO DIOs to let
unaligned AIO DIO wait forever).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08 22:44:50 -05:00
Jan Kara
3bd6ad7b68 ext4: pack ioend structure better
On 64-bit architectures we have two 4-byte holes in struct ext4_io_end.
Order entries better to avoid this and thus make the structure occupy
64 instead of 72 bytes for 64-bit architectures.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08 22:26:39 -05:00
Al Viro
f93812846f jffs2: reduce the breakage on recovery from halfway failed rename()
d_instantiate(new_dentry, old_inode) is absolutely wrong thing to
do - it will oops if new_dentry used to be positive, for starters.
What we need is d_invalidate() the target and be done with that.

Cc: stable@vger.kernel.org # v3.18+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-07 23:07:10 -05:00
Al Viro
803c00123a ncpfs: fix a braino in OOM handling in ncp_fill_cache()
Failing to allocate an inode for child means that cache for *parent* is
incompletely populated.  So it's parent directory inode ('dir') that
needs NCPI_DIR_CACHE flag removed, *not* the child inode ('inode', which
is what we'd failed to allocate in the first place).

Fucked-up-in: commit 5e993e25 ("ncpfs: get rid of d_validate() nonsense")
Fucked-up-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org # v3.19
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-07 22:25:16 -05:00
Linus Torvalds
01ffa3df22 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
 "Overlayfs bug fixes.  All marked as -stable material"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: copy new uid/gid into overlayfs runtime inode
  ovl: ignore lower entries when checking purity of non-directory entries
  ovl: fix getcwd() failure after unsuccessful rmdir
  ovl: fix working on distributed fs as lower layer
2016-03-07 15:23:25 -08:00
Ingo Molnar
ec87e1cf7d Linux 4.5-rc7
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJW3LO0AAoJEHm+PkMAQRiGhewIAIVHA1+qSSXEHTFeuLRuYpiz
 +ptQUIjPJdakWm/XqOnwSG8SWUuD4XL6ysfNmLSZIdqXYBAPpAuwT1UA2FZhz0dN
 soZxMNleAvzHWRDFLqwjVdOVlTxS6CTTdEQNzi+3R0ZCADllsRcuj/GBIY+M8cr6
 LvxK8BnhDU+Au3gZQjaujTMO7fKG6gOq4wKz/U7RIG37A6rwW577kEfLg4ZgFwt9
 RVjsky5mrX9+4l3QFtox9ZC383P/0VZ6+vXwN2QH1/joDK4EvA8pCwsGTyjRJiqi
 fArHbS+mHyAtbPWJmDbVlQ5dkZJAqRgtWBydjQYoC16S4Bwdce2/FbhBiTgEQAo=
 =sqln
 -----END PGP SIGNATURE-----

Merge tag 'v4.5-rc7' into x86/asm, to pick up SMAP fix

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-07 09:27:30 +01:00
Brian Foster
7f6aff3a29 xfs: only run torn log write detection on dirty logs
XFS uses CRC verification over a sub-range of the head of the log to
detect and handle torn writes. This torn log write detection currently
runs unconditionally at mount time, regardless of whether the log is
dirty or clean. This is problematic in cases where a filesystem might
end up being moved across different, incompatible (i.e., opposite
byte-endianness) architectures.

The problem lies in the fact that log data is not necessarily written in
an architecture independent format. For example, certain bits of data
are written in native endian format. Further, the size of certain log
data structures differs (i.e., struct xlog_rec_header) depending on the
word size of the cpu. This leads to false positive crc verification
errors and ultimately failed mounts when a cleanly unmounted filesystem
is mounted on a system with an incompatible architecture from data that
was written near the head of the log.

Update the log head/tail discovery code to run torn write detection only
when the log is not clean. This means something other than an unmount
record resides at the head of the log and log recovery is imminent. It
is a requirement to run log recovery on the same type of host that had
written the content of the dirty log and therefore CRC failures are
legitimate corruptions in that scenario.

Reported-by: Jan Beulich <JBeulich@suse.com>
Tested-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07 08:22:22 +11:00
Brian Foster
717bc0ebca xfs: refactor in-core log state update to helper
Once the record at the head of the log is identified and verified, the
in-core log state is updated based on the record. This includes
information such as the current head block and cycle, the start block of
the last record written to the log, the tail lsn, etc.

Once torn write detection is conditional, this logic will need to be
reused. Factor the code to update the in-core log data structures into a
new helper function. This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07 08:22:22 +11:00
Brian Foster
65b99a08b3 xfs: refactor unmount record detection into helper
Once the mount sequence has identified the head and tail blocks of the
physical log, the record at the head of the log is located and examined
for an unmount record to determine if the log is clean. This currently
occurs after torn write verification of the head region of the log.

This must ultimately be separated from torn write verification and may
need to be called again if the log head is walked back due to a torn
write (to determine whether the new head record is an unmount record).
Separate this logic into a new helper function. This patch does not
change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07 08:22:22 +11:00
Brian Foster
82ff6cc26e xfs: separate log head record discovery from verification
The code that locates the log record at the head of the log is buried in
the log head verification function. This is fine when torn write
verification occurs unconditionally, but this behavior is problematic
for filesystems that might be moved across systems with different
architectures.

In preparation for separating examination of the log head for unmount
records from torn write detection, lift the record location logic out of
the log verification function and into the caller. This patch does not
change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07 08:22:22 +11:00
Linus Torvalds
21b27a74ec Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph fix from Sage Weil:
 "This is a final commit we missed to align the protocol compatibility
  with the feature bits.

  It decodes a few extra fields in two different messages and reports
  EIO when they are used (not yet supported)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 support
2016-03-06 11:31:13 -08:00
Christoph Hellwig
1ae1602de0 configfs: switch ->default groups to a linked list
Replace the current NULL-terminated array of default groups with a linked
list.  This gets rid of lots of nasty code to size and/or dynamically
allocate the array.

While we're at it also provide a conveniant helper to remove the default
groups.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Felipe Balbi <balbi@kernel.org>		[drivers/usb/gadget]
Acked-by: Joel Becker <jlbec@evilplan.org>
Acked-by: Nicholas Bellinger <nab@linux-iscsi.org>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
2016-03-06 16:11:24 +01:00
Al Viro
6c51e513a3 lookup_dcache(): lift d_alloc() into callers
... and kill need_lookup thing

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-05 20:09:32 -05:00
Al Viro
6583fe22d1 do_last(): reorder and simplify a bit
bugger off on negatives a bit earlier, simplify the tests

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-05 18:14:03 -05:00
Al Viro
05ef1c50e7 Merge branch 'for-linus' into work.lookups
for the sake of namei.c fixes
2016-03-05 18:10:51 -05:00
Linus Torvalds
e5322c5406 Merge branch 'for-linus2' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "Round 2 of this.  I cut back to the bare necessities, the patch is
  still larger than it usually would be at this time, due to the number
  of NVMe fixes in there.  This pull request contains:

   - The 4 core fixes from Ming, that fix both problems with exceeding
     the virtual boundary limit in case of merging, and the gap checking
     for cloned bio's.

   - NVMe fixes from Keith and Christoph:

        - Regression on larger user commands, causing problems with
          reading log pages (for instance). This touches both NVMe,
          and the block core since that is now generally utilized also
          for these types of commands.

        - Hot removal fixes.

        - User exploitable issue with passthrough IO commands, if !length
          is given, causing us to fault on writing to the zero
          page.

        - Fix for a hang under error conditions

   - And finally, the current series regression for umount with cgroup
     writeback, where the final flush would happen async and hence open
     up window after umount where the device wasn't consistent.  fsck
     right after umount would show this.  From Tejun"

* 'for-linus2' of git://git.kernel.dk/linux-block:
  block: support large requests in blk_rq_map_user_iov
  block: fix blk_rq_get_max_sectors for driver private requests
  nvme: fix max_segments integer truncation
  nvme: set queue limits for the admin queue
  writeback: flush inode cgroup wb switches instead of pinning super_block
  NVMe: Fix 0-length integrity payload
  NVMe: Don't allow unsupported flags
  NVMe: Move error handling to failed reset handler
  NVMe: Simplify device reset failure
  NVMe: Fix namespace removal deadlock
  NVMe: Use IDA for namespace disk naming
  NVMe: Don't unmap controller registers on reset
  block: merge: get the 1st and last bvec via helpers
  block: get the 1st and last bvec via helpers
  block: check virt boundary in bio_will_gap()
  block: bio: introduce helpers to get the 1st and last bvec
2016-03-04 18:17:17 -08:00
Linus Torvalds
c51797d25d This contains two important JFFS2 fixes marked for stable:
• a lock ordering problem between the page lock and the internal f->sem
    mutex, which was causing occasional deadlocks in garbage collection, and
  • a scan failure causing moved directories to sometimes end up appearing
    to have hard links.
 
 There are also a couple of trivial MAINTAINERS file updates.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iEYEABECAAYFAlbaGIsACgkQdwG7hYl686OpGQCgu0l4E7cQ/v1Af9kZatj6fnzN
 LvcAnR3SzmiH1jxNGSY7C1mUQWosRl/9
 =Ker9
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20160304' of git://git.infradead.org/linux-mtd

Pull jffs2 fixes from David Woodhouse:
 "This contains two important JFFS2 fixes marked for stable:

   - a lock ordering problem between the page lock and the internal
     f->sem mutex, which was causing occasional deadlocks in garbage
     collection

   - a scan failure causing moved directories to sometimes end up
     appearing to have hard links.

  There are also a couple of trivial MAINTAINERS file updates"

* tag 'for-linus-20160304' of git://git.infradead.org/linux-mtd:
  MAINTAINERS: add maintainer entry for FREESCALE GPMI NAND driver
  Fix directory hardlinks from deleted directories
  jffs2: Fix page lock / f->sem deadlock
  Revert "jffs2: Fix lock acquisition order bug in jffs2_write_begin"
  MAINTAINERS: update Han's email
2016-03-04 17:36:46 -08:00
Linus Torvalds
2cdcb2b5b5 Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "Filipe nailed down a problem where tree log replay would do some work
  that orphan code wasn't expecting to be done yet, leading to BUG_ON"

* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix loading of orphan roots leading to BUG_ON
2016-03-04 17:31:32 -08:00
Yan, Zheng
5ea5c5e0a7 ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 support
Add support for the format change of MClientReply/MclientCaps.
Also add code that denies access to inodes with pool_ns layouts.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-03-04 21:00:37 +01:00
Christoph Hellwig
c43c83a294 direct-io: only use block polling if explicitly requested
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-04 12:20:10 -05:00
Christoph Hellwig
97be7ebe53 vfs: add the RWF_HIPRI flag for preadv2/pwritev2
This adds a flag that tells the file system that this is a high priority
request for which it's worth to poll the hardware.  The flag is purely
advisory and can be ignored if not supported.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-04 12:20:10 -05:00
Milosz Tanski
f17d8b3545 vfs: vfs: Define new syscalls preadv2,pwritev2
New syscalls that take an flag argument.   No flags are added yet in this
patch.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
[hch: rebased on top of my kiocb changes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-04 12:20:10 -05:00
Christoph Hellwig
793b80ef14 vfs: pass a flags argument to vfs_readv/vfs_writev
This way we can set kiocb flags also from the sync read/write path for
the read_iter/write_iter operations.  For now there is no way to pass
flags to plain read/write operations as there is no real need for that,
and all flags passed are explicitly rejected for these files.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
[hch: rebased on top of my kiocb changes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-04 12:20:10 -05:00
Filipe Manana
909c3a22da Btrfs: fix loading of orphan roots leading to BUG_ON
When looking for orphan roots during mount we can end up hitting a
BUG_ON() (at root-item.c:btrfs_find_orphan_roots()) if a log tree is
replayed and qgroups are enabled. This is because after a log tree is
replayed, a transaction commit is made, which triggers qgroup extent
accounting which in turn does backref walking which ends up reading and
inserting all roots in the radix tree fs_info->fs_root_radix, including
orphan roots (deleted snapshots). So after the log tree is replayed, when
finding orphan roots we hit the BUG_ON with the following trace:

[118209.182438] ------------[ cut here ]------------
[118209.183279] kernel BUG at fs/btrfs/root-tree.c:314!
[118209.184074] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[118209.185123] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic ppdev xor raid6_pq evdev sg parport_pc parport acpi_cpufreq tpm_tis tpm psmouse
processor i2c_piix4 serio_raw pcspkr i2c_core button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata
virtio_pci virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
[118209.186318] CPU: 14 PID: 28428 Comm: mount Tainted: G        W       4.5.0-rc5-btrfs-next-24+ #1
[118209.186318] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[118209.186318] task: ffff8801ec131040 ti: ffff8800af34c000 task.ti: ffff8800af34c000
[118209.186318] RIP: 0010:[<ffffffffa04237d7>]  [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs]
[118209.186318] RSP: 0018:ffff8800af34faa8  EFLAGS: 00010246
[118209.186318] RAX: 00000000ffffffef RBX: 00000000ffffffef RCX: 0000000000000001
[118209.186318] RDX: 0000000080000000 RSI: 0000000000000001 RDI: 00000000ffffffff
[118209.186318] RBP: ffff8800af34fb08 R08: 0000000000000001 R09: 0000000000000000
[118209.186318] R10: ffff8800af34f9f0 R11: 6db6db6db6db6db7 R12: ffff880171b97000
[118209.186318] R13: ffff8801ca9d65e0 R14: ffff8800afa2e000 R15: 0000160000000000
[118209.186318] FS:  00007f5bcb914840(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
[118209.186318] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[118209.186318] CR2: 00007f5bcaceb5d9 CR3: 00000000b49b5000 CR4: 00000000000006e0
[118209.186318] Stack:
[118209.186318]  fffffbffffffffff 010230ffffffffff 0101000000000000 ff84000000000000
[118209.186318]  fbffffffffffffff 30ffffffffffffff 0000000000000101 ffff880082348000
[118209.186318]  0000000000000000 ffff8800afa2e000 ffff8800afa2e000 0000000000000000
[118209.186318] Call Trace:
[118209.186318]  [<ffffffffa042e2db>] open_ctree+0x1e37/0x21b9 [btrfs]
[118209.186318]  [<ffffffffa040a753>] btrfs_mount+0x97e/0xaed [btrfs]
[118209.186318]  [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf
[118209.186318]  [<ffffffff8117b87e>] mount_fs+0x67/0x131
[118209.186318]  [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde
[118209.186318]  [<ffffffffa0409f81>] btrfs_mount+0x1ac/0xaed [btrfs]
[118209.186318]  [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf
[118209.186318]  [<ffffffff8108c26b>] ? lockdep_init_map+0xb9/0x1b3
[118209.186318]  [<ffffffff8117b87e>] mount_fs+0x67/0x131
[118209.186318]  [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde
[118209.186318]  [<ffffffff81195637>] do_mount+0x8a6/0x9e8
[118209.186318]  [<ffffffff8119598d>] SyS_mount+0x77/0x9f
[118209.186318]  [<ffffffff81493017>] entry_SYSCALL_64_fastpath+0x12/0x6b
[118209.186318] Code: 64 00 00 85 c0 89 c3 75 24 f0 41 80 4c 24 20 20 49 8b bc 24 f0 01 00 00 4c 89 e6 e8 e8 65 00 00 85 c0 89 c3 74 11 83 f8 ef 75 02 <0f> 0b
4c 89 e7 e8 da 72 00 00 eb 1c 41 83 bc 24 00 01 00 00 00
[118209.186318] RIP  [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs]
[118209.186318]  RSP <ffff8800af34faa8>
[118209.230735] ---[ end trace 83938f987d85d477 ]---

So fix this by not treating the error -EEXIST, returned when attempting
to insert a root already inserted by the backref walking code, as an error.

The following test case for xfstests reproduces the bug:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      cd /
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_dm_target flakey
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  _run_btrfs_util_prog quota enable $SCRATCH_MNT

  # Create 2 directories with one file in one of them.
  # We use these just to trigger a transaction commit later, moving the file from
  # directory a to directory b and doing an fsync against directory a.
  mkdir $SCRATCH_MNT/a
  mkdir $SCRATCH_MNT/b
  touch $SCRATCH_MNT/a/f
  sync

  # Create our test file with 2 4K extents.
  $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io

  # Create a snapshot and delete it. This doesn't really delete the snapshot
  # immediately, just makes it inaccessible and invisible to user space, the
  # snapshot is deleted later by a dedicated kernel thread (cleaner kthread)
  # which is woke up at the next transaction commit.
  # A root orphan item is inserted into the tree of tree roots, so that if a
  # power failure happens before the dedicated kernel thread does the snapshot
  # deletion, the next time the filesystem is mounted it resumes the snapshot
  # deletion.
  _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap
  _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap

  # Now overwrite half of the extents we wrote before. Because we made a snapshpot
  # before, which isn't really deleted yet (since no transaction commit happened
  # after we did the snapshot delete request), the non overwritten extents get
  # referenced twice, once by the default subvolume and once by the snapshot.
  $XFS_IO_PROG -c "pwrite -S 0xbb 4K 8K" $SCRATCH_MNT/foobar | _filter_xfs_io

  # Now move file f from directory a to directory b and fsync directory a.
  # The fsync on the directory a triggers a transaction commit (because a file
  # was moved from it to another directory) and the file fsync leaves a log tree
  # with file extent items to replay.
  mv $SCRATCH_MNT/a/f $SCRATCH_MNT/a/b
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar

  echo "File digest before power failure:"
  md5sum $SCRATCH_MNT/foobar | _filter_scratch

  # Now simulate a power failure and mount the filesystem to replay the log tree.
  # After the log tree was replayed, we used to hit a BUG_ON() when processing
  # the root orphan item for the deleted snapshot. This is because when processing
  # an orphan root the code expected to be the first code inserting the root into
  # the fs_info->fs_root_radix radix tree, while in reallity it was the second
  # caller attempting to do it - the first caller was the transaction commit that
  # took place after replaying the log tree, when updating the qgroup counters.
  _flakey_drop_and_remount

  echo "File digest before after failure:"
  # Must match what he got before the power failure.
  md5sum $SCRATCH_MNT/foobar | _filter_scratch

  _unmount_flakey
  status=0
  exit

Fixes: 2d9e977610 ("Btrfs: use btrfs_get_fs_root in resolve_indirect_ref")
Cc: stable@vger.kernel.org  # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-03-03 15:28:59 -08:00
Shaohua Li
3684aa7099 block-dev: enable writeback cgroup support
block_dev's .writepages/.writepage already handles
wbc_init_bio/wbc_account_io. We only set the SB_I_CGROUPWB bit to
suppport writeback cgroup support.

Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03 14:50:53 -07:00
Tejun Heo
a1a0e23e49 writeback: flush inode cgroup wb switches instead of pinning super_block
If cgroup writeback is in use, inodes can be scheduled for
asynchronous wb switching.  Before 5ff8eaac16 ("writeback: keep
superblock pinned during cgroup writeback association switches"), this
could race with umount leading to super_block being destroyed while
inodes are pinned for wb switching.  5ff8eaac16 fixed it by bumping
s_active while wb switches are in flight; however, this allowed
in-flight wb switches to make umounts asynchronous when the userland
expected synchronosity - e.g. fsck immediately following umount may
fail because the device is still busy.

This patch removes the problematic super_block pinning and instead
makes generic_shutdown_super() flush in-flight wb switches.  wb
switches are now executed on a dedicated isw_wq so that they can be
flushed and isw_nr_in_flight keeps track of the number of in-flight wb
switches so that flushing can be avoided in most cases.

v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE
    check in inode_switch_wbs() as Jan an Al suggested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tahsin Erdogan <tahsin@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: 5ff8eaac16 ("writeback: keep superblock pinned during cgroup writeback association switches")
Cc: stable@vger.kernel.org #v4.5
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03 14:42:50 -07:00
Konstantin Khlebnikov
b81de061fa ovl: copy new uid/gid into overlayfs runtime inode
Overlayfs must update uid/gid after chown, otherwise functions
like inode_owner_or_capable() will check user against stale uid.
Catched by xfstests generic/087, it chowns file and calls utimes.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>
2016-03-03 17:17:46 +01:00
Konstantin Khlebnikov
45d1173896 ovl: ignore lower entries when checking purity of non-directory entries
After rename file dentry still holds reference to lower dentry from
previous location. This doesn't matter for data access because data comes
from upper dentry. But this stale lower dentry taints dentry at new
location and turns it into non-pure upper. Such file leaves visible
whiteout entry after remove in directory which shouldn't have whiteouts at
all.

Overlayfs already tracks pureness of file location in oe->opaque.  This
patch just uses that for detecting actual path type.

Comment from Vivek Goyal's patch:

Here are the details of the problem. Do following.

$ mkdir upper lower work merged upper/dir/
$ touch lower/test
$ sudo mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=
work merged
$ mv merged/test merged/dir/
$ rm merged/dir/test
$ ls -l merged/dir/
/usr/bin/ls: cannot access merged/dir/test: No such file or directory
total 0
c????????? ? ? ? ?            ? test

Basic problem seems to be that once a file has been unlinked, a whiteout
has been left behind which was not needed and hence it becomes visible.

Whiteout is visible because parent dir is of not type MERGE, hence
od->is_real is set during ovl_dir_open(). And that means ovl_iterate()
passes on iterate handling directly to underlying fs. Underlying fs does
not know/filter whiteouts so it becomes visible to user.

Why did we leave a whiteout to begin with when we should not have.
ovl_do_remove() checks for OVL_TYPE_PURE_UPPER() and does not leave
whiteout if file is pure upper. In this case file is not found to be pure
upper hence whiteout is left.

So why file was not PURE_UPPER in this case? I think because dentry is
still carrying some leftover state which was valid before rename. For
example, od->numlower was set to 1 as it was a lower file. After rename,
this state is not valid anymore as there is no such file in lower.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Reported-by: Viktor Stanchev <me@viktorstanchev.com>
Suggested-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=109611
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>
2016-03-03 17:17:45 +01:00
Rui Wang
ce9113bbcb ovl: fix getcwd() failure after unsuccessful rmdir
ovl_remove_upper() should do d_drop() only after it successfully
removes the dir, otherwise a subsequent getcwd() system call will
fail, breaking userspace programs.

This is to fix: https://bugzilla.kernel.org/show_bug.cgi?id=110491

Signed-off-by: Rui Wang <rui.y.wang@intel.com>
Reviewed-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>
2016-03-03 17:17:45 +01:00
Konstantin Khlebnikov
b5891cfab0 ovl: fix working on distributed fs as lower layer
This adds missing .d_select_inode into alternative dentry_operations.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Fixes: 7c03b5d45b ("ovl: allow distributed fs as lower layer")
Fixes: 4bacc9c923 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Reviewed-by: Nikolay Borisov <kernel@kyup.com>
Tested-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org> # 4.2+
2016-03-03 17:17:45 +01:00
Linus Torvalds
12f1d7e493 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
 "Various small CIFS/SMB3 fixes for stable:

  Fixes address oops that can occur when accessing Macs with SMB3, and
  another problem found to Samba when read responses queued (e.g. with
  gluster under Samba)"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: Fix duplicate line introduced by clone_file_range patch
  Fix cifs_uniqueid_to_ino_t() function for s390x
  CIFS: Fix SMB2+ interim response processing for read requests
  cifs: fix out-of-bounds access in lease parsing
2016-03-02 09:15:21 -08:00
Linus Torvalds
39680f50ae userfaultfd: don't block on the last VM updates at exit time
The exit path will do some final updates to the VM of an exiting process
to inform others of the fact that the process is going away.

That happens, for example, for robust futex state cleanup, but also if
the parent has asked for a TID update when the process exits (we clear
the child tid field in user space).

However, at the time we do those final VM accesses, we've already
stopped accepting signals, so the usual "stop waiting for userfaults on
signal" code in fs/userfaultfd.c no longer works, and the process can
become an unkillable zombie waiting for something that will never
happen.

To solve this, just make handle_userfault() abort any user fault
handling if we're already in the exit path past the signal handling
state being dead (marked by PF_EXITING).

This VM special case is pretty ugly, and it is possible that we should
look at finalizing signals later (or move the VM final accesses
earlier).  But in the meantime this is a fairly minimally intrusive fix.

Reported-and-tested-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-02 09:03:18 -08:00
Greg Kroah-Hartman
523462df28 Merge 4.5-rc6 into char-misc-next
We want the fixes in here, and others are sending us pull requests based
on this kernel tree.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-01 16:38:16 -08:00
Linus Torvalds
f691b77b1f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull d_inode/d_flags race fix from Al Viro.

I love this fix.  Not only does it fix the race in the dentry type
handling, it entirely gets rid of the nasty and subtle memory ordering
rules for d_type and d_inode, and replaces them with the basic dentry
locking rules (sequence numbers under RCU, d_lock elsewhere).

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  use ->d_seq to get coherency between ->d_inode and ->d_flags
2016-03-01 15:30:45 -08:00
Steve French
9589995e46 CIFS: Fix duplicate line introduced by clone_file_range patch
Commit 04b38d6012 ("vfs: pull btrfs clone API to vfs layer")
added a duplicated line (in cifsfs.c) which causes a sparse compile
warning.

Signed-off-by: Steve French <steve.french@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-03-01 09:38:00 -06:00
Al Viro
a528aca7f3 use ->d_seq to get coherency between ->d_inode and ->d_flags
Games with ordering and barriers are way too brittle.  Just
bump ->d_seq before and after updating ->d_inode and ->d_flags
type bits, so that verifying ->d_seq would guarantee they are
coherent.

Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-29 12:16:43 -05:00
Yadan Fan
1ee9f4bd1a Fix cifs_uniqueid_to_ino_t() function for s390x
This issue is caused by commit 02323db17e ("cifs: fix
cifs_uniqueid_to_ino_t not to ever return 0"), when BITS_PER_LONG
is 64 on s390x, the corresponding cifs_uniqueid_to_ino_t()
function will cast 64-bit fileid to 32-bit by using (ino_t)fileid,
because ino_t (typdefed __kernel_ino_t) is int type.

It's defined in arch/s390/include/uapi/asm/posix_types.h

    #ifndef __s390x__

    typedef unsigned long   __kernel_ino_t;
    ...
    #else /* __s390x__ */

    typedef unsigned int    __kernel_ino_t;

So the #ifdef condition is wrong for s390x, we can just still use
one cifs_uniqueid_to_ino_t() function with comparing sizeof(ino_t)
and sizeof(u64) to choose the correct execution accordingly.

Signed-off-by: Yadan Fan <ydfan@suse.com>
CC: stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-02-29 00:46:55 -06:00
Pavel Shilovsky
6cc3b24235 CIFS: Fix SMB2+ interim response processing for read requests
For interim responses we only need to parse a header and update
a number credits. Now it is done for all SMB2+ command except
SMB2_READ which is wrong. Fix this by adding such processing.

Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Tested-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-02-29 00:21:36 -06:00
Justin Maggard
deb7deff2f cifs: fix out-of-bounds access in lease parsing
When opening a file, SMB2_open() attempts to parse the lease state from the
SMB2 CREATE Response.  However, the parsing code was not careful to ensure
that the create contexts are not empty or invalid, which can lead to out-
of-bounds memory access.  This can be seen easily by trying
to read a file from a OSX 10.11 SMB3 server.  Here is sample crash output:

BUG: unable to handle kernel paging request at ffff8800a1a77cc6
IP: [<ffffffff8828a734>] SMB2_open+0x804/0x960
PGD 8f77067 PUD 0
Oops: 0000 [#1] SMP
Modules linked in:
CPU: 3 PID: 2876 Comm: cp Not tainted 4.5.0-rc3.x86_64.1+ #14
Hardware name: NETGEAR ReadyNAS 314          /ReadyNAS 314          , BIOS 4.6.5 10/11/2012
task: ffff880073cdc080 ti: ffff88005b31c000 task.ti: ffff88005b31c000
RIP: 0010:[<ffffffff8828a734>]  [<ffffffff8828a734>] SMB2_open+0x804/0x960
RSP: 0018:ffff88005b31fa08  EFLAGS: 00010282
RAX: 0000000000000015 RBX: 0000000000000000 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffff88007eb8c8b0
RBP: ffff88005b31fad8 R08: 666666203d206363 R09: 6131613030383866
R10: 3030383866666666 R11: 00000000000002b0 R12: ffff8800660fd800
R13: ffff8800a1a77cc2 R14: 00000000424d53fe R15: ffff88005f5a28c0
FS:  00007f7c8a2897c0(0000) GS:ffff88007eb80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff8800a1a77cc6 CR3: 000000005b281000 CR4: 00000000000006e0
Stack:
 ffff88005b31fa70 ffffffff88278789 00000000000001d3 ffff88005f5a2a80
 ffffffff00000003 ffff88005d029d00 ffff88006fde05a0 0000000000000000
 ffff88005b31fc78 ffff88006fde0780 ffff88005b31fb2f 0000000100000fe0
Call Trace:
 [<ffffffff88278789>] ? cifsConvertToUTF16+0x159/0x2d0
 [<ffffffff8828cf68>] smb2_open_file+0x98/0x210
 [<ffffffff8811e80c>] ? __kmalloc+0x1c/0xe0
 [<ffffffff882685f4>] cifs_open+0x2a4/0x720
 [<ffffffff88122cef>] do_dentry_open+0x1ff/0x310
 [<ffffffff88268350>] ? cifsFileInfo_get+0x30/0x30
 [<ffffffff88123d92>] vfs_open+0x52/0x60
 [<ffffffff88131dd0>] path_openat+0x170/0xf70
 [<ffffffff88097d48>] ? remove_wait_queue+0x48/0x50
 [<ffffffff88133a29>] do_filp_open+0x79/0xd0
 [<ffffffff8813f2ca>] ? __alloc_fd+0x3a/0x170
 [<ffffffff881240c4>] do_sys_open+0x114/0x1e0
 [<ffffffff881241a9>] SyS_open+0x19/0x20
 [<ffffffff8896e257>] entry_SYSCALL_64_fastpath+0x12/0x6a
Code: 4d 8d 6c 07 04 31 c0 4c 89 ee e8 47 6f e5 ff 31 c9 41 89 ce 44 89 f1 48 c7 c7 28 b1 bd 88 31 c0 49 01 cd 4c 89 ee e8 2b 6f e5 ff <45> 0f b7 75 04 48 c7 c7 31 b1 bd 88 31 c0 4d 01 ee 4c 89 f6 e8
RIP  [<ffffffff8828a734>] SMB2_open+0x804/0x960
 RSP <ffff88005b31fa08>
CR2: ffff8800a1a77cc6
---[ end trace d9f69ba64feee469 ]---

Signed-off-by: Justin Maggard <jmaggard@netgear.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
2016-02-29 00:21:31 -06:00
Linus Torvalds
12b9fa6a97 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  do_last(): ELOOP failure exit should be done after leaving RCU mode
  should_follow_link(): validate ->d_seq after having decided to follow
  namei: ->d_inode of a pinned dentry is stable only for positives
  do_last(): don't let a bogus return value from ->open() et.al. to confuse us
  fs: return -EOPNOTSUPP if clone is not supported
  hpfs: don't truncate the file when delete fails
2016-02-27 17:10:32 -08:00
Al Viro
5129fa482b do_last(): ELOOP failure exit should be done after leaving RCU mode
... or we risk seeing a bogus value of d_is_symlink() there.

Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-27 19:37:37 -05:00
Al Viro
a7f775428b should_follow_link(): validate ->d_seq after having decided to follow
... otherwise d_is_symlink() above might have nothing to do with
the inode value we've got.

Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-27 19:31:01 -05:00
Al Viro
d4565649b6 namei: ->d_inode of a pinned dentry is stable only for positives
both do_last() and walk_component() risk picking a NULL inode out
of dentry about to become positive, *then* checking its flags and
seeing that it's not negative anymore and using (already stale by
then) value they'd fetched earlier.  Usually ends up oopsing soon
after that...

Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-27 19:23:16 -05:00
Al Viro
c80567c82a do_last(): don't let a bogus return value from ->open() et.al. to confuse us
... into returning a positive to path_openat(), which would interpret that
as "symlink had been encountered" and proceed to corrupt memory, etc.
It can only happen due to a bug in some ->open() instance or in some LSM
hook, etc., so we report any such event *and* make sure it doesn't trick
us into further unpleasantness.

Cc: stable@vger.kernel.org # v3.6+, at least
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-27 19:17:33 -05:00
Christoph Hellwig
0fcbf996d8 fs: return -EOPNOTSUPP if clone is not supported
-EBADF is a rather confusing error if an operations is not supported,
and nfsd gets rather upset about it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-27 19:15:51 -05:00
Mikulas Patocka
b6853f78e7 hpfs: don't truncate the file when delete fails
The delete opration can allocate additional space on the HPFS filesystem
due to btree split. The HPFS driver checks in advance if there is
available space, so that it won't corrupt the btree if we run out of space
during splitting.

If there is not enough available space, the HPFS driver attempted to
truncate the file, but this results in a deadlock since the commit
7dd29d8d86 ("HPFS: Introduce a global mutex
and lock it on every callback from VFS").

This patch removes the code that tries to truncate the file and -ENOSPC is
returned instead. If the user hits -ENOSPC on delete, he should try to
delete other files (that are stored in a leaf btree node), so that the
delete operation will make some space for deleting the file stored in
non-leaf btree node.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: stable@vger.kernel.org	# 2.6.39+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-27 19:15:51 -05:00
Linus Torvalds
691429e13d Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "10 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  dax: move writeback calls into the filesystems
  dax: give DAX clearing code correct bdev
  ext4: online defrag not supported with DAX
  ext2, ext4: only set S_DAX for regular inodes
  block: disable block device DAX by default
  ocfs2: unlock inode if deleting inode from orphan fails
  mm: ASLR: use get_random_long()
  drivers: char: random: add get_random_long()
  mm: numa: quickly fail allocations for NUMA balancing on full nodes
  mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED
2016-02-27 12:46:16 -08:00
Linus Torvalds
1c271479b5 This fixes a file system corruption bug with DAX
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJW0fNWAAoJEPL5WVaVDYGjbroIAJKZBmlMh5DADLWIUNqeo6y+
 U8O0hpPUyoYb/j0wTVBe5z4cyfWjl1idrA4ZIb2VgMB28F8pPxuLifTVMx0kLeO9
 B1rcqn7CTzwmU9nj6yjcBkYp/spR8lBzaHq2REm3lE9Jwf6NdD4uwhzPiNmL3+xR
 dcg7lFzS6PSmLYD3mhb/lD5/3D3sYDlZ4nmX7uEl5WgxYaB1j5zsBVzYDU2Q0jZZ
 s+r/kj1eL8i9EnZ4zgZ4Bvtjm0jy5iVhO2YvLNQUZDEgmvJpNbVSBv/wAWoe9N3U
 rnm65s8F5hRbc2c8w4M43074uuEA4p0zZwR2z1E6RZvFZsl4Z5kk0/YEmxF7N6g=
 =IALP
 -----END PGP SIGNATURE-----

Merge tag 'tags/ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext2/4 DAX fix from Ted Ts'o:
 "This fixes a file system corruption bug with DAX"

* tag 'tags/ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext2, ext4: fix issue with missing journal entry in ext4_dax_mkwrite()
2016-02-27 12:40:49 -08:00
Ross Zwisler
1e9d180ba3 ext2, ext4: fix issue with missing journal entry in ext4_dax_mkwrite()
As it is currently written ext4_dax_mkwrite() assumes that the call into
__dax_mkwrite() will not have to do a block allocation so it doesn't create
a journal entry.  For a read that creates a zero page to cover a hole
followed by a write that actually allocates storage this is incorrect.  The
ext4_dax_mkwrite() -> __dax_mkwrite() -> __dax_fault() path calls
get_blocks() to allocate storage.

Fix this by having the ->page_mkwrite fault handler call ext4_dax_fault()
as this function already has all the logic needed to allocate a journal
entry and call __dax_fault().

Also update the ext2 fault handlers in this same way to remove duplicate
code and keep the logic between ext2 and ext4 the same.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-27 14:01:16 -05:00
Ross Zwisler
7f6d5b529b dax: move writeback calls into the filesystems
Previously calls to dax_writeback_mapping_range() for all DAX filesystems
(ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().

dax_writeback_mapping_range() needs a struct block_device, and it used
to get that from inode->i_sb->s_bdev.  This is correct for normal inodes
mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time files.

Instead, call dax_writeback_mapping_range() directly from the filesystem
->writepages function so that it can supply us with a valid block
device.  This also fixes DAX code to properly flush caches in response
to sync(2).

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 10:28:52 -08:00
Ross Zwisler
20a90f5899 dax: give DAX clearing code correct bdev
dax_clear_blocks() needs a valid struct block_device and previously it
was using inode->i_sb->s_bdev in all cases.  This is correct for normal
inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
DAX raw block devices and for XFS real-time devices.

Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
its arguments to take a bdev and a sector instead of an inode and a
block.  This better reflects what the function does, and it allows the
filesystem and raw block device code to pass in an appropriate struct
block_device.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 10:28:52 -08:00
Ross Zwisler
73f34a5e2c ext4: online defrag not supported with DAX
Online defrag operations for ext4 are hard coded to use the page cache.
See ext4_ioctl() -> ext4_move_extents() -> move_extent_per_page()

When combined with DAX I/O, which circumvents the page cache, this can
result in data corruption.  This was observed with xfstests ext4/307 and
ext4/308.

Fix this by only allowing online defrag for non-DAX files.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 10:28:52 -08:00
Ross Zwisler
0a6cf9137d ext2, ext4: only set S_DAX for regular inodes
When S_DAX is set on an inode we assume that if there are pages attached
to the mapping (mapping->nrpages != 0), those pages are clean zero pages
that were used to service reads from holes.  Any dirty data associated
with the inode should be in the form of DAX exceptional entries
(mapping->nrexceptional) that is written back via
dax_writeback_mapping_range().

With the current code, though, this isn't always true.  For example,
ext2 and ext4 directory inodes can have S_DAX set, but have their dirty
data stored as dirty page cache entries.  For these types of inodes,
having S_DAX set doesn't really make sense since their I/O doesn't
actually happen through the DAX code path.

Instead, only allow S_DAX to be set for regular inodes for ext2 and
ext4.  This allows us to have strict DAX vs non-DAX paths in the
writeback code.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 10:28:52 -08:00
Dan Williams
03cdadb040 block: disable block device DAX by default
The recent *sync enabling discovered that we are inserting into the
block_device pagecache counter to the expectations of the dirty data
tracking for dax mappings.  This can lead to data corruption.

We want to support DAX for block devices eventually, but it requires
wider changes to properly manage the pagecache.

   dump_stack+0x85/0xc2
   dax_writeback_mapping_range+0x60/0xe0
   blkdev_writepages+0x3f/0x50
   do_writepages+0x21/0x30
   __filemap_fdatawrite_range+0xc6/0x100
   filemap_write_and_wait+0x4a/0xa0
   set_blocksize+0x70/0xd0
   sb_set_blocksize+0x1d/0x50
   ext4_fill_super+0x75b/0x3360
   mount_bdev+0x180/0x1b0
   ext4_mount+0x15/0x20
   mount_fs+0x38/0x170

Mark the support broken so its disabled by default, but otherwise still
available for testing.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 10:28:52 -08:00
Guozhonghua
a4a8481ff6 ocfs2: unlock inode if deleting inode from orphan fails
When doing append direct io cleanup, if deleting inode fails, it goes
out without unlocking inode, which will cause the inode deadlock.

This issue was introduced by commit cf1776a9e8 ("ocfs2: fix a tiny
race when truncate dio orohaned entry").

Signed-off-by: Guozhonghua <guozhonghua@h3c.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>	[4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 10:28:52 -08:00
Daniel Cashman
5ef11c35ce mm: ASLR: use get_random_long()
Replace calls to get_random_int() followed by a cast to (unsigned long)
with calls to get_random_long().  Also address shifting bug which, in
case of x86 removed entropy mask for mmap_rnd_bits values > 31 bits.

Signed-off-by: Daniel Cashman <dcashman@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Kralevich <nnk@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 10:28:52 -08:00
Deepa Dinamani
b1f1a29d8f configfs: Replace CURRENT_TIME by current_fs_time()
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_fs_time() instead.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-02-26 10:42:35 +01:00
David Woodhouse
be629c62a6 Fix directory hardlinks from deleted directories
When a directory is deleted, we don't take too much care about killing off
all the dirents that belong to it — on the basis that on remount, the scan
will conclude that the directory is dead anyway.

This doesn't work though, when the deleted directory contained a child
directory which was moved *out*. In the early stages of the fs build
we can then end up with an apparent hard link, with the child directory
appearing both in its true location, and as a child of the original
directory which are this stage of the mount process we don't *yet* know
is defunct.

To resolve this, take out the early special-casing of the "directories
shall not have hard links" rule in jffs2_build_inode_pass1(), and let the
normal nlink processing happen for directories as well as other inodes.

Then later in the build process we can set ic->pino_nlink to the parent
inode#, as is required for directories during normal operaton, instead
of the nlink. And complain only *then* about hard links which are still
in evidence even after killing off all the unreachable paths.

Reported-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@vger.kernel.org
2016-02-25 11:11:28 +00:00
David Woodhouse
49e91e7079 jffs2: Fix page lock / f->sem deadlock
With this fix, all code paths should now be obtaining the page lock before
f->sem.

Reported-by: Szabó Tamás <sztomi89@gmail.com>
Tested-by: Thomas Betker <thomas.betker@rohde-schwarz.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@vger.kernel.org
2016-02-25 11:11:26 +00:00
Thomas Betker
157078f64b Revert "jffs2: Fix lock acquisition order bug in jffs2_write_begin"
This reverts commit 5ffd3412ae
("jffs2: Fix lock acquisition order bug in jffs2_write_begin").

The commit modified jffs2_write_begin() to remove a deadlock with
jffs2_garbage_collect_live(), but this introduced new deadlocks found
by multiple users. page_lock() actually has to be called before
mutex_lock(&c->alloc_sem) or mutex_lock(&f->sem) because
jffs2_write_end() and jffs2_readpage() are called with the page locked,
and they acquire c->alloc_sem and f->sem, resp.

In other words, the lock order in jffs2_write_begin() was correct, and
it is the jffs2_garbage_collect_live() path that has to be changed.

Revert the commit to get rid of the new deadlocks, and to clear the way
for a better fix of the original deadlock.

Reported-by: Deng Chao <deng.chao1@zte.com.cn>
Reported-by: Ming Liu <liu.ming50@gmail.com>
Reported-by: wangzaiwei <wangzaiwei@top-vision.cn>
Signed-off-by: Thomas Betker <thomas.betker@rohde-schwarz.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@vger.kernel.org
2016-02-25 11:11:25 +00:00
Linus Torvalds
aa263c43fe Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Assorted fixes - xattr one from this cycle, the rest - stable fodder"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs/pnode.c: treat zero mnt_group_id-s as unequal
  affs_do_readpage_ofs(): just use kmap_atomic() around memcpy()
  xattr handlers: plug a lock leak in simple_xattr_list
  fs: allow no_seek_end_llseek to actually seek
2016-02-24 14:00:26 -08:00
Linus Torvalds
420eb6d7ef NFS client bugfixes for Linux 4.5
Stable bugfixes:
 - Fix nfs_size_to_loff_t
 - NFSv4: Fix a dentry leak on alias use
 
 Other bugfixes:
 - Don't schedule a layoutreturn if the layout segment can be freed immediately.
 - Always set NFS_LAYOUT_RETURN_REQUESTED with lo->plh_return_iomode
 - rpcrdma_bc_receive_call() should init rq_private_buf.len
 - fix stateid handling for the NFS v4.2 operations
 - pnfs/blocklayout: fix a memeory leak when using,vmalloc_to_page
 - fix panic in gss_pipe_downcall() in fips mode
 - Fix a race between layoutget and pnfs_destroy_layout
 - Fix a race between layoutget and bulk recalls
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWzKShAAoJEGcL54qWCgDyN0QQALiX8v2wvn07vE5ZeXB5uONq
 +mfx8avhEoc3NVrpG6F4Kj+yJmHeAbkgIygnhZn4tcM/2YRxGDwlVLHb++yUTHO9
 8zEi+tiKx9f5pK2PxRQ0PjavVxO/xOyO0/QNrUdnj8hSNR9ow+YOVjEYUulbuhIg
 VAI3oSy5qIKgtDyW7w5PuPpTXLo74hPmyqHaa+ZIr2et//nJMSsw++vAmSg3oqXq
 6QkLWPHt/8yvDRRn2hKkbD9gOrFCVfaZIGLM6Q0zRWAcGTzJi94ELzPdm8cVpD1o
 eXKcufgLXPt3GOeAmxZ9kwQeebR6IFcvkYom5dsPhtMBuzXu1wpanU8PGgYIQ0VA
 88b2YNl+TZpiVbRzxSEellZq5b+zapH/VVVnYptZiq9wUTACc7jK6W2heqe5PzaT
 iepTGCAE21tV5JewcITMQHDZiOjRNdtbBzgixI7pNfMN8whU6e5NHYj6psZqT7cf
 xEEZzL+RBJuCFKhXSPbBefccA4HCRkDEpT+2QgrMbS4KKfWOg36UNbJ2kgbvcRVi
 HTqoRONR6zMzYBhyMlLaUuJ1co8nSHgEsL81Q3MwWSY6gucSW7jeJ2stR20KJIo1
 7qgod9Ac/BAIozjzywi0LtmxouPyPU8cqaboMhSRVPDKfFlqZBNBkFLNWwgoYXMa
 r1afZQwNeRRbZUR3RulE
 =/WDS
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.5-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Stable bugfixes:
   - Fix nfs_size_to_loff_t
   - NFSv4: Fix a dentry leak on alias use

  Other bugfixes:
   - Don't schedule a layoutreturn if the layout segment can be freed
     immediately.
   - Always set NFS_LAYOUT_RETURN_REQUESTED with lo->plh_return_iomode
   - rpcrdma_bc_receive_call() should init rq_private_buf.len
   - fix stateid handling for the NFS v4.2 operations
   - pnfs/blocklayout: fix a memeory leak when using,vmalloc_to_page
   - fix panic in gss_pipe_downcall() in fips mode
   - Fix a race between layoutget and pnfs_destroy_layout
   - Fix a race between layoutget and bulk recalls"

* tag 'nfs-for-4.5-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4.x/pnfs: Fix a race between layoutget and bulk recalls
  NFSv4.x/pnfs: Fix a race between layoutget and pnfs_destroy_layout
  auth_gss: fix panic in gss_pipe_downcall() in fips mode
  pnfs/blocklayout: fix a memeory leak when using,vmalloc_to_page
  nfs4: fix stateid handling for the NFS v4.2 operations
  NFSv4: Fix a dentry leak on alias use
  xprtrdma: rpcrdma_bc_receive_call() should init rq_private_buf.len
  pNFS: Always set NFS_LAYOUT_RETURN_REQUESTED with lo->plh_return_iomode
  pNFS: Fix pnfs_mark_matching_lsegs_return()
  nfs: fix nfs_size_to_loff_t
2016-02-23 16:39:21 -08:00
Jan Kara
cb0d9d47a3 jbd2: save some atomic ops in __JI_COMMIT_RUNNING handling
Currently we used atomic bit operations to manipulate
__JI_COMMIT_RUNNING bit. However this is unnecessary as i_flags are
always written and read under j_list_lock. So just change the operations
to standard bit operations.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 23:20:30 -05:00
Jan Kara
1101cd4d13 jbd2: unify revoke and tag block checksum handling
Revoke and tag descriptor blocks are just different kinds of descriptor
blocks and thus have checksum in the same place. Unify computation and
checking of checksums for these.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 23:19:09 -05:00
Jan Kara
32ab671599 jbd2: factor out common descriptor block initialization
Descriptor block header is initialized in several places. Factor out the
common code into jbd2_journal_get_descriptor_buffer().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 23:17:15 -05:00
Jan Kara
9bcf976cb8 jbd2: remove unnecessary arguments of jbd2_journal_write_revoke_records
jbd2_journal_write_revoke_records() takes journal pointer and write_op,
although journal can be obtained from the passed transaction and
write_op is always WRITE_SYNC. Remove these superfluous arguments.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 23:07:30 -05:00
Eric Whitney
29c6eaffc8 ext4: trim unused parameter from convert_initialized_extent()
The flags parameter is also unused.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 22:58:55 -05:00
Andreas Gruenbacher
6048c64b26 mbcache: add reusable flag to cache entries
To reduce amount of damage caused by single bad block, we limit number
of inodes sharing an xattr block to 1024. Thus there can be more xattr
blocks with the same contents when there are lots of files with the same
extended attributes. These xattr blocks naturally result in hash
collisions and can form long hash chains and we unnecessarily check each
such block only to find out we cannot use it because it is already
shared by too many inodes.

Add a reusable flag to cache entries which is cleared when a cache entry
has reached its maximum refcount.  Cache entries which are not marked
reusable are skipped by mb_cache_entry_find_{first,next}. This
significantly speeds up mbcache when there are many same xattr blocks.
For example for xattr-bench with 5 values and each process handling
20000 files, the run for 64 processes is 25x faster with this patch.
Even for 8 processes the speedup is almost 3x. We have also verified
that for situations where there is only one xattr block of each kind,
the patch doesn't have a measurable cost.

[JK: Remove handling of setting the same value since it is not needed
anymore, check for races in e_reusable setting, improve changelog,
add measurements]

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 22:44:04 -05:00
Jan Kara
3fd164629d ext4: shortcut setting of xattr to the same value
When someone tried to set xattr to the same value (i.e., not changing
anything) we did all the work of removing original xattr, possibly
breaking references to shared xattr block, inserting new xattr, and
merging xattr blocks again. Since this is not so rare operation and it
is relatively cheap for us to detect this case, check for this and
shortcut xattr setting in that case.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 22:43:04 -05:00
Andreas Gruenbacher
dc8d5e565f mbcache: get rid of _e_hash_list_head
Get rid of field _e_hash_list_head in cache entries and add bit field
e_referenced instead.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 22:42:05 -05:00
Andreas Gruenbacher
2335d05f3a ext4: kill ext4_mballoc_ready
This variable, introduced in commit 9c191f70, is unnecessary: it is set
once the module has been initialized correctly, and ext4_fill_super
cannot run unless the module has been initialized correctly.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 22:41:05 -05:00
Jan Kara
7a2508e1b6 mbcache2: rename to mbcache
Since old mbcache code is gone, let's rename new code to mbcache since
number 2 is now meaningless. This is just a mechanical replacement.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 22:35:22 -05:00
Jan Kara
f0c8b46238 mbcache2: Use referenced bit instead of LRU
Currently we maintain perfect LRU list by moving entry to the tail of
the list when it gets used. However these operations on cache-global
list are relatively expensive.

In this patch we switch to lazy updates of LRU list. Whenever entry gets
used, we set a referenced bit in it. When reclaiming entries, we give
referenced entries another round in the LRU. Since the list is not a
real LRU anymore, rename it to just 'list'.

In my testing this logic gives about 30% boost to workloads with mostly
unique xattr blocks (e.g. xattr-bench with 10 files and 10000 unique
xattr values).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 18:23:47 -05:00
Al Viro
97c3160607 ecryptfs_encrypt_and_encode_filename(): drop unused argument
the last time it was getting something other than NULL as
crypt_stat had been back in 2009...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-22 18:14:25 -05:00
Al Viro
6b719e5309 ecryptfs_lookup(): use lookup_one_len_unlocked()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-22 17:48:19 -05:00
Trond Myklebust
9fd4b9fc76 NFSv4.x/pnfs: Fix a race between layoutget and bulk recalls
Replace another case where the layout 'plh_block_lgets' can trigger
infinite loops in send_layoutget().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-22 17:46:34 -05:00
Trond Myklebust
2454dfea0a NFSv4.x/pnfs: Fix a race between layoutget and pnfs_destroy_layout
If the server reboots while there is a layoutget outstanding, then
the call to pnfs_choose_layoutget_stateid() will fail with an EAGAIN
error, which causes an infinite loop in send_layoutget(). The reason
why we never break out of the loop is that the layout 'plh_block_lgets'
field is never cleared.

Fix is to replace plh_block_lgets with NFS_LAYOUT_INVALID_STID, which
can be reset after a new layoutget.

Fixes: ab7d763e47 ("pNFS: Ensure nfs4_layoutget_prepare returns...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-22 17:34:59 -05:00
Bob Peterson
b81171cb68 DLM: Save and restore socket callbacks properly
This patch fixes the problems with patch b3a5bbfd7.

1. It removes a return statement from lowcomms_error_report
   because it needs to call the original error report in all paths
   through the function.
2. All socket callbacks are saved and restored, not just the
   sk_error_report, and that's done so with proper locking like
   sunrpc does.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2016-02-22 14:02:17 -06:00
Bob Peterson
1a31833d08 DLM: Replace nodeid_to_addr with kernel_getpeername
This patch replaces the call to nodeid_to_addr with a call to
kernel_getpeername. This avoids taking a spinlock because it may
potentially be called from a softirq context.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2016-02-22 14:02:11 -06:00
Jan Kara
c2f3140fe2 mbcache2: limit cache size
So far number of entries in mbcache is limited only by the pressure from
the shrinker. Since too many entries degrade the hash table and
generally we expect that caching more entries has diminishing returns,
limit number of entries the same way as in the old mbcache to 16 * hash
table size.

Once we exceed the desired maximum number of entries, we schedule a
backround work to reclaim entries. If the background work cannot keep up
and the number of entries exceeds two times the desired maximum, we
reclaim some entries directly when allocating a new entry.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 12:33:03 -05:00
Jan Kara
ecd1e64412 mbcache: remove mbcache
Both ext2 and ext4 are now converted to mbcache2. Remove the old mbcache
code.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 12:21:14 -05:00
Jan Kara
be0726d33c ext2: convert to mbcache2
The conversion is generally straightforward. We convert filesystem from
a global cache to per-fs one. Similarly to ext4 the tricky part is that
xattr block corresponding to found mbcache entry can get freed before we
get buffer lock for that block. So we have to check whether the entry is
still valid after getting the buffer lock.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 11:56:38 -05:00
Jan Kara
82939d7999 ext4: convert to mbcache2
The conversion is generally straightforward. The only tricky part is
that xattr block corresponding to found mbcache entry can get freed
before we get buffer lock for that block. So we have to check whether
the entry is still valid after getting buffer lock.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 11:50:13 -05:00
Jan Kara
f9a61eb4e2 mbcache2: reimplement mbcache
Original mbcache was designed to have more features than what ext?
filesystems ended up using. It supported entry being in more hashes, it
had a home-grown rwlocking of each entry, and one cache could cache
entries from multiple filesystems. This genericity also resulted in more
complex locking, larger cache entries, and generally more code
complexity.

This is reimplementation of the mbcache functionality to exactly fit the
purpose ext? filesystems use it for. Cache entries are now considerably
smaller (7 instead of 13 longs), the code is considerably smaller as
well (414 vs 913 lines of code), and IMO also simpler. The new code is
also much more lightweight.

I have measured the speed using artificial xattr-bench benchmark, which
spawns P processes, each process sets xattr for F different files, and
the value of xattr is randomly chosen from a pool of V values. Averages
of runtimes for 5 runs for various combinations of parameters are below.
The first value in each cell is old mbache, the second value is the new
mbcache.

V=10
F\P	1		2		4		8		16		32		64
10	0.158,0.157	0.208,0.196	0.500,0.277	0.798,0.400	3.258,0.584	13.807,1.047	61.339,2.803
100	0.172,0.167	0.279,0.222	0.520,0.275	0.825,0.341	2.981,0.505	12.022,1.202	44.641,2.943
1000	0.185,0.174	0.297,0.239	0.445,0.283	0.767,0.340	2.329,0.480	6.342,1.198	16.440,3.888

V=100
F\P	1		2		4		8		16		32		64
10	0.162,0.153	0.200,0.186	0.362,0.257	0.671,0.496	1.433,0.943	3.801,1.345	7.938,2.501
100	0.153,0.160	0.221,0.199	0.404,0.264	0.945,0.379	1.556,0.485	3.761,1.156	7.901,2.484
1000	0.215,0.191	0.303,0.246	0.471,0.288	0.960,0.347	1.647,0.479	3.916,1.176	8.058,3.160

V=1000
F\P	1		2		4		8		16		32		64
10	0.151,0.129	0.210,0.163	0.326,0.245	0.685,0.521	1.284,0.859	3.087,2.251	6.451,4.801
100	0.154,0.153	0.211,0.191	0.276,0.282	0.687,0.506	1.202,0.877	3.259,1.954	8.738,2.887
1000	0.145,0.179	0.202,0.222	0.449,0.319	0.899,0.333	1.577,0.524	4.221,1.240	9.782,3.579

V=10000
F\P	1		2		4		8		16		32		64
10	0.161,0.154	0.198,0.190	0.296,0.256	0.662,0.480	1.192,0.818	2.989,2.200	6.362,4.746
100	0.176,0.174	0.236,0.203	0.326,0.255	0.696,0.511	1.183,0.855	4.205,3.444	19.510,17.760
1000	0.199,0.183	0.240,0.227	1.159,1.014	2.286,2.154	6.023,6.039	---,10.933	---,36.620

V=100000
F\P	1		2		4		8		16		32		64
10	0.171,0.162	0.204,0.198	0.285,0.230	0.692,0.500	1.225,0.881	2.990,2.243	6.379,4.771
100	0.151,0.171	0.220,0.210	0.295,0.255	0.720,0.518	1.226,0.844	3.423,2.831	19.234,17.544
1000	0.192,0.189	0.249,0.225	1.162,1.043	2.257,2.093	5.853,4.997	---,10.399	---,32.198

We see that the new code is faster in pretty much all the cases and
starting from 4 processes there are significant gains with the new code
resulting in upto 20-times shorter runtimes. Also for large numbers of
cached entries all values for the old code could not be measured as the
kernel started hitting softlockups and died before the test completed.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 11:49:09 -05:00
Eryu Guan
87f9a031af ext4: iterate over buffer heads correctly in move_extent_per_page()
In commit bcff24887d ("ext4: don't read blocks from disk after extents
being swapped") bh is not updated correctly in the for loop and wrong
data has been written to disk. generic/324 catches this on sub-page
block size ext4.

Fixes: bcff24887d ("ext4: don't read blocks from disk after extentsbeing swapped")
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-02-21 18:38:44 -05:00
Daeho Jeong
f96c450dab ext4: make sure to revoke all the freeable blocks in ext4_free_blocks
Now, ext4_free_blocks() doesn't revoke data blocks of per-file data
journalled inode and it can cause file data inconsistency problems.
Even though data blocks of per-file data journalled inode are already
forgotten by jbd2_journal_invalidatepage() in advance of invoking
ext4_free_blocks(), we still need to revoke the data blocks here.
Moreover some of the metadata blocks, which are not found by
sb_find_get_block(), are still needed to be revoked, but this is also
missing here.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-02-21 18:31:41 -05:00
Mimi Zohar
b844f0ecbc vfs: define kernel_copy_file_from_fd()
This patch defines kernel_read_file_from_fd(), a wrapper for the VFS
common kernel_read_file().

Changelog:
- Separated from the kernel modules patch
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
2016-02-21 09:06:10 -05:00
Mimi Zohar
39eeb4fb97 security: define kernel_read_file hook
The kernel_read_file security hook is called prior to reading the file
into memory.

Changelog v4+:
- export security_kernel_read_file()

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
2016-02-21 09:06:09 -05:00
Mimi Zohar
09596b94f7 vfs: define kernel_read_file_from_path
This patch defines kernel_read_file_from_path(), a wrapper for the VFS
common kernel_read_file().

Changelog:
- revert error msg regression - reported by Sergey Senozhatsky
- Separated from the IMA patch

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
2016-02-21 08:55:00 -05:00
Linus Torvalds
0389075ecf Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "This is unusually large, partly due to the EFI fixes that prevent
  accidental deletion of EFI variables through efivarfs that may brick
  machines.  These fixes are somewhat involved to maintain compatibility
  with existing install methods and other usage modes, while trying to
  turn off the 'rm -rf' bricking vector.

  Other fixes are for large page ioremap()s and for non-temporal
  user-memcpy()s"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Fix vmalloc_fault() to handle large pages properly
  hpet: Drop stale URLs
  x86/uaccess/64: Handle the caching of 4-byte nocache copies properly in __copy_user_nocache()
  x86/uaccess/64: Make the __copy_user_nocache() assembly code more readable
  lib/ucs2_string: Correct ucs2 -> utf8 conversion
  efi: Add pstore variables to the deletion whitelist
  efi: Make efivarfs entries immutable by default
  efi: Make our variable validation list include the guid
  efi: Do variable name validation tests in utf8
  efi: Use ucs2_as_utf8 in efivarfs instead of open coding a bad version
  lib/ucs2_string: Add ucs2 -> utf8 helper functions
2016-02-20 09:32:40 -08:00
Maxim Patlasov
7ae8fd0351 fs/pnode.c: treat zero mnt_group_id-s as unequal
propagate_one(m) calculates "type" argument for copy_tree() like this:

>    if (m->mnt_group_id == last_dest->mnt_group_id) {
>        type = CL_MAKE_SHARED;
>    } else {
>        type = CL_SLAVE;
>        if (IS_MNT_SHARED(m))
>           type |= CL_MAKE_SHARED;
>   }

The "type" argument then governs clone_mnt() behavior with respect to flags
and mnt_master of new mount. When we iterate through a slave group, it is
possible that both current "m" and "last_dest" are not shared (although,
both are slaves, i.e. have non-NULL mnt_master-s). Then the comparison
above erroneously makes new mount shared and sets its mnt_master to
last_source->mnt_master. The patch fixes the problem by handling zero
mnt_group_id-s as though they are unequal.

The similar problem exists in the implementation of "else" clause above
when we have to ascend upward in the master/slave tree by calling:

>    last_source = last_source->mnt_master;
>    last_dest = last_source->mnt_parent;

proper number of times. The last step is governed by
"n->mnt_group_id != last_dest->mnt_group_id" condition that may lie if
both are zero. The patch fixes this case in the same way as the former one.

[AV: don't open-code an obvious helper...]

Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-20 00:15:52 -05:00
Al Viro
0bacbe528e affs_do_readpage_ofs(): just use kmap_atomic() around memcpy()
It forgets kunmap() on a failure exit, but there's really no point keeping
the page kmapped at all - after all, what we are doing is a bunch of memcpy()
into the parts of page, so kmap_atomic()/kunmap_atomic() just around those
memcpy() is enough.

Spotted-by: Insu Yun <wuninsu@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-20 00:15:51 -05:00
Mateusz Guzik
0e9a7da51b xattr handlers: plug a lock leak in simple_xattr_list
The code could leak xattrs->lock on error.

Problem introduced with 786534b92f "tmpfs: listxattr should
include POSIX ACL xattrs".

Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-20 00:15:51 -05:00
Wouter van Kesteren
2feb55f890 fs: allow no_seek_end_llseek to actually seek
The user-visible impact of the issue is for example that without this
patch sensors-detect breaks when trying to seek in /dev/cpu/0/cpuid.

'~0ULL' is a 'unsigned long long' that when converted to a loff_t,
which is signed, gets turned into -1. later in vfs_setpos we have
'if (offset > maxsize)', which makes it always return EINVAL.

Fixes: b25472f9b9 ("new helpers: no_seek_end_llseek{,_size}()")
Signed-off-by: Wouter van Kesteren <woutershep@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-20 00:15:50 -05:00
Linus Torvalds
020ecbba05 Miscellaneous ext4 bug fixes for v4.5
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJWx2ZiAAoJEPL5WVaVDYGjrbcH/2EqCUDmW+FqVqR7PkpQsNiV
 WxBTNkxnVXf1Jin5beIUN/Ehq0GSuqcSujMwdbFUa0i7YJNVEe++hTw28JmFILYV
 5nZtTYmYIq7dZb/tnc3tj0SsDpgEE1h31VyWAu4W2q4wSQMDc8AqGM90VktgrerJ
 H9k/WDDL6KC8uXagBsQC0d5xaQglJNZC+S6pSBbMegBAFNJqAL5N78oWAoEFN3OH
 LN3B3eccxBx98rGWx8DBiugY8ZDRHB4Cre+fXu8wmAuMb/+Y7Mwj4RzI+fz5Vpiw
 vMS5RqZ7PvCaMhdyUWt9bI8j10bBXcaxOHL2UQND5A1zundJ1ZNOY/ZPvHVUS4s=
 =AXFu
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "Miscellaneous ext4 bug fixes for v4.5"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix crashes in dioread_nolock mode
  ext4: fix bh->b_state corruption
  ext4: fix memleak in ext4_readdir()
  ext4: remove unused parameter "newblock" in convert_initialized_extent()
  ext4: don't read blocks from disk after extents being swapped
  ext4: fix potential integer overflow
  ext4: add a line break for proc mb_groups display
  ext4: ioctl: fix erroneous return value
  ext4: fix scheduling in atomic on group checksum failure
  ext4 crypto: move context consistency check to ext4_file_open()
  ext4 crypto: revalidate dentry after adding or removing the key
2016-02-19 13:44:12 -08:00
Linus Torvalds
ce6b71432d Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "My for-linus-4.5 branch has a btrfs DIO error passing fix.

  I know how much you love DIO, so I'm going to suggest against reading
  it.  We'll follow up with a patch to drop the error arg from
  dio_end_io in the next merge window."

* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix direct IO requests not reporting IO error to user space
2016-02-19 13:40:42 -08:00
Linus Torvalds
87d9ac712b Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "10 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: slab: free kmem_cache_node after destroy sysfs file
  ipc/shm: handle removed segments gracefully in shm_mmap()
  MAINTAINERS: update Kselftest Framework mailing list
  devm_memremap_release(): fix memremap'd addr handling
  mm/hugetlb.c: fix incorrect proc nr_hugepages value
  mm, x86: fix pte_page() crash in gup_pte_range()
  fsnotify: turn fsnotify reaper thread into a workqueue job
  Revert "fsnotify: destroy marks with call_srcu instead of dedicated thread"
  mm: fix regression in remap_file_pages() emulation
  thp, dax: do not try to withdraw pgtable from non-anon VMA
2016-02-19 13:36:00 -08:00
Jan Kara
74dae42785 ext4: fix crashes in dioread_nolock mode
Competing overwrite DIO in dioread_nolock mode will just overwrite
pointer to io_end in the inode. This may result in data corruption or
extent conversion happening from IO completion interrupt because we
don't properly set buffer_defer_completion() when unlocked DIO races
with locked DIO to unwritten extent.

Since unlocked DIO doesn't need io_end for anything, just avoid
allocating it and corrupting pointer from inode for locked DIO.
A cleaner fix would be to avoid these games with io_end pointer from the
inode but that requires more intrusive changes so we leave that for
later.

Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-19 00:33:21 -05:00
Jan Kara
ed8ad83808 ext4: fix bh->b_state corruption
ext4 can update bh->b_state non-atomically in _ext4_get_block() and
ext4_da_get_block_prep(). Usually this is fine since bh is just a
temporary storage for mapping information on stack but in some cases it
can be fully living bh attached to a page. In such case non-atomic
update of bh->b_state can race with an atomic update which then gets
lost. Usually when we are mapping bh and thus updating bh->b_state
non-atomically, nobody else touches the bh and so things work out fine
but there is one case to especially worry about: ext4_finish_bio() uses
BH_Uptodate_Lock on the first bh in the page to synchronize handling of
PageWriteback state. So when blocksize < pagesize, we can be atomically
modifying bh->b_state of a buffer that actually isn't under IO and thus
can race e.g. with delalloc trying to map that buffer. The result is
that we can mistakenly set / clear BH_Uptodate_Lock bit resulting in the
corruption of PageWriteback state or missed unlock of BH_Uptodate_Lock.

Fix the problem by always updating bh->b_state bits atomically.

CC: stable@vger.kernel.org
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-19 00:18:25 -05:00
Jeff Layton
0918f1c309 fsnotify: turn fsnotify reaper thread into a workqueue job
We don't require a dedicated thread for fsnotify cleanup.  Switch it
over to a workqueue job instead that runs on the system_unbound_wq.

In the interest of not thrashing the queued job too often when there are
a lot of marks being removed, we delay the reaper job slightly when
queueing it, to allow several to gather on the list.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Tested-by: Eryu Guan <guaneryu@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18 16:23:24 -08:00
Jeff Layton
13d34ac6e5 Revert "fsnotify: destroy marks with call_srcu instead of dedicated thread"
This reverts commit c510eff6be ("fsnotify: destroy marks with
call_srcu instead of dedicated thread").

Eryu reported that he was seeing some OOM kills kick in when running a
testcase that adds and removes inotify marks on a file in a tight loop.

The above commit changed the code to use call_srcu to clean up the
marks.  While that does (in principle) work, the srcu callback job is
limited to cleaning up entries in small batches and only once per jiffy.
It's easily possible to overwhelm that machinery with too many call_srcu
callbacks, and Eryu's reproduer did just that.

There's also another potential problem with using call_srcu here.  While
you can obviously sleep while holding the srcu_read_lock, the callbacks
run under local_bh_disable, so you can't sleep there.

It's possible when putting the last reference to the fsnotify_mark that
we'll end up putting a chain of references including the fsnotify_group,
uid, and associated keys.  While I don't see any obvious ways that that
could occurs, it's probably still best to avoid using call_srcu here
after all.

This patch reverts the above patch.  A later patch will take a different
approach to eliminated the dedicated thread here.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Reported-by: Eryu Guan <guaneryu@gmail.com>
Tested-by: Eryu Guan <guaneryu@gmail.com>
Cc: Jan Kara <jack@suse.com>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18 16:23:24 -08:00
Mimi Zohar
bc8ca5b92d vfs: define kernel_read_file_id enumeration
To differentiate between the kernel_read_file() callers, this patch
defines a new enumeration named kernel_read_file_id and includes the
caller identifier as an argument.

Subsequent patches define READING_KEXEC_IMAGE, READING_KEXEC_INITRAMFS,
READING_FIRMWARE, READING_MODULE, and READING_POLICY.

Changelog v3:
- Replace the IMA specific enumeration with a generic one.

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
2016-02-18 17:14:04 -05:00
Mimi Zohar
b44a7dfc6f vfs: define a generic function to read a file from the kernel
For a while it was looked down upon to directly read files from Linux.
These days there exists a few mechanisms in the kernel that do just
this though to load a file into a local buffer.  There are minor but
important checks differences on each.  This patch set is the first
attempt at resolving some of these differences.

This patch introduces a common function for reading files from the kernel
with the corresponding security post-read hook and function.

Changelog v4+:
- export security_kernel_post_read_file() - Fengguang Wu
v3:
- additional bounds checking - Luis
v2:
- To simplify patch review, re-ordered patches

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Reviewed-by: Luis R. Rodriguez <mcgrof@suse.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
2016-02-18 17:14:03 -05:00
Ingo Molnar
3a2f2ac9b9 Merge branch 'x86/urgent' into x86/asm, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-18 09:28:03 +01:00
Linus Torvalds
2850713576 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A collection of fixes from the past few weeks that should go into 4.5.
  This contains:

   - Overflow fix for sysfs discard show function from Alan.

   - A stacking limit init fix for max_dev_sectors, so we don't end up
     artificially capping some use cases.  From Keith.

   - Have blk-mq proper end unstarted requests on a dying queue, instead
     of pushing that to the driver.  From Keith.

   - NVMe:
        - Update to Kconfig description for NVME_SCSI, since it was
          vague and having it on is important for some SUSE distros.
          From Christoph.
        - Set of fixes from Keith, around surprise removal. Also kills
          the no-merge flag, so it supports merging.

   - Set of fixes for lightnvm from Matias, Javier, and Wenwei.

   - Fix null_blk oops when asked for lightnvm, but not available.  From
     Matias.

   - Copy-to-user EINTR fix from Hannes, fixing a case where SG_IO fails
     if interrupted by a signal.

   - Two floppy fixes from Jiri, fixing signal handling and blocking
     open.

   - A use-after-free fix for O_DIRECT, from Mike Krinkin.

   - A block module ref count fix from Roman Pen.

   - An fs IO wait accounting fix for O_DSYNC from Stephane Gasparini.

   - Smaller reallo fix for xen-blkfront from Bob Liu.

   - Removal of an unused struct member in the deadline IO scheduler,
     from Tahsin.

   - Also from Tahsin, properly initialize inode struct members
     associated with cgroup writeback, if enabled.

   - From Tejun, ensure that we keep the superblock pinned during cgroup
     writeback"

* 'for-linus' of git://git.kernel.dk/linux-block: (25 commits)
  blk: fix overflow in queue_discard_max_hw_show
  writeback: initialize inode members that track writeback history
  writeback: keep superblock pinned during cgroup writeback association switches
  bio: return EINTR if copying to user space got interrupted
  NVMe: Rate limit nvme IO warnings
  NVMe: Poll device while still active during remove
  NVMe: Requeue requests on suspended queues
  NVMe: Allow request merges
  NVMe: Fix io incapable return values
  blk-mq: End unstarted requests on dying queue
  block: Initialize max_dev_sectors to 0
  null_blk: oops when initializing without lightnvm
  block: fix module reference leak on put_disk() call for cgroups throttle
  nvme: fix Kconfig description for BLK_DEV_NVME_SCSI
  kernel/fs: fix I/O wait not accounted for RW O_DSYNC
  floppy: refactor open() flags handling
  lightnvm: allow to force mm initialization
  lightnvm: check overflow and correct mlc pairs
  lightnvm: fix request intersection locking in rrpc
  lightnvm: warn if irqs are disabled in lock laddr
  ...
2016-02-17 11:59:23 -08:00
Kinglong Mee
c89757061a pnfs/blocklayout: fix a memeory leak when using,vmalloc_to_page
unreferenced object 0xffffc90000abf000 (size 16900):
  comm "fsync02", pid 15765, jiffies 4297431627 (age 423.772s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 a0 c2 19 00 88 ff ff  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8174d54e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811b9b91>] __vmalloc_node_range+0x231/0x280
    [<ffffffff811b9c2a>] __vmalloc+0x4a/0x50
    [<ffffffffa02c9ec1>] ext_tree_prepare_commit+0x231/0x2e0 [blocklayoutdriver]
    [<ffffffffa02c700e>] bl_prepare_layoutcommit+0xe/0x10 [blocklayoutdriver]
    [<ffffffffa0596a6c>] pnfs_layoutcommit_inode+0x29c/0x330 [nfsv4]
    [<ffffffffa0596b13>] pnfs_generic_sync+0x13/0x20 [nfsv4]
    [<ffffffffa0585188>] nfs4_file_fsync+0x58/0x150 [nfsv4]
    [<ffffffff81228e5b>] vfs_fsync_range+0x4b/0xb0
    [<ffffffff81228f1d>] do_fsync+0x3d/0x70
    [<ffffffff812291d0>] SyS_fsync+0x10/0x20
    [<ffffffff81757def>] entry_SYSCALL_64_fastpath+0x12/0x76
    [<ffffffffffffffff>] 0xffffffffffffffff

v2, add missing include header

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-17 11:44:45 -05:00
Christoph Hellwig
4bdf87ebda nfs4: fix stateid handling for the NFS v4.2 operations
The newly added NFS v4.2 operations (ALLOCATE, DEALLOCATE, SEEK and CLONE)
use a helper called nfs42_set_rw_stateid to select a stateid that is sent
to the server.  But they don't set the inode and state fields in the
nfs4_exception structure, and this don't partake in the stateid recovery
protocol.  Because of this they will simply return errors insted of trying
to recover a stateid when the server return a BAD_STATEID error.

Additionally CLONE has the problem that it operates on two files and thus
two stateids, and thus needs to call the exception handler twice to
recover stateids.

While we're at it stop grabbing an addititional reference to the open
context in all these operations - having the file open guarantees that
the open context won't go away.

All this can be produces with the generic/168 and generic/170 tests in
xfstests which stress the CLONE stateid handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-17 11:38:07 -05:00
Benjamin Coddington
d9dfd8d741 NFSv4: Fix a dentry leak on alias use
In the case where d_add_unique() finds an appropriate alias to use it will
have already incremented the reference count.  An additional dget() to swap
the open context's dentry is unnecessary and will leak a reference.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: 275bb30786 ("NFSv4: Move dentry instantiation into the NFSv4-...")
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-17 11:35:25 -05:00
Tahsin Erdogan
3d65ae4634 writeback: initialize inode members that track writeback history
inode struct members that track cgroup writeback information
should be reinitialized when inode gets allocated from
kmem_cache. Otherwise, their values remain and get used by the
new inode.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Fixes: d10c809552 ("writeback: implement foreign cgroup inode bdi_writeback switching")
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-16 14:57:21 -07:00
Linus Torvalds
65c23c65be Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
 "A small set of cifs fixes.

  I am still reviewing some more, recently submitted SMB3 fixes, but
  these three are small and safe and ready now"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix erroneous return value
  cifs: fix potential overflow in cifs_compose_mount_options
  cifs: remove redundant check for null string pointer
2016-02-16 10:52:59 -08:00
Tejun Heo
5ff8eaac16 writeback: keep superblock pinned during cgroup writeback association switches
If cgroup writeback is in use, an inode is associated with a cgroup
for writeback.  If the inode's main dirtier changes to another cgroup,
the association gets updated asynchronously.  Nothing was pinning the
superblock while such switches are in progress and superblock could go
away while async switching is pending or in progress leading to
crashes like the following.

 kernel BUG at fs/jbd2/transaction.c:319!
 invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
 CPU: 1 PID: 29158 Comm: kworker/1:10 Not tainted 4.5.0-rc3 #51
 Hardware name: Google Google, BIOS Google 01/01/2011
 Workqueue: events inode_switch_wbs_work_fn
 task: ffff880213dbbd40 ti: ffff880209264000 task.ti: ffff880209264000
 RIP: 0010:[<ffffffff803e6922>]  [<ffffffff803e6922>] start_this_handle+0x382/0x3e0
 RSP: 0018:ffff880209267c30  EFLAGS: 00010202
 ...
 Call Trace:
  [<ffffffff803e6be4>] jbd2__journal_start+0xf4/0x190
  [<ffffffff803cfc7e>] __ext4_journal_start_sb+0x4e/0x70
  [<ffffffff803b31ec>] ext4_evict_inode+0x12c/0x3d0
  [<ffffffff8035338b>] evict+0xbb/0x190
  [<ffffffff80354190>] iput+0x130/0x190
  [<ffffffff80360223>] inode_switch_wbs_work_fn+0x343/0x4c0
  [<ffffffff80279819>] process_one_work+0x129/0x300
  [<ffffffff80279b16>] worker_thread+0x126/0x480
  [<ffffffff8027ed14>] kthread+0xc4/0xe0
  [<ffffffff809771df>] ret_from_fork+0x3f/0x70

Fix it by bumping s_active while cgroup association switching is in
flight.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Tahsin Erdogan <tahsin@google.com>
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: d10c809552 ("writeback: implement foreign cgroup inode bdi_writeback switching")
Cc: stable@vger.kernel.org #v4.5+
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-16 11:34:07 -07:00
Ingo Molnar
4682c211a8 * Prevent accidental deletion of EFI variables through efivarfs that
may brick machines. We use a whitelist of known-safe variables to
    allow things like installing distributions to work out of the box, and
    instead restrict vendor-specific variable deletion by making
    non-whitelist variables immutable - Peter Jones
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWvbEYAAoJEC84WcCNIz1VatYP/1kkly4lIuSYmaQrvF9V/L75
 lYNHjEURT55EDq4VAHH/wey3SbDkwy3wBsmfkkJTV1zhA+SHSAG2k097xGyLP6Xr
 X+htIj//HH7U3SRWk66UiwkY/866sXCqVRN2vvjBxvP9Z/rTDKe7zRQdVVdCt80P
 88H/1Nxy1S8eDExMGCvq8TbtWCSKV6P8197rUqUMf37Sbqr7yBM/sYDitdwOiGTW
 gzLwJjWJgDsKw+BWaj5NNZzVAb1Dgof5oEL5WGCU7gJSis08i4cHoRiwutYk2g8f
 ZbMnKvlFmiHGbjriowyNPm+pgRVDbS8JvJtORA1qXuVJFPtqV7Wdvdh+jJpdYXLp
 bO8EB/yfc7PTH8ScbNbIcgmCknsItRh2SDNXxM/BY/dzaSkzVI/Wr6GauWKInQJ6
 IypOMijITmnJ5Sij0V4aMUTZWS5btZt15iqAg3xUqWT9DJ61bIER+eGEhV6hx+7S
 pSydQylbaVFpyswdCpJRsfxHfW5j0G9BxnKZGTh+LHeb6dXaughUq2EIdUNHWyEZ
 3geJPC3Mh50MngO8phIq+DzjA4K84JZ9j6M3O27+x3bfLAqiktZS6HiaTSmSGNyM
 95swhpyHREeLQqYUUTOWiz1rlQ9cW+Bmkhy7Wn3RBZ033YNtmpyoZup0432mwkMm
 Wur3Jxd0GFz7zUkqvN3O
 =F4YT
 -----END PGP SIGNATURE-----

Merge tag 'efi-urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/urgent

Pull EFI fixes from Matt Fleming:

 * Prevent accidental deletion of EFI variables through efivarfs that
   may brick machines. We use a whitelist of known-safe variables to
   allow things like installing distributions to work out of the box, and
   instead restrict vendor-specific variable deletion by making
   non-whitelist variables immutable (Peter Jones)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-16 13:14:57 +01:00
Kirill Tkhai
c906f38e88 ext4: fix memleak in ext4_readdir()
When ext4_bread() fails, fname_crypto_str remains
allocated after return. Fix that.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
CC: Dmitry Monakhov <dmonakhov@virtuozzo.com>
2016-02-16 00:20:19 -05:00
Filipe Manana
1636d1d77e Btrfs: fix direct IO requests not reporting IO error to user space
If a bio for a direct IO request fails, we were not setting the error in
the parent bio (the main DIO bio), making us not return the error to
user space in btrfs_direct_IO(), that is, it made __blockdev_direct_IO()
return the number of bytes issued for IO and not the error a bio created
and submitted by btrfs_submit_direct() got from the block layer.
This essentially happens because when we call:

   dio_end_io(dio_bio, bio->bi_error);

It does not set dio_bio->bi_error to the value of the second argument.
So just add this missing assignment in endio callbacks, just as we do in
the error path at btrfs_submit_direct() when we fail to clone the dio bio
or allocate its private object. This follows the convention of what is
done with other similar APIs such as bio_endio() where the caller is
responsible for setting the bi_error field in the bio it passes as an
argument to bio_endio().

This was detected by the new generic test cases in xfstests: 271, 272,
276 and 278. Which essentially setup a dm error target, then load the
error table, do a direct IO write and unload the error table. They
expect the write to fail with -EIO, which was not getting reported
when testing against btrfs.

Cc: stable@vger.kernel.org  # 4.3+
Fixes: 4246a0b63b ("block: add a bi_error field to struct bio")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-02-16 03:41:26 +00:00
Trond Myklebust
e0fa0d0189 pNFS: Always set NFS_LAYOUT_RETURN_REQUESTED with lo->plh_return_iomode
When setting the layout return mode, we must always also set the
NFS_LAYOUT_RETURN_REQUESTED flag to ensure that we send a layoutreturn.
Otherwise pnfs_error_mark_layout_for_return() could set the mode, but
fail to send the layoutreturn because another is already in flight.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-15 13:03:30 -05:00
Trond Myklebust
2f21596882 pNFS: Fix pnfs_mark_matching_lsegs_return()
We don't need to schedule a layoutreturn if the layout segment can
be freed immediately.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-15 12:56:01 -05:00
Greg Kroah-Hartman
c21b04f989 Merge 4.5-rc4 into driver-core-next
We want the fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-14 14:29:55 -08:00
Greg Kroah-Hartman
1b79dff672 Merge 4.5-rc4 into char-misc-next
We want those fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-14 14:25:59 -08:00
Linus Torvalds
779ee19da7 tty/serial fixes for 4.5-rc4
Here are a number of small tty and serial driver fixes for 4.5-rc4 that
 resolve some reported issues.
 
 One of them got reverted as it wasn't correct based on testing, and all
 have been in linux-next for a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iEYEABECAAYFAlbAzkEACgkQMUfUDdst+ylE4QCfXW10ziXSblRUIJubEm45Qhn2
 WJAAoLFMd/eER2TFkBl4E2Y3I7HUaL5d
 =V2Vb
 -----END PGP SIGNATURE-----

Merge tag 'tty-4.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty/serial fixes from Greg KH:
 "Here are a number of small tty and serial driver fixes for 4.5-rc4
  that resolve some reported issues.

  One of them got reverted as it wasn't correct based on testing, and
  all have been in linux-next for a while"

* tag 'tty-4.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  Revert "8250: uniphier: allow modular build with 8250 console"
  pty: make sure super_block is still valid in final /dev/tty close
  pty: fix possible use after free of tty->driver_data
  tty: Add support for PCIe WCH382 2S multi-IO card
  serial/omap: mark wait_for_xmitr as __maybe_unused
  serial: omap: Prevent DoS using unprivileged ioctl(TIOCSRS485)
  8250: uniphier: allow modular build with 8250 console
  tty: Drop krefs for interrupted tty lock
2016-02-14 12:29:59 -08:00
Linus Torvalds
27c9d772e5 Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This has a few fixes from Filipe, along with a readdir fix from Dave
  that we've been testing for some time"

* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: properly set the termination value of ctx->pos in readdir
  Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl
  Btrfs: remove no longer used function extent_read_full_page_nolock()
  Btrfs: fix page reading in extent_same ioctl leading to csum errors
  Btrfs: fix invalid page accesses in extent_same (dedup) ioctl
2016-02-12 09:21:28 -08:00
Linus Torvalds
dfc852864d xfs: updates for 4.5-rc4
Contains:
 o fix for endian conversion issue in new CRC validation in
   log recovery that was discovered on a ppc64 platform.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJWvVMvAAoJEK3oKUf0dfodgugP/ioNOzuHYWUabT9Eltq+5ca4
 r4/vp8OcQfxQDepWacPhO3K5sR8yI7eM5KG0+NmrIO06HypLjQpLT9h6RB/oj7G2
 bIpSzYwWC/D2fCt3HNMLfHmbZRCAOSP8szeKaWp+G3VfF9cZKVpiqm0QJdQc6xi2
 52ePFw/Irx2AaThW41U6UP+B9Efw7bBQkSd4ogHXMdilG1tqsWFxTAee7V7+Z3oQ
 Pj5g6DpOpV5we3EoYGgdVu4+b66AS5I5srhm8to2wHGu22IE3z4rn1B151gRNVx7
 qrtDGdPQcgJkg5g9Ez3LLX1ikN3cBn67ZNGNXpkDveABdS17trgQ5IWxNB6Z57zq
 +nnSQnwhmr/CCZoP9F2OzVE4WUtqr5fk07NEx6eOyHcVMQeeW2fzwjzL/BXZze4/
 55GwrQU86VYHjMXeD8876qRz108v5/tlBqBrGJ6qxAMEPdKizxJFbQEE06LMz1Mi
 v9hga7ssJilo4BDIsrDZ/nkdT01rY8HZiIMUd117TWdhwafvH8wyp9a/wtdbRvu7
 tcea+n2gTGkkGcfTmgWQBvtqKU/gpDDLtpIO5zZzNE1kCiITcwnANKus9Ha0/DNj
 WBdO/a+bN+i32G0+SI9qnYMCeGMgG6FBLGVk9ZytBvRLr4Fkq2eVFmK1bWpvTMTE
 vn8QunELgutZxR2PnOyc
 =ti5p
 -----END PGP SIGNATURE-----

Merge tag 'xfs-fixes-for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs fix from Dve Chinner:
 "This contains a fix for an endian conversion issue in new CRC
  validation in log recovery that was discovered on a ppc64 platform"

* tag 'xfs-fixes-for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: fix endianness error when checking log block crc on big endian platforms
2016-02-12 09:17:03 -08:00
Eryu Guan
56263b4ceb ext4: remove unused parameter "newblock" in convert_initialized_extent()
The "newblock" parameter is not used in convert_initialized_extent(),
remove it.

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-12 01:23:00 -05:00
Eryu Guan
bcff24887d ext4: don't read blocks from disk after extents being swapped
I notice ext4/307 fails occasionally on ppc64 host, reporting md5
checksum mismatch after moving data from original file to donor file.

The reason is that move_extent_per_page() calls __block_write_begin()
and block_commit_write() to write saved data from original inode blocks
to donor inode blocks, but __block_write_begin() not only maps buffer
heads but also reads block content from disk if the size is not block
size aligned.  At this time the physical block number in mapped buffer
head is pointing to the donor file not the original file, and that
results in reading wrong data to page, which get written to disk in
following block_commit_write call.

This also can be reproduced by the following script on 1k block size ext4
on x86_64 host:

    mnt=/mnt/ext4
    donorfile=$mnt/donor
    testfile=$mnt/testfile
    e4compact=~/xfstests/src/e4compact

    rm -f $donorfile $testfile

    # reserve space for donor file, written by 0xaa and sync to disk to
    # avoid EBUSY on EXT4_IOC_MOVE_EXT
    xfs_io -fc "pwrite -S 0xaa 0 1m" -c "fsync" $donorfile

    # create test file written by 0xbb
    xfs_io -fc "pwrite -S 0xbb 0 1023" -c "fsync" $testfile

    # compute initial md5sum
    md5sum $testfile | tee md5sum.txt
    # drop cache, force e4compact to read data from disk
    echo 3 > /proc/sys/vm/drop_caches

    # test defrag
    echo "$testfile" | $e4compact -i -v -f $donorfile
    # check md5sum
    md5sum -c md5sum.txt

Fix it by creating & mapping buffer heads only but not reading blocks
from disk, because all the data in page is guaranteed to be up-to-date
in mext_page_mkuptodate().

Cc: stable@vger.kernel.org
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-12 01:20:43 -05:00
Insu Yun
46901760b4 ext4: fix potential integer overflow
Since sizeof(ext_new_group_data) > sizeof(ext_new_flex_group_data),
integer overflow could be happened.
Therefore, need to fix integer overflow sanitization.

Cc: stable@vger.kernel.org
Signed-off-by: Insu Yun <wuninsu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-12 01:15:59 -05:00
Huaitong Han
802cf1f9f5 ext4: add a line break for proc mb_groups display
This patch adds a line break for proc mb_groups display.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-02-12 00:17:16 -05:00
Anton Protopopov
fdde368e7c ext4: ioctl: fix erroneous return value
The ext4_ioctl_setflags() function which is used in the ioctls
EXT4_IOC_SETFLAGS and EXT4_IOC_FSSETXATTR may return the positive value
EPERM instead of -EPERM in case of error. This bug was introduced by a
recent commit 9b7365fc.

The following program can be used to illustrate the wrong behavior:

    #include <sys/types.h>
    #include <sys/ioctl.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <err.h>

    #define FS_IOC_GETFLAGS _IOR('f', 1, long)
    #define FS_IOC_SETFLAGS _IOW('f', 2, long)
    #define FS_IMMUTABLE_FL 0x00000010

    int main(void)
    {
        int fd;
        long flags;

        fd = open("file", O_RDWR|O_CREAT, 0600);
        if (fd < 0)
            err(1, "open");

        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
            err(1, "ioctl: FS_IOC_GETFLAGS");

        flags |= FS_IMMUTABLE_FL;

        if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
            err(1, "ioctl: FS_IOC_SETFLAGS");

        warnx("ioctl returned no error");

        return 0;
    }

Running it gives the following result:

    $ strace -e ioctl ./test
    ioctl(3, FS_IOC_GETFLAGS, 0x7ffdbd8bfd38) = 0
    ioctl(3, FS_IOC_SETFLAGS, 0x7ffdbd8bfd38) = 1
    test: ioctl returned no error
    +++ exited with 0 +++

Running the program on a kernel with the bug fixed gives the proper result:

    $ strace -e ioctl ./test
    ioctl(3, FS_IOC_GETFLAGS, 0x7ffdd2768258) = 0
    ioctl(3, FS_IOC_SETFLAGS, 0x7ffdd2768258) = -1 EPERM (Operation not permitted)
    test: ioctl: FS_IOC_SETFLAGS: Operation not permitted
    +++ exited with 1 +++

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-11 23:57:21 -05:00
Jan Kara
05145bd799 ext4: fix scheduling in atomic on group checksum failure
When block group checksum is wrong, we call ext4_error() while holding
group spinlock from ext4_init_block_bitmap() or
ext4_init_inode_bitmap() which results in scheduling while in atomic.
Fix the issue by calling ext4_error() later after dropping the spinlock.

CC: stable@vger.kernel.org
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-02-11 23:15:12 -05:00
David Sterba
bc4ef7592f btrfs: properly set the termination value of ctx->pos in readdir
The value of ctx->pos in the last readdir call is supposed to be set to
INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a
larger value, then it's LLONG_MAX.

There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++"
overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a
64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before
the increment.

We can get to that situation like that:

* emit all regular readdir entries
* still in the same call to readdir, bump the last pos to INT_MAX
* next call to readdir will not emit any entries, but will reach the
  bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX

Normally this is not a problem, but if we call readdir again, we'll find
'pos' set to LLONG_MAX and the unconditional increment will overflow.

The report from Victor at
(http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging
print shows that pattern:

 Overflow: e
 Overflow: 7fffffff
 Overflow: 7fffffffffffffff
 PAX: size overflow detected in function btrfs_real_readdir
   fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0;
   context: dir_context;
 CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1
 Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015
  ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48
  ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78
  ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8
 Call Trace:
  [<ffffffff81742f0f>] dump_stack+0x4c/0x7f
  [<ffffffff811cb706>] report_size_overflow+0x36/0x40
  [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0
  [<ffffffff811dafc8>] iterate_dir+0xa8/0x150
  [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70
  [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0
 Overflow: 1a
  [<ffffffff811db070>] ? iterate_dir+0x150/0x150
  [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83

The jump from 7fffffff to 7fffffffffffffff happens when new dir entries
are not yet synced and are processed from the delayed list. Then the code
could go to the bump section again even though it might not emit any new
dir entries from the delayed list.

The fix avoids entering the "bump" section again once we've finished
emitting the entries, both for synced and delayed entries.

References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284
Reported-by: Victor <services@swwu.com>
CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.com>
Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-02-11 07:01:59 -08:00
Anton Protopopov
4b550af519 cifs: fix erroneous return value
The setup_ntlmv2_rsp() function may return positive value ENOMEM instead
of -ENOMEM in case of kmalloc failure.

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-02-10 18:23:31 -06:00
Insu Yun
f34d69c3e5 cifs: fix potential overflow in cifs_compose_mount_options
In worst case, "ip=" + sb_mountdata + ipv6 can be copied into mountdata.
Therefore, for safe, it is better to add more size when allocating memory.

Signed-off-by: Insu Yun <wuninsu@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-02-10 18:04:56 -06:00
Colin Ian King
997152f627 cifs: remove redundant check for null string pointer
server_RFC1001_name is declared as a RFC1001_NAME_LEN_WITH_NULL sized
char array in struct TCP_Server_Info so the null pointer check on
server_RFC1001_name is redundant and can be removed.  Detected with
smatch:

fs/cifs/connect.c:2982 ip_rfc1001_connect() warn: this array is probably
  non-NULL. 'server->server_RFC1001_name'

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-02-10 18:04:53 -06:00
Peter Jones
ed8b0de5a3 efi: Make efivarfs entries immutable by default
"rm -rf" is bricking some peoples' laptops because of variables being
used to store non-reinitializable firmware driver data that's required
to POST the hardware.

These are 100% bugs, and they need to be fixed, but in the mean time it
shouldn't be easy to *accidentally* brick machines.

We have to have delete working, and picking which variables do and don't
work for deletion is quite intractable, so instead make everything
immutable by default (except for a whitelist), and make tools that
aren't quite so broad-spectrum unset the immutable flag.

Signed-off-by: Peter Jones <pjones@redhat.com>
Tested-by: Lee, Chun-Yi <jlee@suse.com>
Acked-by: Matthew Garrett <mjg59@coreos.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
2016-02-10 16:25:52 +00:00
Peter Jones
e0d64e6a88 efi: Use ucs2_as_utf8 in efivarfs instead of open coding a bad version
Translate EFI's UCS-2 variable names to UTF-8 instead of just assuming
all variable names fit in ASCII.

Signed-off-by: Peter Jones <pjones@redhat.com>
Acked-by: Matthew Garrett <mjg59@coreos.com>
Tested-by: Lee, Chun-Yi <jlee@suse.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
2016-02-10 13:19:14 +00:00
Bamvor Jian Zhang
6a4f555aa1 fs/compat: remove useless compat ioctl for parport device
Compat ioctl is already introduced in drivers/char/ppdev.c in order to
fix y2038 issue for PP[GS]ETTIME. There is no need to define these
here.

Suggested-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Bamvor Jian Zhang <bamvor.zhangjian@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-08 15:00:04 -08:00
Theodore Ts'o
ff978b09f9 ext4 crypto: move context consistency check to ext4_file_open()
In the case where the per-file key for the directory is cached, but
root does not have access to the key needed to derive the per-file key
for the files in the directory, we allow the lookup to succeed, so
that lstat(2) and unlink(2) can suceed.  However, if a program tries
to open the file, it will get an ENOKEY error.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-08 00:54:26 -05:00
Tejun Heo
e56ed358af kernfs: make kernfs_walk_ns() use kernfs_pr_cont_buf[]
kernfs_walk_ns() uses a static path_buf[PATH_MAX] to separate out path
components.  Keeping around the 4k buffer just for kernfs_walk_ns() is
wasteful.  This patch makes it piggyback on kernfs_pr_cont_buf[]
instead.  This requires kernfs_walk_ns() to hold kernfs_rename_lock.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-07 20:21:35 -08:00
Theodore Ts'o
28b4c26396 ext4 crypto: revalidate dentry after adding or removing the key
Add a validation check for dentries for encrypted directory to make
sure we're not caching stale data after a key has been added or removed.

Also check to make sure that status of the encryption key is updated
when readdir(2) is executed.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-07 19:35:05 -05:00
Darrick J. Wong
8e0bd4925b xfs: fix endianness error when checking log block crc on big endian platforms
Since the checksum function and the field are both __le32, don't
perform endian conversion when comparing the two.  This fixes mount
failures on ppc64.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 11:03:58 +11:00
Herton R. Krzesinski
1f55c718c2 pty: make sure super_block is still valid in final /dev/tty close
Considering current pty code and multiple devpts instances, it's possible
to umount a devpts file system while a program still has /dev/tty opened
pointing to a previosuly closed pty pair in that instance. In the case all
ptmx and pts/N files are closed, umount can be done. If the program closes
/dev/tty after umount is done, devpts_kill_index will use now an invalid
super_block, which was already destroyed in the umount operation after
running ->kill_sb. This is another "use after free" type of issue, but now
related to the allocated super_block instance.

To avoid the problem (warning at ida_remove and potential crashes) for
this specific case, I added two functions in devpts which grabs additional
references to the super_block, which pty code now uses so it makes sure
the super block structure is still valid until pty shutdown is done.
I also moved the additional inode references to the same functions, which
also covered similar case with inode being freed before /dev/tty final
close/shutdown.

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Cc: stable@vger.kernel.org # 2.6.29+
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-06 23:45:46 -08:00
Linus Torvalds
5af9c2e19d Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "22 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
  epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT
  radix-tree: fix oops after radix_tree_iter_retry
  MAINTAINERS: trim the file triggers for ABI/API
  dax: dirty inode only if required
  thp: make deferred_split_scan() work again
  mm: replace vma_lock_anon_vma with anon_vma_lock_read/write
  ocfs2/dlm: clear refmap bit of recovery lock while doing local recovery cleanup
  um: asm/page.h: remove the pte_high member from struct pte_t
  mm, hugetlb: don't require CMA for runtime gigantic pages
  mm/hugetlb: fix gigantic page initialization/allocation
  mm: downgrade VM_BUG in isolate_lru_page() to warning
  mempolicy: do not try to queue pages from !vma_migratable()
  mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any progress
  vmstat: make vmstat_update deferrable
  mm, vmstat: make quiet_vmstat lighter
  mm/Kconfig: correct description of DEFERRED_STRUCT_PAGE_INIT
  memblock: don't mark memblock_phys_mem_size() as __init
  dump_stack: avoid potential deadlocks
  mm: validate_mm browse_rb SMP race condition
  m32r: fix build failure due to SMP and MMU
  ...
2016-02-05 20:20:07 -08:00
Linus Torvalds
5d6a6a75e0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes from Sage Weil:
 "We have a few wire protocol compatibility fixes, ports of a few recent
  CRUSH mapping changes, and a couple error path fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  libceph: MOSDOpReply v7 encoding
  libceph: advertise support for TUNABLES5
  crush: decode and initialize chooseleaf_stable
  crush: add chooseleaf_stable tunable
  crush: ensure take bucket value is valid
  crush: ensure bucket id is valid before indexing buckets array
  ceph: fix snap context leak in error path
  ceph: checking for IS_ERR instead of NULL
2016-02-05 19:52:57 -08:00
Jason Baron
b6a515c8a0 epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT
In the current implementation of the EPOLLEXCLUSIVE flag (added for
4.5-rc1), if epoll waiters create different POLL* sets and register them
as exclusive against the same target fd, the current implementation will
stop waking any further waiters once it finds the first idle waiter.
This means that waiters could miss wakeups in certain cases.

For example, when we wake up a pipe for reading we do:
wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM); So if
one epoll set or epfd is added to pipe p with POLLIN and a second set
epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the
wakeup since the current implementation will stop after it finds any
intersection of events with a waiter that is blocked in epoll_wait().

We could potentially address this by requiring all epoll waiters that
are added to p be required to pass the same set of POLL* events.  IE the
first EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set POLL*
flags to be used by any other epfds that are added as EPOLLEXCLUSIVE.
However, I think it might be somewhat confusing interface as we would
have to reference count the number of users for that set, and so
userspace would have to keep track of that count, or we would need a
more involved interface.  It also adds some shared state that we'd have
store somewhere.  I don't think anybody will want to bloat
__wait_queue_head for this.

I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE
such that it can only be specified with EPOLLIN and/or EPOLLOUT.  So
that way if the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop
once we hit the first idle waiter that specifies the EPOLLIN bit, since
any remaining waiters that only have 'POLLOUT' set wouldn't need to be
woken.  Likewise, we can do the same thing if 'POLLOUT' is in the wakeup
bit set and not 'POLLIN'.  If both 'POLLOUT' and 'POLLIN' are set in the
wake bit set (there is at least one example of this I saw in fs/pipe.c),
then we just wake the entire exclusive list.  Having both 'POLLOUT' and
'POLLIN' both set should not be on any performance critical path, so I
think that's ok (in fs/pipe.c its in pipe_release()).  We also continue
to include EPOLLERR and EPOLLHUP by default in any exclusive set.  Thus,
the user can specify EPOLLERR and/or EPOLLHUP but is not required to do
so.

Since epoll waiters may be interested in other events as well besides
EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by
doing a 'dup' call on the target fd and adding that as one normally
would with EPOLL_CTL_ADD.  Since I think that the POLLIN and POLLOUT
events are what we are interest in balancing, I think that the 'dup'
thing could perhaps be added to only one of the waiter threads.
However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP should be
sufficient for the majority of use-cases.

Since EPOLLEXCLUSIVE is intended to be used with a target fd shared
among multiple epfds, where between 1 and n of the epfds may receive an
event, it does not satisfy the semantics of EPOLLONESHOT where only 1
epfd would get an event.  Thus, it is not allowed to be specified in
conjunction with EPOLLEXCLUSIVE.

EPOLL_CTL_MOD is also not allowed if the fd was previously added as
EPOLLEXCLUSIVE.  It seems with the limited number of flags to not be as
interesting, but this could be relaxed at some further point.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Tested-by: Madars Vitolins <m@silodev.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Eric Wong <normalperson@yhbt.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05 18:10:40 -08:00
Dmitry Monakhov
d2b2a28e64 dax: dirty inode only if required
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05 18:10:40 -08:00
xuejiufei
c95a51807b ocfs2/dlm: clear refmap bit of recovery lock while doing local recovery cleanup
When recovery master down, dlm_do_local_recovery_cleanup() only remove
the $RECOVERY lock owned by dead node, but do not clear the refmap bit.
Which will make umount thread falling in dead loop migrating $RECOVERY
to the dead node.

Signed-off-by: xuejiufei <xuejiufei@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05 18:10:40 -08:00