Commit graph

36971 commits

Author SHA1 Message Date
Zhang Zhen
590a141863 ext4: remove readpage() check in ext4_mmap_file()
There is no kind of file which does not supply a page reading function.

Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-15 09:56:19 -04:00
Lukas Czerner
4f579ae7de ext4: fix punch hole on files with indirect mapping
Currently punch hole code on files with direct/indirect mapping has some
problems which may lead to a data loss. For example (from Jan Kara):

fallocate -n -p 10240000 4096

will punch the range 10240000 - 12632064 instead of the range 1024000 -
10244096.

Also the code is a bit weird and it's not using infrastructure provided
by indirect.c, but rather creating it's own way.

This patch fixes the issues as well as making the operation to run 4
times faster from my testing (punching out 60GB file). It uses similar
approach used in ext4_ind_truncate() which takes advantage of
ext4_free_branches() function.

Also rename the ext4_free_hole_blocks() to something more sensible, like
the equivalent we have for extent mapped files. Call it
ext4_ind_remove_space().

This has been tested mostly with fsx and some xfstests which are testing
punch hole but does not require unwritten extents which are not
supported with direct/indirect mapping. Not problems showed up even with
1024k block size.

CC: stable@vger.kernel.org
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-15 06:03:38 -04:00
Theodore Ts'o
71d4f7d032 ext4: remove metadata reservation checks
Commit 27dd438542 ("ext4: introduce reserved space") reserves 2% of
the file system space to make sure metadata allocations will always
succeed.  Given that, tracking the reservation of metadata blocks is
no longer necessary.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-15 06:02:38 -04:00
Theodore Ts'o
d5e03cbb0c ext4: rearrange initialization to fix EXT4FS_DEBUG
The EXT4FS_DEBUG is a *very* developer specific #ifdef designed for
ext4 developers only.  (You have to modify fs/ext4/ext4.h to enable
it.)

Rearrange how we initialize data structures to avoid calling
ext4_count_free_clusters() until the multiblock allocator has been
initialized.

This also allows us to only call ext4_count_free_clusters() once, and
simplifies the code somewhat.

(Thanks to Chen Gang <gang.chen.5i5j@gmail.com> for pointing out a
!CONFIG_SMP compile breakage in the original patch.)

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-07-15 06:01:38 -04:00
Dave Chinner
03e01349c6 xfs: null unused quota inodes when quota is on
When quota is on, it is expected that unused quota inodes have a
value of NULLFSINO. The changes to support a separate project quota
in 3.12 broken this rule for non-project quota inode enabled
filesystem, as the code now refuses to write the group quota inode
if neither group or project quotas are enabled. This regression was
introduced by commit d892d58 ("xfs: Start using pquotaino from the
superblock").

In this case, we should be writing NULLFSINO rather than nothing to
ensure that we leave the group quota inode in a valid state while
quotas are enabled.

Failure to do so doesn't cause a current kernel to break - the
separate project quota inodes introduced translation code to always
treat a zero inode as NULLFSINO. This was introduced by commit
0102629 ("xfs: Initialize all quota inodes to be NULLFSINO") with is
also in 3.12 but older kernels do not do this and hence taking a
filesystem back to an older kernel can result in quotas failing
initialisation at mount time. When that happens, we see this in
dmesg:

[ 1649.215390] XFS (sdb): Mounting Filesystem
[ 1649.316894] XFS (sdb): Failed to initialize disk quotas.
[ 1649.316902] XFS (sdb): Ending clean mount

By ensuring that we write NULLFSINO to quota inodes that aren't
active, we avoid this problem. We have to be really careful when
determining if the quota inodes are active or not, because we don't
want to write a NULLFSINO if the quota inodes are active and we
simply aren't updating them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 07:28:41 +10:00
Dave Chinner
cf11da9c5d xfs: refine the allocation stack switch
The allocation stack switch at xfs_bmapi_allocate() has served it's
purpose, but is no longer a sufficient solution to the stack usage
problem we have in the XFS allocation path.

Whilst the kernel stack size is now 16k, that is not a valid reason
for undoing all our "keep stack usage down" modifications. What it
does allow us to do is have the freedom to refine and perfect the
modifications knowing that if we get it wrong it won't blow up in
our faces - we have a safety net now.

This is important because we still have the issue of older kernels
having smaller stacks and that they are still supported and are
demonstrating a wide range of different stack overflows.  Red Hat
has several open bugs for allocation based stack overflows from
directory modifications and direct IO block allocation and these
problems still need to be solved. If we can solve them upstream,
then distro's won't need to bake their own unique solutions.

To that end, I've observed that every allocation based stack
overflow report has had a specific characteristic - it has happened
during or directly after a bmap btree block split. That event
requires a new block to be allocated to the tree, and so we
effectively stack one allocation stack on top of another, and that's
when we get into trouble.

A further observation is that bmap btree block splits are much rarer
than writeback allocation - over a range of different workloads I've
observed the ratio of bmap btree inserts to splits ranges from 100:1
(xfstests run) to 10000:1 (local VM image server with sparse files
that range in the hundreds of thousands to millions of extents).
Either way, bmap btree split events are much, much rarer than
allocation events.

Finally, we have to move the kswapd state to the allocation workqueue
work when allocation is done on behalf of kswapd. This is proving to
cause significant perturbation in performance under memory pressure
and appears to be generating allocation deadlock warnings under some
workloads, so avoiding the use of a workqueue for the majority of
kswapd writeback allocation will minimise the impact of such
behaviour.

Hence it makes sense to move the stack switch to xfs_btree_split()
and only do it for bmap btree splits. Stack switches during
allocation will be much rarer, so there won't be significant
performacne overhead caused by switching stacks. The worse case
stack from all allocation paths will be split, not just writeback.
And the majority of memory allocations will be done in the correct
context (e.g. kswapd) without causing additional latency, and so we
simplify the memory reclaim interactions between processes,
workqueues and kswapd.

The worst stack I've been able to generate with this patch in place
is 5600 bytes deep. It's very revealing because we exit XFS at:

37)     1768      64   kmem_cache_alloc+0x13b/0x170

about 1800 bytes of stack consumed, and the remaining 3800 bytes
(and 36 functions) is memory reclaim, swap and the IO stack. And
this occurs in the inode allocation from an open(O_CREAT) syscall,
not writeback.

The amount of stack being used is much less than I've previously be
able to generate - fs_mark testing has been able to generate stack
usage of around 7k without too much trouble; with this patch it's
only just getting to 5.5k. This is primarily because the metadata
allocation paths (e.g. directory blocks) are no longer causing
double splits on the same stack, and hence now stack tracing is
showing swapping being the worst stack consumer rather than XFS.

Performance of fs_mark inode create workloads is unchanged.
Performance of fs_mark async fsync workloads is consistently good
with context switches reduced by around 150,000/s (30%).
Performance of dbench, streaming IO and postmark is unchanged.
Allocation deadlock warnings have not been seen on the workloads
that generated them since adding this patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 07:08:24 +10:00
Dave Chinner
aa182e64f1 Revert "xfs: block allocation work needs to be kswapd aware"
This reverts commit 1f6d64829d.

This commit resulted in regressions in performance in low
memory situations where kswapd was doing writeback of delayed
allocation blocks. It resulted in significant parallelism of the
kswapd work and with the special kswapd flags meant that hundreds of
active allocation could dip into kswapd specific memory reserves and
avoid being throttled. This cause a large amount of performance
variation, as well as random OOM-killer invocations that didn't
previously exist.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 07:08:10 +10:00
Benjamin LaHaise
263782c1c9 aio: protect reqs_available updates from changes in interrupt handlers
As of commit f8567a3845 it is now possible to
have put_reqs_available() called from irq context.  While put_reqs_available()
is per cpu, it did not protect itself from interrupts on the same CPU.  This
lead to aio_complete() corrupting the available io requests count when run
under a heavy O_DIRECT workloads as reported by Robert Elliott.  Fix this by
disabling irq updates around the per cpu batch updates of reqs_available.

Many thanks to Robert and folks for testing and tracking this down.

Reported-by: Robert Elliot <Elliott@hp.com>
Tested-by: Robert Elliot <Elliott@hp.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@infradead.org>
Cc: stable@vger.kenel.org
2014-07-14 13:05:26 -04:00
Fabian Frederick
f2b3455e47 fuse: replace count*size kzalloc by kcalloc
kcalloc manages count*sizeof overflow.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-07-14 16:30:25 +02:00
Maxim Patlasov
27f1b36326 fuse: release temporary page if fuse_writepage_locked() failed
tmp_page to be freed if fuse_write_file_get() returns NULL.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-07-14 16:17:57 +02:00
Christoph Hellwig
73a8f5f7e6 locks: purge fl_owner_t from fs/locks.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-07-13 21:39:07 -04:00
Linus Torvalds
18b34d9a7a More bug fixes for ext4 -- most importantly, a fix for a bug
(introduced in 3.15) that can end up triggering a file system
 corruption error after a journal replay.  (It shouldn't lead to any
 actual data corruption, but it is scary and can force file systems to
 be remounted read-only, etc.)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJTwuY+AAoJENNvdpvBGATwgN8QAJ2/S5GFxQwHbglHmayXYuMQ
 fU411FwJ1wbqjYyYb+jyBoYcsgpsCKPTqA2JbPlHsFTm2Ec+BPzsybhtYw5ybdeW
 1qAfPTSgNxYXroNwpaqOamxgfXgOaV4iqwvZ4tYcLcrtPq0MOcC5rlSaKMdJuSA1
 6M2/8PijOTndUVJpS/GhSMdKlTAXjtfv9V6t/pfLuoo7cNadlggpJnwC8Qm9DNAA
 5ETVZK44q2+2YvGwrvY6LBb9BVBpL29YbWPNqqw/OXXY++ZFhBJV07osZO38MpsB
 QzUyfRaMTgm9/BdbkG8uxA7Zk6C0YBl5eC4aU79LWGWjGO225CLj95LoBOVjQw9f
 eh+RFGapwVvtyzScDF/a9pH6UwGco/s4kCq8rLr2ztljlO595N3LUwhQBHtiSGtm
 fr65NRDyJMXbqy8yLGrlOnP/4ll2VfTH+el2+tzr5smoTD29EASM155hKDDUOAG0
 TrDHtNrxG1MIROHjp+HSui424Op7NXTnfjwmuKzo+mGpPOcPclPSmAacFJpRGVBE
 220hnk+LrBf525nJzQYHifdCL+JAqbWv/S4YSRGizgppK3DlO/gYcu1zpWb0WWuo
 0VuvxUZDSIZY1aVpMEOQov74WtovB7YyG8RPHl7h2m5dJuLLFgJmLDMDTJR1LLNT
 +tHNJ6jERLQz9wqTvquh
 =OX7Z
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "More bug fixes for ext4 -- most importantly, a fix for a bug
  introduced in 3.15 that can end up triggering a file system corruption
  error after a journal replay.

  It shouldn't lead to any actual data corruption, but it is scary and
  can force file systems to be remounted read-only, etc"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix potential null pointer dereference in ext4_free_inode
  ext4: fix a potential deadlock in __ext4_es_shrink()
  ext4: revert commit which was causing fs corruption after journal replays
  ext4: disable synchronous transaction batching if max_batch_time==0
  ext4: clarify ext4_error message in ext4_mb_generate_buddy_error()
  ext4: clarify error count warning messages
  ext4: fix unjournalled bg descriptor while initializing inode bitmap
2014-07-13 13:14:55 -07:00
Trond Myklebust
f563b89b18 NFS: Don't reset pg_moreio in __nfs_pageio_add_request
Once we've started sending unstable NFS writes, we do not want to
clear pg_moreio, or we may end up sending the very last request as
a stable write if the commit lists are still empty.

Do, however, reset pg_moreio in the case where we end up having to
recoalesce the write if an attempt to use pNFS failed.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-13 15:18:44 -04:00
Trond Myklebust
aafe37504c NFS: Remove 2 unused variables
Cc: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 17:35:57 -04:00
Weston Andros Adamson
3e2170451e nfs: handle multiple reqs in nfs_wb_page_cancel
Use nfs_lock_and_join_requests to merge all subrequests into the head request -
this cancels and dereferences all subrequests.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 17:35:47 -04:00
Weston Andros Adamson
d458138353 nfs: handle multiple reqs in nfs_page_async_flush
Change nfs_find_and_lock_request so nfs_page_async_flush can handle multiple
requests in a page. There is only one request for a page the first time
nfs_page_async_flush is called, but if a write or commit fails, async_flush
is called again and there may be multiple requests associated with the page.
The solution is to merge all the requests in a page group into a single
request before calling nfs_pageio_add_request.

Rename nfs_find_and_lock_request to nfs_lock_and_join_requests and
change it to first lock all requests for the page, then cancel and merge
all subrequests into the head request.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 17:35:46 -04:00
Weston Andros Adamson
84d3a9a913 nfs: change find_request to find_head_request
nfs_page_find_request_locked* should find the head request for that page.
Rename the functions and add comments to make this clear, and fix a bug
that could return a subrequest when page_private isn't set on the page.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 16:51:41 -04:00
Weston Andros Adamson
85710a837c nfs: nfs_page should take a ref on the head req
nfs_pages that aren't the the head of a group must take a reference on the
head as long as ->wb_head is set to it. This stops the head from hitting
a refcount of 0 while there is still an active nfs_page for the page group.

This avoids kref warnings in the writeback code when the page group head
is found and referenced.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 16:51:41 -04:00
Weston Andros Adamson
17089a29a2 nfs: mark nfs_page reqs with flag for extra ref
Change the use of PG_INODE_REF - set it when taking extra reference on
subrequests and take care to only release once for each request.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 16:51:41 -04:00
Namjae Jeon
bf40c92635 ext4: fix potential null pointer dereference in ext4_free_inode
Fix potential null pointer dereferencing problem caused by e43bb4e612
("ext4: decrement free clusters/inodes counters when block group declared bad")

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-07-12 16:11:42 -04:00
Theodore Ts'o
3f1f9b8513 ext4: fix a potential deadlock in __ext4_es_shrink()
This fixes the following lockdep complaint:

[ INFO: possible circular locking dependency detected ]
3.16.0-rc2-mm1+ #7 Tainted: G           O  
-------------------------------------------------------
kworker/u24:0/4356 is trying to acquire lock:
 (&(&sbi->s_es_lru_lock)->rlock){+.+.-.}, at: [<ffffffff81285fff>] __ext4_es_shrink+0x4f/0x2e0

but task is already holding lock:
 (&ei->i_es_lock){++++-.}, at: [<ffffffff81286961>] ext4_es_insert_extent+0x71/0x180

which lock already depends on the new lock.

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&ei->i_es_lock);
                               lock(&(&sbi->s_es_lru_lock)->rlock);
                               lock(&ei->i_es_lock);
  lock(&(&sbi->s_es_lru_lock)->rlock);

 *** DEADLOCK ***

6 locks held by kworker/u24:0/4356:
 #0:  ("writeback"){.+.+.+}, at: [<ffffffff81071d00>] process_one_work+0x180/0x560
 #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff81071d00>] process_one_work+0x180/0x560
 #2:  (&type->s_umount_key#22){++++++}, at: [<ffffffff811a9c74>] grab_super_passive+0x44/0x90
 #3:  (jbd2_handle){+.+...}, at: [<ffffffff812979f9>] start_this_handle+0x189/0x5f0
 #4:  (&ei->i_data_sem){++++..}, at: [<ffffffff81247062>] ext4_map_blocks+0x132/0x550
 #5:  (&ei->i_es_lock){++++-.}, at: [<ffffffff81286961>] ext4_es_insert_extent+0x71/0x180

stack backtrace:
CPU: 0 PID: 4356 Comm: kworker/u24:0 Tainted: G           O   3.16.0-rc2-mm1+ #7
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Workqueue: writeback bdi_writeback_workfn (flush-253:0)
 ffffffff8213dce0 ffff880014b07538 ffffffff815df0bb 0000000000000007
 ffffffff8213e040 ffff880014b07588 ffffffff815db3dd ffff880014b07568
 ffff880014b07610 ffff88003b868930 ffff88003b868908 ffff88003b868930
Call Trace:
 [<ffffffff815df0bb>] dump_stack+0x4e/0x68
 [<ffffffff815db3dd>] print_circular_bug+0x1fb/0x20c
 [<ffffffff810a7a3e>] __lock_acquire+0x163e/0x1d00
 [<ffffffff815e89dc>] ? retint_restore_args+0xe/0xe
 [<ffffffff815ddc7b>] ? __slab_alloc+0x4a8/0x4ce
 [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0
 [<ffffffff810a8707>] lock_acquire+0x87/0x120
 [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0
 [<ffffffff8128592d>] ? ext4_es_free_extent+0x5d/0x70
 [<ffffffff815e6f09>] _raw_spin_lock+0x39/0x50
 [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0
 [<ffffffff8119760b>] ? kmem_cache_alloc+0x18b/0x1a0
 [<ffffffff81285fff>] __ext4_es_shrink+0x4f/0x2e0
 [<ffffffff812869b8>] ext4_es_insert_extent+0xc8/0x180
 [<ffffffff812470f4>] ext4_map_blocks+0x1c4/0x550
 [<ffffffff8124c4c4>] ext4_writepages+0x6d4/0xd00
	...

Reported-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Minchan Kim <minchan@kernel.org>
Cc: stable@vger.kernel.org
Cc: Zheng Liu <gnehzuil.liu@gmail.com>
2014-07-12 15:32:24 -04:00
Linus Torvalds
bae78dc259 Merge branch 'for-3.16' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfix from Bruce Fields:
 "Another xdr encoding regression that may cause incorrect encoding on
  failures of certain readdirs"

* 'for-3.16' of git://linux-nfs.org/~bfields/linux:
  nfsd: Fix bad reserving space for encoding rdattr_error
2014-07-11 15:10:04 -07:00
Gu Zheng
4b2868aa4f f2fs: remove the unused stat_lock
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-11 15:01:48 -07:00
Gu Zheng
7a6c76b1b2 f2fs: cleanup the needless return of f2fs_create_root_stats
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-11 15:01:47 -07:00
Theodore Ts'o
f9ae9cf5d7 ext4: revert commit which was causing fs corruption after journal replays
Commit 007649375f ("ext4: initialize multi-block allocator before
checking block descriptors") causes the block group descriptor's count
of the number of free blocks to become inconsistent with the number of
free blocks in the allocation bitmap.  This is a harmless form of fs
corruption, but it causes the kernel to potentially remount the file
system read-only, or to panic, depending on the file systems's error
behavior.

Thanks to Eric Whitney for his tireless work to reproduce and to find
the guilty commit.

Fixes: 007649375f ("ext4: initialize multi-block allocator before checking block descriptors"

Cc: stable@vger.kernel.org  # 3.15
Reported-by: David Jander <david@protonic.nl>
Reported-by: Matteo Croce <technoboy85@gmail.com>
Tested-by: Eric Whitney <enwlinux@gmail.com>
Suggested-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-11 13:55:40 -04:00
Marcel Holtmann
f49daa8190 Bluetooth: Move HCI socket definitions into its own header file
All the HCI sockets and ioctl based definitions have been in a global
header file that also includes all the HCI protocol structures. To
make this a bit cleaner, move them into its own file.

This also adjusts fs/compat_ioctl.c to only include this new file
and not all the protocol structures that are not needed.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2014-07-11 13:53:04 +03:00
Chao Yu
81e366f87f f2fs: check name_len of dir entry to prevent from deadloop
We assume that modification of some special application could result in zeroed
name_len, or it is consciously made by somebody. We will deadloop in
find_in_block when name_len of dir entry is zero.

This patch is added for preventing deadloop in above scenario.

change log from v1:
 o use f2fs_bug_on rather than break out from searching dir entry suggested by
Jaegeuk Kim.

Jaegeuk describe:
"Well, IMO, it would be good to add f2fs_bug_on() here with a specific comment.
In the current phase of f2fs, it is more important to investigate the file
system bugs, rather than workarounds for any corrupted images.
And, definitely it needs to stop the kernel if any corrupted image was mounted,
so that we can figure out where the bugs are occurred."

Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-10 17:00:02 -07:00
Linus Torvalds
40f6123737 Merge branch 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "Mostly fixes for the fallouts from the recent cgroup core changes.

  The decoupled nature of cgroup dynamic hierarchy management
  (hierarchies are created dynamically on mount but may or may not be
  reused once unmounted depending on remaining usages) led to more
  ugliness being added to kernfs.

  Hopefully, this is the last of it"

* 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset: break kernfs active protection in cpuset_write_resmask()
  cgroup: fix a race between cgroup_mount() and cgroup_kill_sb()
  kernfs: introduce kernfs_pin_sb()
  cgroup: fix mount failure in a corner case
  cpuset,mempolicy: fix sleeping function called from invalid context
  cgroup: fix broken css_has_online_children()
2014-07-10 11:38:23 -07:00
Miklos Szeredi
4237ba43b6 fuse: restructure ->rename2()
Make ->rename2() universal, i.e. able to handle zero flags.  This is to
make future change of the API easier.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-07-10 10:50:19 +02:00
Rahul Bedarkar
88e412ea5e fs: debugfs: remove trailing whitespace
fixes checkpatch.pl trailing whitespace errors

Signed-off-by: Rahul Bedarkar <rahulbedarkar89@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09 16:58:21 -07:00
Fabian Frederick
8278bd3abd kernfs: kernel-doc warning fix
s/static_name/name_is_static

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09 16:37:29 -07:00
Steven Rostedt
485d44022a debugfs: Fix corrupted loop in debugfs_remove_recursive
[ I'm currently running my tests on it now, and so far, after a few
 hours it has yet to blow up. I'll run it for 24 hours which it never
 succeeded in the past. ]

The tracing code has a way to make directories within the debugfs file
system as well as deleting them using mkdir/rmdir in the instance
directory. This is very limited in functionality, such as there is
no renames, and the parent directory "instance" can not be modified.
The tracing code creates the instance directory from the debugfs code
and then replaces the dentry->d_inode->i_op with its own to allow
for mkdir/rmdir to work.

When these are called, the d_entry and inode locks need to be released
to call the instance creation and deletion code. That code has its own
accounting and locking to serialize everything to prevent multiple
users from causing harm. As the parent "instance" directory can not
be modified this simplifies things.

I created a stress test that creates several threads that randomly
creates and deletes directories thousands of times a second. The code
stood up to this test and I submitted it a while ago.

Recently I added a new test that adds readers to the mix. While the
instance directories were being added and deleted, readers would read
from these directories and even enable tracing within them. This test
was able to trigger a bug:

 general protection fault: 0000 [#1] PREEMPT SMP
 Modules linked in: ...
 CPU: 3 PID: 17789 Comm: rmdir Tainted: G        W     3.15.0-rc2-test+ #41
 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
 task: ffff88003786ca60 ti: ffff880077018000 task.ti: ffff880077018000
 RIP: 0010:[<ffffffff811ed5eb>]  [<ffffffff811ed5eb>] debugfs_remove_recursive+0x1bd/0x367
 RSP: 0018:ffff880077019df8  EFLAGS: 00010246
 RAX: 0000000000000002 RBX: ffff88006f0fe490 RCX: 0000000000000000
 RDX: dead000000100058 RSI: 0000000000000246 RDI: ffff88003786d454
 RBP: ffff88006f0fe640 R08: 0000000000000628 R09: 0000000000000000
 R10: 0000000000000628 R11: ffff8800795110a0 R12: ffff88006f0fe640
 R13: ffff88006f0fe640 R14: ffffffff81817d0b R15: ffffffff818188b7
 FS:  00007ff13ae24700(0000) GS:ffff88007d580000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: 0000003054ec7be0 CR3: 0000000076d51000 CR4: 00000000000007e0
 Stack:
  ffff88007a41ebe0 dead000000100058 00000000fffffffe ffff88006f0fe640
  0000000000000000 ffff88006f0fe678 ffff88007a41ebe0 ffff88003793a000
  00000000fffffffe ffffffff810bde82 ffff88006f0fe640 ffff88007a41eb28
 Call Trace:
  [<ffffffff810bde82>] ? instance_rmdir+0x15b/0x1de
  [<ffffffff81132e2d>] ? vfs_rmdir+0x80/0xd3
  [<ffffffff81132f51>] ? do_rmdir+0xd1/0x139
  [<ffffffff8124ad9e>] ? trace_hardirqs_on_thunk+0x3a/0x3c
  [<ffffffff814fea62>] ? system_call_fastpath+0x16/0x1b
 Code: fe ff ff 48 8d 75 30 48 89 df e8 c9 fd ff ff 85 c0 75 13 48 c7 c6 b8 cc d2 81 48 c7 c7 b0 cc d2 81 e8 8c 7a f5 ff 48 8b 54 24 08 <48> 8b 82 a8 00 00 00 48 89 d3 48 2d a8 00 00 00 48 89 44 24 08
 RIP  [<ffffffff811ed5eb>] debugfs_remove_recursive+0x1bd/0x367
  RSP <ffff880077019df8>

It took a while, but every time it triggered, it was always in the
same place:

	list_for_each_entry_safe(child, next, &parent->d_subdirs, d_u.d_child) {

Where the child->d_u.d_child seemed to be corrupted.  I added lots of
trace_printk()s to see what was wrong, and sure enough, it was always
the child's d_u.d_child field. I looked around to see what touches
it and noticed that in __dentry_kill() which calls dentry_free():

static void dentry_free(struct dentry *dentry)
{
	/* if dentry was never visible to RCU, immediate free is OK */
	if (!(dentry->d_flags & DCACHE_RCUACCESS))
		__d_free(&dentry->d_u.d_rcu);
	else
		call_rcu(&dentry->d_u.d_rcu, __d_free);
}

I also noticed that __dentry_kill() unlinks the child->d_u.child
under the parent->d_lock spin_lock.

Looking back at the loop in debugfs_remove_recursive() it never takes the
parent->d_lock to do the list walk. Adding more tracing, I was able to
prove this was the issue:

 ftrace-t-15385   1.... 246662024us : dentry_kill <ffffffff81138b91>: free ffff88006d573600
    rmdir-15409   2.... 246662024us : debugfs_remove_recursive <ffffffff811ec7e5>: child=ffff88006d573600 next=dead000000100058

The dentry_kill freed ffff88006d573600 just as the remove recursive was walking
it.

In order to fix this, the list walk needs to be modified a bit to take
the parent->d_lock. The safe version is no longer necessary, as every
time we remove a child, the parent->d_lock must be released and the
list walk must start over. Each time a child is removed, even though it
may still be on the list, it should be skipped by the first check
in the loop:

		if (!debugfs_positive(child))
			continue;

Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09 16:37:29 -07:00
Chao Yu
6b2920a513 f2fs: use inner macro and function to clean up codes
In this patch we use below inner macro and function to clean up codes.
1. ADDRS_PER_PAGE
2. SM_I
3. f2fs_readonly

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:26 -07:00
Chao Yu
3aab8f828e f2fs: introduce f2fs_write_failed to handle error case when write
When we fail in ->write_begin()/->direct_IO(), our allocated node block in disk
and page cache are still kept, despite these may not be used again.

This patch introduce f2fs_write_failed() to handle the error case of these two
interfaces, it will truncate page cache and blocks of this file according to
i_size.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:26 -07:00
Gu Zheng
eee6160f2e f2fs: arguments cleanup of finding file flow functions
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:26 -07:00
Gu Zheng
1c3bb97899 f2fs: remove the needless point-cast
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:26 -07:00
Gu Zheng
34e6d456da f2fs: remove the redundant validation check of acl
kernel side(xx_init_acl), the acl is get/cloned from the parent dir's,
which is credible. So remove the redundant validation check of acl
here.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:25 -07:00
Chao Yu
1256010ab1 f2fs: reduce region of f2fs_lock_op covered for better concurrency
In our rename process, region of f2fs_lock_op covered is too big as some of the
code like f2fs_empty_dir/f2fs_find_entry are not needed to protect by this lock.

So in the extreme case like doing checkpoint when we rename old inode to exist
inode in a large directory could cause lower concurrency.

Let's reduce the region of f2fs_lock_op to fix this.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:25 -07:00
Fabian Frederick
b434babf85 f2fs: replace count*size kzalloc by kcalloc
kcalloc manages count*sizeof overflow.

Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:25 -07:00
Chao Yu
aec71382c6 f2fs: refactor flush_nat_entries codes for reducing NAT writes
Although building NAT journal in cursum reduce the read/write work for NAT
block, but previous design leave us lower performance when write checkpoint
frequently for these cases:
1. if journal in cursum has already full, it's a bit of waste that we flush all
   nat entries to page for persistence, but not to cache any entries.
2. if journal in cursum is not full, we fill nat entries to journal util
   journal is full, then flush the left dirty entries to disk without merge
   journaled entries, so these journaled entries may be flushed to disk at next
   checkpoint but lost chance to flushed last time.

In this patch we merge dirty entries located in same NAT block to nat entry set,
and linked all set to list, sorted ascending order by entries' count of set.
Later we flush entries in sparse set into journal as many as we can, and then
flush merged entries to disk. In this way we can not only gain in performance,
but also save lifetime of flash device.

In my testing environment, it shows this patch can help to reduce NAT block
writes obviously. In hard disk test case: cost time of fsstress is stablely
reduced by about 5%.

1. virtual machine + hard disk:
fsstress -p 20 -n 200 -l 5
		node num	cp count	nodes/cp
based		4599.6		1803.0		2.551
patched		2714.6		1829.6		1.483

2. virtual machine + 32g micro SD card:
fsstress -p 20 -n 200 -l 1 -w -f chown=0 -f creat=4 -f dwrite=0
-f fdatasync=4 -f fsync=4 -f link=0 -f mkdir=4 -f mknod=4 -f rename=5
-f rmdir=5 -f symlink=0 -f truncate=4 -f unlink=5 -f write=0 -S

		node num	cp count	nodes/cp
based		84.5		43.7		1.933
patched		49.2		40.0		1.23

Our latency of merging op shows not bad when handling extreme case like:
merging a great number of dirty nats:
latency(ns)	dirty nat count
3089219		24922
5129423		27422
4000250		24523

change log from v1:
 o fix wrong logic in add_nat_entry when grab a new nat entry set.
 o swith to create slab cache in create_node_manager_caches.
 o use GFP_ATOMIC instead of GFP_NOFS to avoid potential long latency.

change log from v2:
 o make comment position more appropriate suggested by Jaegeuk Kim.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:25 -07:00
Jaegeuk Kim
a014e037be f2fs: clean up an unused parameter and assignment
This patch cleans up simple unnecessary codes.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:25 -07:00
Jaegeuk Kim
b97a9b5da8 f2fs: introduce f2fs_do_tmpfile for code consistency
This patch adds f2fs_do_tmpfile to eliminate the redundant init_inode_metadata
flow.
Throught this, we can provide the consistent lock usage, e.g., fi->i_sem,  and
this will enable better debugging stuffs.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:24 -07:00
Chao Yu
50732df02e f2fs: support ->tmpfile()
Add function f2fs_tmpfile() to support O_TMPFILE file creation, and modify logic
of init_inode_metadata to enable linkat temp file.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:24 -07:00
Chao Yu
ca0a81b397 f2fs: avoid to truncate non-updated page partially
After we call find_data_page in truncate_partial_data_page, we could not
guarantee this page is updated or not as error may occurred in lower layer.

We'd better check status of the page to avoid this no updated page be
writebacked to device.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:24 -07:00
Chao Yu
5576cd6ca5 f2fs: avoid unneeded SetPageUptodate in f2fs_write_end
We have already set page update in ->write_begin, so we should remove redundant
SetPageUptodate in ->write_end.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:24 -07:00
Linus Torvalds
191d385f25 f2fs bugfixes for 3.16
o fix normal and recovery path for fallocated regions
 o fix error case mishandling
 o recover renamed fsync inodes correctly
 o fix to get out of infinite loops in balance_dirty_pages
 o fix kernel NULL pointer error
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTvUA5AAoJEEAUqH6CSFDSSKgP/RQ6ryncwwSUilDswq95/VI1
 qXwAlHLBgJkPquld6Klqw//4ot49sThCjBtusxdNqoyB5aSb/xqupJxRvCrJe1RQ
 dRDYP1Mq63phd0cWsjAokfwXuiJQ2Ys/1bq2HguzAhL+7qNVNJEoy27ISUgvh71J
 3v9pTfOqFY/qMxAa1Y91kIat3/27QTCtVQdS1sQM7s8UXlZHIIGyxrSmYWPUGNar
 yVtMNtgMQcEtmekRAjstM0glj3IukosTP1jameXYumEw9bchfIeeLznvtDiEqxKA
 maXtEPA+yrEk5y+RhOiBgaHuV/9uNmrHHvTwoqhMl9Wl+I4RzxpOhD2agRAUFbdn
 rvPKU514tsjhkdelSYf0v2rXf0PxZcZ5XE27TZ+xyhCADKykBdN5ZzTH1OUWjEOA
 TNdPVKv2btpvEdGdmdGzjKIQpPfjLgJLAKqDNNTSQ3u4XlVioMn6IyzEGddz41By
 kSU0Hzj3iBHk+XlqBWSELOd34aCuvqXG/gcE7rWOj0qbJ5T6GKVRTQN5CbqMNutJ
 Udw0JDhImgYxNI5fsy7Stg/5IqOwhp/pDIpLOHXRnYpLb2rJ1kzvgz4B/eJAZCcc
 zmjxZBn1C2GLBJYFDbY1KeR5Tp6WZ9yok+wbXFiO1mpx5RsU7jIL64X/7+Zg0X84
 p3LlN/vBn1nr2DiB3+n/
 =pwxz
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-fixes-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs bugfixes from Jaegeuk Kim:
 "This includes a couple of bug fixes found by xfstests.  In addition,
  one critical bug was reported by Brian Chadwick, which is falling into
  the infinite loop in balance_dirty_pages.  And it turned out due to
  the IO merging policy in f2fs, which was newly merged in 3.16.

   - fix normal and recovery path for fallocated regions
   - fix error case mishandling
   - recover renamed fsync inodes correctly
   - fix to get out of infinite loops in balance_dirty_pages
   - fix kernel NULL pointer error"

* tag 'f2fs-fixes-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  f2fs: avoid to access NULL pointer in issue_flush_thread
  f2fs: check bdi->dirty_exceeded when trying to skip data writes
  f2fs: do checkpoint for the renamed inode
  f2fs: release new entry page correctly in error path of f2fs_rename
  f2fs: fix error path in init_inode_metadata
  f2fs: check lower bound nid value in check_nid_range
  f2fs: remove unused variables in f2fs_sm_info
  f2fs: fix not to allocate unnecessary blocks during fallocate
  f2fs: recover fallocated data and its i_size together
  f2fs: fix to report newly allocate region as extent
2014-07-09 09:46:58 -07:00
Chao Yu
50e1f8d221 f2fs: avoid to access NULL pointer in issue_flush_thread
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=75861

Denis 2014-05-10 11:28:59 UTC reported:
"F2FS-fs (mmcblk0p28): mounting..
 Unable to handle kernel NULL pointer dereference at virtual address 00000018
 ...
 [<c0a2f678>] (_raw_spin_lock+0x3c/0x70) from [<c03a0330>] (issue_flush_thread+0x50/0x17c)
 [<c03a0330>] (issue_flush_thread+0x50/0x17c) from [<c01b4064>] (kthread+0x98/0xa4)
 [<c01b4064>] (kthread+0x98/0xa4) from [<c0108060>] (kernel_thread_exit+0x0/0x8)"

This patch assign cmd_control_info in sm_info before issue_flush_thread is being
created, so this make sure that issue flush thread will have no chance to access
invalid info in fcc.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:59:55 -07:00
Jaegeuk Kim
2743f86554 f2fs: check bdi->dirty_exceeded when trying to skip data writes
If we don't check the current backing device status, balance_dirty_pages can
fall into infinite pausing routine.

This can be occurred when a lot of directories make a small number of dirty
dentry pages including files.

Reported-by: Brian Chadwick <brianchad@westnet.com.au>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:59:45 -07:00
Jaegeuk Kim
b2c0829912 f2fs: do checkpoint for the renamed inode
If an inode is renamed, it should be registered as file_lost_pino to conduct
checkpoint at f2fs_sync_file.
Otherwise, the inode cannot be recovered due to no dent_mark in the following
scenario.

Note that, this scenario is from xfstests/322.

1. create "a"
2. fsync "a"
3. rename "a" to "b"
4. fsync "b"
5. Sudden power-cut

After recovery is done, "b" should be seen.
However, the result shows "a", since the recovery procedure does not enter
recover_dentry due to no dent_mark.

The reason is like below.
- The nid of "a" is checkpointed during #2, f2fs_sync_file.
- The inode page for "b" produced by #3 is written without dent_mark by
sync_node_pages.

So, this patch fixes this bug by assinging file_lost_pino to the "a"'s inode.
If the pino is lost, f2fs_sync_file conducts checkpoint, and then recovers
the latest pino and its dentry information for further recovery.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:59:31 -07:00
Chao Yu
dd4d961fe7 f2fs: release new entry page correctly in error path of f2fs_rename
This patch correct releasing code of new_page to avoid BUG_ON in error patch of
f2fs_rename.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:59:11 -07:00
Chao Yu
90d72459cc f2fs: fix error path in init_inode_metadata
If we fail in this path:
->init_inode_metadata
  ->make_empty_dir
    ->get_new_data_page
      ->grab_cache_page return -ENOMEM

We will bug on in error path of init_inode_metadata when call remove_inode_page
because i_block = 2 (one inode block will be released later & one dentry block).

We should release the dentry block in init_inode_metadata to avoid this BUG_ON,
and avoid leak of dentry block resource, because we never have second chance to
release that block in ->evict_inode as in upper error path we make this inode
'bad'.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:58:50 -07:00
Chao Yu
d6b7d4b31d f2fs: check lower bound nid value in check_nid_range
This patch add lower bound verification for nid in check_nid_range, so nids
reserved like 0, node, meta passed by caller could be checked there.

And then check_nid_range could be used in f2fs_nfs_get_inode for simplifying
code.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:58:08 -07:00
Chao Yu
8bc6f60e3f f2fs: remove unused variables in f2fs_sm_info
Remove unused variables in struct f2fs_sm_info.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:57:57 -07:00
Christoph Hellwig
74adf83f5d nfs: only show Posix ACLs in listxattr if actually present
The big ACL switched nfs to use generic_listxattr, which calls all existing
->list handlers.  Add a custom .listxattr implementation that only lists
the ACLs if they actually are present on the given inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Philippe Troin <phil@fifi.org>
Tested-by: Philippe Troin <phil@fifi.org>
Fixes: 013cdf1088 (nfs: use generic posix ACL infrastructure ...)
Cc: stable@vger.kernel.org # 3.14+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-08 14:36:08 -04:00
Kinglong Mee
c3a4561796 nfsd: Fix bad reserving space for encoding rdattr_error
Introduced by commit 561f0ed498 (nfsd4: allow large readdirs).

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-07 14:16:31 -04:00
Miklos Szeredi
c55a01d360 fuse: avoid scheduling while atomic
As reported by Richard Sharpe, an attempt to use fuse_notify_inval_entry()
triggers complains about scheduling while atomic:

  BUG: scheduling while atomic: fuse.hf/13976/0x10000001

This happens because fuse_notify_inval_entry() attempts to allocate memory
with GFP_KERNEL, holding "struct fuse_copy_state" mapped by kmap_atomic().

Introduced by commit 58bda1da4b "fuse/dev: use atomic maps"

Fix by moving the map/unmap to just cover the actual memcpy operation.

Original patch from Maxim Patlasov <mpatlasov@parallels.com>

Reported-by: Richard Sharpe <realrichardsharpe@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: <stable@vger.kernel.org> # v3.15+
2014-07-07 15:28:51 +02:00
Miklos Szeredi
233a01fa9c fuse: handle large user and group ID
If the number in "user_id=N" or "group_id=N" mount options was larger than
INT_MAX then fuse returned EINVAL.

Fix this to handle all valid uid/gid values.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
2014-07-07 15:28:51 +02:00
Himangi Saraogi
7b3d8bf771 fuse: inode: drop cast
This patch removes the cast on data of type void * as it is not needed.
The following Coccinelle semantic patch was used for making the change:

@r@
expression x;
void* e;
type T;
identifier f;
@@

(
  *((T *)e)
|
  ((T *)x)[...]
|
  ((T *)x)->f
|
- (T *)
  e
)

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-07-07 15:28:51 +02:00
Anand Avati
154210ccb3 fuse: ignore entry-timeout on LOOKUP_REVAL
The following test case demonstrates the bug:

  sh# mount -t glusterfs localhost:meta-test /mnt/one

  sh# mount -t glusterfs localhost:meta-test /mnt/two

  sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; echo stuff > /mnt/one/file
  bash: /mnt/one/file: Stale file handle

  sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; sleep 1; echo stuff > /mnt/one/file

On the second open() on /mnt/one, FUSE would have used the old
nodeid (file handle) trying to re-open it. Gluster is returning
-ESTALE. The ESTALE propagates back to namei.c:filename_lookup()
where lookup is re-attempted with LOOKUP_REVAL. The right
behavior now, would be for FUSE to ignore the entry-timeout and
and do the up-call revalidation. Instead FUSE is ignoring
LOOKUP_REVAL, succeeding the revalidation (because entry-timeout
has not passed), and open() is again retried on the old file
handle and finally the ESTALE is going back to the application.

Fix: if revalidation is happening with LOOKUP_REVAL, then ignore
entry-timeout and always do the up-call.

Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-by: Niels de Vos <ndevos@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
2014-07-07 15:28:51 +02:00
Miklos Szeredi
126b9d4365 fuse: timeout comparison fix
As suggested by checkpatch.pl, use time_before64() instead of direct
comparison of jiffies64 values.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: <stable@vger.kernel.org>
2014-07-07 15:28:50 +02:00
Eric Sandeen
5dd214248f ext4: disable synchronous transaction batching if max_batch_time==0
The mount manpage says of the max_batch_time option,

	This optimization can be turned off entirely
	by setting max_batch_time to 0.

But the code doesn't do that.  So fix the code to do
that.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-07-05 19:18:22 -04:00
Theodore Ts'o
94d4c066a4 ext4: clarify ext4_error message in ext4_mb_generate_buddy_error()
We are spending a lot of time explaining to users what this error
means.  Let's try to improve the message to avoid this problem.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-07-05 19:15:50 -04:00
Theodore Ts'o
ae0f78de2c ext4: clarify error count warning messages
Make it clear that values printed are times, and that it is error
since last fsck. Also add note about fsck version required.

Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2014-07-05 18:40:52 -04:00
Theodore Ts'o
61c219f581 ext4: fix unjournalled bg descriptor while initializing inode bitmap
The first time that we allocate from an uninitialized inode allocation
bitmap, if the block allocation bitmap is also uninitalized, we need
to get write access to the block group descriptor before we start
modifying the block group descriptor flags and updating the free block
count, etc.  Otherwise, there is the potential of a bad journal
checksum (if journal checksums are enabled), and of the file system
becoming inconsistent if we crash at exactly the wrong time.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-07-05 16:28:35 -04:00
Linus Torvalds
b82207b8e8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "We've queued up a few fixes in my for-linus branch"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix crash when starting transaction
  Btrfs: fix btrfs_print_leaf for skinny metadata
  Btrfs: fix race of using total_bytes_pinned
  btrfs: use E2BIG instead of EIO if compression does not help
  btrfs: remove stale comment from btrfs_flush_all_pending_stuffs
  Btrfs: fix use-after-free when cloning a trailing file hole
  btrfs: fix null pointer dereference in btrfs_show_devname when name is null
  btrfs: fix null pointer dereference in clone_fs_devices when name is null
  btrfs: fix nossd and ssd_spread mount option regression
  Btrfs: fix race between balance recovery and root deletion
  Btrfs: atomically set inode->i_flags in btrfs_update_iflags
  btrfs: only unlock block in verify_parent_transid if we locked it
  Btrfs: assert send doesn't attempt to start transactions
  btrfs compression: reuse recently used workspace
  Btrfs: fix crash when mounting raid5 btrfs with missing disks
  btrfs: create sprout should rename fsid on the sysfs as well
  btrfs: dev replace should replace the sysfs entry
  btrfs: dev add should add its sysfs entry
  btrfs: dev delete should remove sysfs entry
  btrfs: rename add_device_membership to btrfs_kobj_add_device
2014-07-04 08:53:53 -07:00
Linus Torvalds
3089f54a79 Driver core fixes for 3.16-rc4
Well, one drivercore fix for kernfs to resolve a reported issue with
 sysfs files being updated from atomic contexts, and another lz4 bugfix
 for testing potential buffer overflows.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iEYEABECAAYFAlO1/FEACgkQMUfUDdst+ynRPACfWcssJKICc2N7g9/0XXGVTjVT
 PwwAnjQ8bjOfu6i2z/lViLtZGjOnzKor
 =qtjB
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
 "Well, one drivercore fix for kernfs to resolve a reported issue with
  sysfs files being updated from atomic contexts, and another lz4 bugfix
  for testing potential buffer overflows"

* tag 'driver-core-3.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  lz4: add overrun checks to lz4_uncompress_unknownoutputsize()
  kernfs: kernfs_notify() must be useable from non-sleepable contexts
2014-07-03 18:53:13 -07:00
Linus Torvalds
0fba687f9b Merge branch 'for-3.16' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfixes from Bruce Fields:
 "By coincidence, two NFSv4 symlink bugs, one introduced in the 3.16 xdr
  encoding rewrite, the other a decoding bug that I think we've had
  since the start but that just doesn't trigger very often"

* 'for-3.16' of git://linux-nfs.org/~bfields/linux:
  nfs: fix nfs4d readlink truncated packet
  nfsd: fix rare symlink decoding bug
2014-07-03 18:33:22 -07:00
Heiko Carstens
058504edd0 fs/seq_file: fallback to vmalloc allocation
There are a couple of seq_files which use the single_open() interface.
This interface requires that the whole output must fit into a single
buffer.

E.g.  for /proc/stat allocation failures have been observed because an
order-4 memory allocation failed due to memory fragmentation.  In such
situations reading /proc/stat is not possible anymore.

Therefore change the seq_file code to fallback to vmalloc allocations
which will usually result in a couple of order-0 allocations and hence
also work if memory is fragmented.

For reference a call trace where reading from /proc/stat failed:

  sadc: page allocation failure: order:4, mode:0x1040d0
  CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1
  [...]
  Call Trace:
    show_stack+0x6c/0xe8
    warn_alloc_failed+0xd6/0x138
    __alloc_pages_nodemask+0x9da/0xb68
    __get_free_pages+0x2e/0x58
    kmalloc_order_trace+0x44/0xc0
    stat_open+0x5a/0xd8
    proc_reg_open+0x8a/0x140
    do_dentry_open+0x1bc/0x2c8
    finish_open+0x46/0x60
    do_last+0x382/0x10d0
    path_openat+0xc8/0x4f8
    do_filp_open+0x46/0xa8
    do_sys_open+0x114/0x1f0
    sysc_tracego+0x14/0x1a

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: David Rientjes <rientjes@google.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Thorsten Diehl <thorsten.diehl@de.ibm.com>
Cc: Andrea Righi <andrea@betterlinux.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stefan Bader <stefan.bader@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-03 09:21:54 -07:00
Heiko Carstens
f74373a5cc /proc/stat: convert to single_open_size()
These two patches are supposed to "fix" failed order-4 memory
allocations which have been observed when reading /proc/stat.  The
problem has been observed on s390 as well as on x86.

To address the problem change the seq_file memory allocations to
fallback to use vmalloc, so that allocations also work if memory is
fragmented.

This approach seems to be simpler and less intrusive than changing
/proc/stat to use an interator.  Also it "fixes" other users as well,
which use seq_file's single_open() interface.

This patch (of 2):

Use seq_file's single_open_size() to preallocate a buffer that is large
enough to hold the whole output, instead of open coding it.  Also
calculate the requested size using the number of online cpus instead of
possible cpus, since the size of the output only depends on the number
of online cpus.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Thorsten Diehl <thorsten.diehl@de.ibm.com>
Cc: Andrea Righi <andrea@betterlinux.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stefan Bader <stefan.bader@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-03 09:21:54 -07:00
Ian Kent
571ff4731b autofs4: fix false positive compile error
On strict build environments we can see:

  fs/autofs4/inode.c: In function 'autofs4_fill_super':
  fs/autofs4/inode.c:312: error: 'pgrp' may be used uninitialized in this function
  make[2]: *** [fs/autofs4/inode.o] Error 1
  make[1]: *** [fs/autofs4] Error 2
  make: *** [fs] Error 2
  make: *** Waiting for unfinished jobs....

This is due to the use of pgrp_set being used to indicate pgrp has has
been set rather than initializing pgrp itself.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-03 09:21:53 -07:00
Filipe Manana
abdd2e80a5 Btrfs: fix crash when starting transaction
Often when starting a transaction we commit the currently running transaction,
which can end up writing block group caches when the current process has its
journal_info set to NULL (and not to a transaction). This makes our assertion
at btrfs_check_data_free_space() (current_journal != NULL) fail, resulting
in a crash/hang. Therefore fix it by setting journal_info.

Two different traces of this issue follow below.

1)

    [51502.241936] BTRFS: assertion failed: current->journal_info, file: fs/btrfs/extent-tree.c, line: 3670
    [51502.242213] ------------[ cut here ]------------
    [51502.242493] kernel BUG at fs/btrfs/ctree.h:3964!
    [51502.242669] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
    (...)
    [51502.244010] Call Trace:
    [51502.244010]  [<ffffffffa02bc025>] btrfs_check_data_free_space+0x395/0x3a0 [btrfs]
    [51502.244010]  [<ffffffffa02c3bdc>] btrfs_write_dirty_block_groups+0x4ac/0x640 [btrfs]
    [51502.244010]  [<ffffffffa0357a6a>] commit_cowonly_roots+0x164/0x226 [btrfs]
    [51502.244010]  [<ffffffffa02d53cd>] btrfs_commit_transaction+0x4ed/0xab0 [btrfs]
    [51502.244010]  [<ffffffff8168ec7b>] ? _raw_spin_unlock+0x2b/0x40
    [51502.244010]  [<ffffffffa02d6259>] start_transaction+0x459/0x620 [btrfs]
    [51502.244010]  [<ffffffffa02d67ab>] btrfs_start_transaction+0x1b/0x20 [btrfs]
    [51502.244010]  [<ffffffffa02d73e1>] __unlink_start_trans+0x31/0xe0 [btrfs]
    [51502.244010]  [<ffffffffa02dea67>] btrfs_unlink+0x37/0xc0 [btrfs]
    [51502.244010]  [<ffffffff811bb054>] ? do_unlinkat+0x114/0x2a0
    [51502.244010]  [<ffffffff811baebc>] vfs_unlink+0xcc/0x150
    [51502.244010]  [<ffffffff811bb1a0>] do_unlinkat+0x260/0x2a0
    [51502.244010]  [<ffffffff811a9ef4>] ? filp_close+0x64/0x90
    [51502.244010]  [<ffffffff810aaea6>] ? trace_hardirqs_on_caller+0x16/0x1e0
    [51502.244010]  [<ffffffff81349cab>] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [51502.244010]  [<ffffffff811be9eb>] SyS_unlinkat+0x1b/0x40
    [51502.244010]  [<ffffffff81698452>] system_call_fastpath+0x16/0x1b
    [51502.244010] Code: 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 89 f1 48 c7 c2 71 13 36 a0 48 89 fe 31 c0 48 c7 c7 b8 43 36 a0 48 89 e5 e8 5d b0 32 e1 <0f> 0b 0f 1f 44 00 00 55 b9 11 00 00 00 48 89 e5 41 55 49 89 f5
    [51502.244010] RIP  [<ffffffffa03575da>] assfail.constprop.88+0x1e/0x20 [btrfs]

2)

    [25405.097230] BTRFS: assertion failed: current->journal_info, file: fs/btrfs/extent-tree.c, line: 3670
    [25405.097488] ------------[ cut here ]------------
    [25405.097767] kernel BUG at fs/btrfs/ctree.h:3964!
    [25405.097940] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
    (...)
    [25405.100008] Call Trace:
    [25405.100008]  [<ffffffffa02bc025>] btrfs_check_data_free_space+0x395/0x3a0 [btrfs]
    [25405.100008]  [<ffffffffa02c3bdc>] btrfs_write_dirty_block_groups+0x4ac/0x640 [btrfs]
    [25405.100008]  [<ffffffffa035755a>] commit_cowonly_roots+0x164/0x226 [btrfs]
    [25405.100008]  [<ffffffffa02d53cd>] btrfs_commit_transaction+0x4ed/0xab0 [btrfs]
    [25405.100008]  [<ffffffff8109c170>] ? bit_waitqueue+0xc0/0xc0
    [25405.100008]  [<ffffffffa02d6259>] start_transaction+0x459/0x620 [btrfs]
    [25405.100008]  [<ffffffffa02d67ab>] btrfs_start_transaction+0x1b/0x20 [btrfs]
    [25405.100008]  [<ffffffffa02e3407>] btrfs_create+0x47/0x210 [btrfs]
    [25405.100008]  [<ffffffffa02d74cc>] ? btrfs_permission+0x3c/0x80 [btrfs]
    [25405.100008]  [<ffffffff811bc63b>] vfs_create+0x9b/0x130
    [25405.100008]  [<ffffffff811bcf19>] do_last+0x849/0xe20
    [25405.100008]  [<ffffffff811b9409>] ? link_path_walk+0x79/0x820
    [25405.100008]  [<ffffffff811bd5b5>] path_openat+0xc5/0x690
    [25405.100008]  [<ffffffff810ab07d>] ? trace_hardirqs_on+0xd/0x10
    [25405.100008]  [<ffffffff811cdcd2>] ? __alloc_fd+0x32/0x1d0
    [25405.100008]  [<ffffffff811be2a3>] do_filp_open+0x43/0xa0
    [25405.100008]  [<ffffffff811cddf1>] ? __alloc_fd+0x151/0x1d0
    [25405.100008]  [<ffffffff811abcfc>] do_sys_open+0x13c/0x230
    [25405.100008]  [<ffffffff810aaea6>] ? trace_hardirqs_on_caller+0x16/0x1e0
    [25405.100008]  [<ffffffff811abe12>] SyS_open+0x22/0x30
    [25405.100008]  [<ffffffff81698452>] system_call_fastpath+0x16/0x1b
    [25405.100008] Code: 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 89 f1 48 c7 c2 51 13 36 a0 48 89 fe 31 c0 48 c7 c7 d0 43 36 a0 48 89 e5 e8 6d b5 32 e1 <0f> 0b 0f 1f 44 00 00 55 b9 11 00 00 00 48 89 e5 41 55 49 89 f5
    [25405.100008] RIP  [<ffffffffa03570ca>] assfail.constprop.88+0x1e/0x20 [btrfs]

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:18 -07:00
Josef Bacik
be2c765dff Btrfs: fix btrfs_print_leaf for skinny metadata
We wouldn't actuall print the extent information if we had a skinny metadata
item, this fixes that.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:16 -07:00
Liu Bo
d288db5dc0 Btrfs: fix race of using total_bytes_pinned
This percpu counter @total_bytes_pinned is introduced to skip unnecessary
operations of 'commit transaction', it accounts for those space we may free
but are stuck in delayed refs.

And we zero out @space_info->total_bytes_pinned every transaction period so
we have a better idea of how much space we'll actually free up by committing
this transaction.  However, we do the 'zero out' part a little earlier, before
we actually unpin space, so we end up returning ENOSPC when we actually have
free space that's just unpinned from committing transaction.

xfstests/generic/074 complained then.

This fixes it by actually accounting the percpu pinned number when 'unpin',
and since it's protected by space_info->lock, the race is gone now.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:15 -07:00
David Sterba
130d5b415a btrfs: use E2BIG instead of EIO if compression does not help
Return codes got updated in 60e1975acb
(btrfs: return errno instead of -1 from compression)
lzo wrapper returns E2BIG in this case, do the same for zlib.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-07-03 07:04:13 -07:00
David Sterba
0a4eaea892 btrfs: remove stale comment from btrfs_flush_all_pending_stuffs
Commit fcebe4562d (Btrfs: rework qgroup
accounting) removed the qgroup accounting after delayed refs.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-07-03 07:04:12 -07:00
Filipe Manana
14f5979633 Btrfs: fix use-after-free when cloning a trailing file hole
The transaction handle was being used after being freed.

Cc: Chris Mason <clm@fb.com>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:10 -07:00
Anand Jain
0aeb8a6e67 btrfs: fix null pointer dereference in btrfs_show_devname when name is null
dev->name is null but missing flag is not set.
Strictly speaking the missing flag should have been set, but there
are more places where code just checks if name is null. For now this
patch does the same.

stack:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000064
IP: [<ffffffffa0228908>] btrfs_show_devname+0x58/0xf0 [btrfs]

[<ffffffff81198879>] show_vfsmnt+0x39/0x130
[<ffffffff81178056>] m_show+0x16/0x20
[<ffffffff8117d706>] seq_read+0x296/0x390
[<ffffffff8115aa7d>] vfs_read+0x9d/0x160
[<ffffffff8115b549>] SyS_read+0x49/0x90
[<ffffffff817abe52>] system_call_fastpath+0x16/0x1b

reproducer:
mkfs.btrfs -draid1 -mraid1 /dev/sdg1 /dev/sdg2
btrfstune -S 1 /dev/sdg1
modprobe -r btrfs && modprobe btrfs
mount -o degraded /dev/sdg1 /btrfs
btrfs dev add /dev/sdg3 /btrfs

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:09 -07:00
Anand Jain
e755f78086 btrfs: fix null pointer dereference in clone_fs_devices when name is null
when one of the device path is missing btrfs_device name is null. So this
patch will check for that.

stack:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: [<ffffffff812e18c0>] strlen+0x0/0x30
[<ffffffffa01cd92a>] ? clone_fs_devices+0xaa/0x160 [btrfs]
[<ffffffffa01cdcf7>] btrfs_init_new_device+0x317/0xca0 [btrfs]
[<ffffffff81155bca>] ? __kmalloc_track_caller+0x15a/0x1a0
[<ffffffffa01d6473>] btrfs_ioctl+0xaa3/0x2860 [btrfs]
[<ffffffff81132a6c>] ? handle_mm_fault+0x48c/0x9c0
[<ffffffff81192a61>] ? __blkdev_put+0x171/0x180
[<ffffffff817a784c>] ? __do_page_fault+0x4ac/0x590
[<ffffffff81193426>] ? blkdev_put+0x106/0x110
[<ffffffff81179175>] ? mntput+0x35/0x40
[<ffffffff8116d4b0>] do_vfs_ioctl+0x460/0x4a0
[<ffffffff8115c72e>] ? ____fput+0xe/0x10
[<ffffffff81068033>] ? task_work_run+0xb3/0xd0
[<ffffffff8116d547>] SyS_ioctl+0x57/0x90
[<ffffffff817a793e>] ? do_page_fault+0xe/0x10
[<ffffffff817abe52>] system_call_fastpath+0x16/0x1b

reproducer:
mkfs.btrfs -draid1 -mraid1 /dev/sdg1 /dev/sdg2
btrfstune -S 1 /dev/sdg1
modprobe -r btrfs && modprobe btrfs
mount -o degraded /dev/sdg1 /btrfs
btrfs dev add /dev/sdg3 /btrfs

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:07 -07:00
Eric Sandeen
2aa06a35d0 btrfs: fix nossd and ssd_spread mount option regression
The commit

0780253 btrfs: Cleanup the btrfs_parse_options for remount.

broke ssd options quite badly; it stopped making ssd_spread
imply ssd, and it made "nossd" unsettable.

Put things back at least as well as they were before
(though ssd mount option handling is still pretty odd:
# mount -o "nossd,ssd_spread" works?)

Reported-by: Roman Mamedov <rm@romanrm.net>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:06 -07:00
Wang Shilong
5f3164813b Btrfs: fix race between balance recovery and root deletion
Balance recovery is called when RW mounting or remounting from
RO to RW, it is called to finish roots merging.

When doing balance recovery, relocation root's corresponding
fs root(whose root refs is 0) might be destroyed by cleaner
thread, this will make btrfs fail to mount.

Fix this problem by holding @cleaner_mutex when doing balance
recovery.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:04 -07:00
Filipe Manana
3cc7939255 Btrfs: atomically set inode->i_flags in btrfs_update_iflags
This change is based on the corresponding recent change for ext4:

  ext4: atomically set inode->i_flags in ext4_set_inode_flags()

That has the following commit message that applies to btrfs as well:

  "Use cmpxchg() to atomically set i_flags instead of clearing out the
   S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the
   EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race
   where an immutable file has the immutable flag cleared for a brief
   window of time."

Replacing EXT4_IMMUTABLE_FL and EXT4_APPEND_FL with BTRFS_INODE_IMMUTABLE
and BTRFS_INODE_APPEND, respectively.

Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:03:23 -07:00
Avi Kivity
69bbd9c7b9 nfs: fix nfs4d readlink truncated packet
XDR requires 4-byte alignment; nfs4d READLINK reply writes out the padding,
but truncates the packet to the padding-less size.

Fix by taking the padding into consideration when truncating the packet.

Symptoms:

	# ll /mnt/
	ls: cannot read symbolic link /mnt/test: Input/output error
	total 4
	-rw-r--r--. 1 root root  0 Jun 14 01:21 123456
	lrwxrwxrwx. 1 root root  6 Jul  2 03:33 test
	drwxr-xr-x. 1 root root  0 Jul  2 23:50 tmp
	drwxr-xr-x. 1 root root 60 Jul  2 23:44 tree

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Fixes: 476a7b1f4b (nfsd4: don't treat readlink like a zero-copy operation)
Reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-02 17:37:13 -04:00
Tejun Heo
ecca47ce82 kernfs: kernfs_notify() must be useable from non-sleepable contexts
d911d98748 ("kernfs: make kernfs_notify() trigger inotify events
too") added fsnotify triggering to kernfs_notify() which requires a
sleepable context.  There are already existing users of
kernfs_notify() which invoke it from an atomic context and in general
it's silly to require a sleepable context for triggering a
notification.

The following is an invalid context bug triggerd by md invoking
sysfs_notify() from IO completion path.

 BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
 in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
 2 locks held by swapper/1/0:
  #0:  (&(&vblk->vq_lock)->rlock){-.-...}, at: [<ffffffffa0039042>] virtblk_done+0x42/0xe0 [virtio_blk]
  #1:  (&(&bitmap->counts.lock)->rlock){-.....}, at: [<ffffffff81633718>] bitmap_endwrite+0x68/0x240
 irq event stamp: 33518
 hardirqs last  enabled at (33515): [<ffffffff8102544f>] default_idle+0x1f/0x230
 hardirqs last disabled at (33516): [<ffffffff818122ed>] common_interrupt+0x6d/0x72
 softirqs last  enabled at (33518): [<ffffffff810a1272>] _local_bh_enable+0x22/0x50
 softirqs last disabled at (33517): [<ffffffff810a29e0>] irq_enter+0x60/0x80
 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.16.0-0.rc2.git2.1.fc21.x86_64 #1
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  0000000000000000 f90db13964f4ee05 ffff88007d403b80 ffffffff81807b4c
  0000000000000000 ffff88007d403ba8 ffffffff810d4f14 0000000000000000
  0000000000441800 ffff880078fa1780 ffff88007d403c38 ffffffff8180caf2
 Call Trace:
  <IRQ>  [<ffffffff81807b4c>] dump_stack+0x4d/0x66
  [<ffffffff810d4f14>] __might_sleep+0x184/0x240
  [<ffffffff8180caf2>] mutex_lock_nested+0x42/0x440
  [<ffffffff812d76a0>] kernfs_notify+0x90/0x150
  [<ffffffff8163377c>] bitmap_endwrite+0xcc/0x240
  [<ffffffffa00de863>] close_write+0x93/0xb0 [raid1]
  [<ffffffffa00df029>] r1_bio_write_done+0x29/0x50 [raid1]
  [<ffffffffa00e0474>] raid1_end_write_request+0xe4/0x260 [raid1]
  [<ffffffff813acb8b>] bio_endio+0x6b/0xa0
  [<ffffffff813b46c4>] blk_update_request+0x94/0x420
  [<ffffffff813bf0ea>] blk_mq_end_io+0x1a/0x70
  [<ffffffffa00392c2>] virtblk_request_done+0x32/0x80 [virtio_blk]
  [<ffffffff813c0648>] __blk_mq_complete_request+0x88/0x120
  [<ffffffff813c070a>] blk_mq_complete_request+0x2a/0x30
  [<ffffffffa0039066>] virtblk_done+0x66/0xe0 [virtio_blk]
  [<ffffffffa002535a>] vring_interrupt+0x3a/0xa0 [virtio_ring]
  [<ffffffff81116177>] handle_irq_event_percpu+0x77/0x340
  [<ffffffff8111647d>] handle_irq_event+0x3d/0x60
  [<ffffffff81119436>] handle_edge_irq+0x66/0x130
  [<ffffffff8101c3e4>] handle_irq+0x84/0x150
  [<ffffffff818146ad>] do_IRQ+0x4d/0xe0
  [<ffffffff818122f2>] common_interrupt+0x72/0x72
  <EOI>  [<ffffffff8105f706>] ? native_safe_halt+0x6/0x10
  [<ffffffff81025454>] default_idle+0x24/0x230
  [<ffffffff81025f9f>] arch_cpu_idle+0xf/0x20
  [<ffffffff810f5adc>] cpu_startup_entry+0x37c/0x7b0
  [<ffffffff8104df1b>] start_secondary+0x25b/0x300

This patch fixes it by punting the notification delivery through a
work item.  This ends up adding an extra pointer to kernfs_elem_attr
enlarging kernfs_node by a pointer, which is not ideal but not a very
big deal either.  If this turns out to be an actual issue, we can move
kernfs_elem_attr->size to kernfs_node->iattr later.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-02 09:32:09 -07:00
Li Zefan
4e26445faa kernfs: introduce kernfs_pin_sb()
kernfs_pin_sb() tries to get a refcnt of the superblock.

This will be used by cgroupfs.

v2:
- make kernfs_pin_sb() return the superblock.
- drop kernfs_drop_sb().

tj: Updated the comment a bit.

[ This is a prerequisite for a bugfix. ]
Cc: <stable@vger.kernel.org> # 3.15
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-06-30 10:16:25 -04:00
Linus Torvalds
16874b2cb8 Fix a regression when trying to compile ext4 on older versions gcc.
Fix a number of miscellaneous bugs for punch hole as well as a
 long-standing potential double buffer head release when failing a
 block allocation for an indirect-mapped file.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJTsKRpAAoJENNvdpvBGATwf3YQAJoGNvrxd6BPpB0bSTLRdh+k
 0IDShwB9PpSJulIHx8xIOpFIvpk5T0E1Ho+UqUnpMImRiqueYbFfOPca4lYsRBqW
 r9TNmm0O8Bf8K0j8YUaV2P8BGNJuiCzv7YcCbSHt1/eP6/InyfJWA17hYD6oxBPF
 ZrrcZemDH863KhYIF55sbx9UvBLz515ifd4kHN9pSVbV3pJT5/zPiRk/wujQZTrX
 v0z5pe5i8GfYoBMmdCNak3a1YoXdf+FUdID+pvGWtfzs8AG8nS7jb8C1zkfgUWtC
 zau9yBnFHKlBdmoPrFJIpvDT/2rBRks6g6uM2Jsc+YS+zHCXi0xJR1EaPIqkRJbf
 vWHNSogzOKetyFZFML1Dg1cjXVEnPusOsyhvcXTFwb3n/YY1NBFdLiVgNbTzQ7aU
 X7P4M+ca2yib2rQos5Ltipk4ju9eT4d5qsf+QSaaryYeUHg+8sNaqd82y42eita1
 Dgg7IpyKWdNo+lBHEtDSV5Gli4oRrXddj7Yvrib7nobf+FTi4cpbbu0PfY5qXfDV
 vKrsvIiJcPJYsvi+USH3fvv2m6UaZd+gDo/RObV06tSNTZWqGzalSeviaJZ+cfEP
 v05ODGLUNvg+thhvBsYAGNeaV/SOYztmNA6v0qfI+OwscH8ycGeb6R98Kil/hd97
 P3i12fhP7tkffNTNPM/E
 =X4vY
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "Fix a regression when trying to compile ext4 on older versions gcc.

  Fix a number of miscellaneous bugs for punch hole as well as a
  long-standing potential double buffer head release when failing a
  block allocation for an indirect-mapped file"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Fix hole punching for files with indirect blocks
  ext4: Fix block zeroing when punching holes in indirect block files
  ext4: decrement free clusters/inodes counters when block group declared bad
  fs/mbcache: replace __builtin_log2() with ilog2()
  ext4: Fix buffer double free in ext4_alloc_branch()
2014-06-29 19:20:43 -07:00
Josef Bacik
472b909ff6 btrfs: only unlock block in verify_parent_transid if we locked it
This is a regression from my patch a26e8c9f75, we
need to only unlock the block if we were the one who locked it.  Otherwise this
will trip BUG_ON()'s in locking.c  Thanks,

cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:47 -07:00
Filipe Manana
46c4e71e9b Btrfs: assert send doesn't attempt to start transactions
When starting a transaction just assert that current->journal_info
doesn't contain a send transaction stub, since send isn't supposed
to start transactions and when it finishes (either successfully or
not) it's supposed to set current->journal_info to NULL.

This is motivated by the change titled:

    Btrfs: fix crash when starting transaction

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:46 -07:00
Sergey Senozhatsky
c39aa7056f btrfs compression: reuse recently used workspace
Add compression `workspace' in free_workspace() to
`idle_workspace' list head, instead of tail. So we have
better chances to reuse most recently used `workspace'.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:46 -07:00
Liu Bo
5588383ece Btrfs: fix crash when mounting raid5 btrfs with missing disks
The reproducer is

$ mkfs.btrfs D1 D2 D3 -mraid5
$ mkfs.ext4 D2 && mkfs.ext4 D3
$ mount D1 /btrfs -odegraded

-------------------

[   87.672992] ------------[ cut here ]------------
[   87.673845] kernel BUG at fs/btrfs/raid56.c:1828!
...
[   87.673845] RIP: 0010:[<ffffffff813efc7e>]  [<ffffffff813efc7e>] __raid_recover_end_io+0x4ae/0x4d0
...
[   87.673845] Call Trace:
[   87.673845]  [<ffffffff8116bbc6>] ? mempool_free+0x36/0xa0
[   87.673845]  [<ffffffff813f0255>] raid_recover_end_io+0x75/0xa0
[   87.673845]  [<ffffffff81447c5b>] bio_endio+0x5b/0xa0
[   87.673845]  [<ffffffff81447cb2>] bio_endio_nodec+0x12/0x20
[   87.673845]  [<ffffffff81374621>] end_workqueue_fn+0x41/0x50
[   87.673845]  [<ffffffff813ad2aa>] normal_work_helper+0xca/0x2c0
[   87.673845]  [<ffffffff8108ba2b>] process_one_work+0x1eb/0x530
[   87.673845]  [<ffffffff8108b9c9>] ? process_one_work+0x189/0x530
[   87.673845]  [<ffffffff8108c15b>] worker_thread+0x11b/0x4f0
[   87.673845]  [<ffffffff8108c040>] ? rescuer_thread+0x290/0x290
[   87.673845]  [<ffffffff810939c4>] kthread+0xe4/0x100
[   87.673845]  [<ffffffff810938e0>] ? kthread_create_on_node+0x220/0x220
[   87.673845]  [<ffffffff817e7c7c>] ret_from_fork+0x7c/0xb0
[   87.673845]  [<ffffffff810938e0>] ? kthread_create_on_node+0x220/0x220

-------------------

It's because that we miscalculate @rbio->bbio->error so that it doesn't
reach maximum of tolerable errors while it should have.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Satoru Takeuchi<takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:45 -07:00
Anand Jain
b2373f255c btrfs: create sprout should rename fsid on the sysfs as well
Creating sprout will change the fsid of the mounted root.
do the same on the sysfs as well.

reproducer:
 mount /dev/sdb /btrfs (seed disk)
 btrfs dev add /dev/sdc /btrfs
 mount -o rw,remount /btrfs
 btrfs dev del /dev/sdb /btrfs
 mount /dev/sdb /btrfs

Error:
kobject_add_internal failed for fe350492-dc28-4051-a601-e017b17e6145 with -EEXIST, don't try to register things with the same name in the same directory.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:44 -07:00
Anand Jain
49c6f736f3 btrfs: dev replace should replace the sysfs entry
when we replace the device its corresponding sysfs
entry has to be replaced as well

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:44 -07:00
Anand Jain
0d39376aa2 btrfs: dev add should add its sysfs entry
we would need the device links to be created,
when device is added.

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:43 -07:00
Anand Jain
99994cde9c btrfs: dev delete should remove sysfs entry
when we delete the device from the mounted btrfs,
we would need its corresponding sysfs enty to
be removed as well.

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:42 -07:00
Anand Jain
9b4eaf43f4 btrfs: rename add_device_membership to btrfs_kobj_add_device
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:41 -07:00
Tejun Heo
9a1049da9b percpu-refcount: require percpu_ref to be exited explicitly
Currently, a percpu_ref undoes percpu_ref_init() automatically by
freeing the allocated percpu area when the percpu_ref is killed.
While seemingly convenient, this has the following niggles.

* It's impossible to re-init a released reference counter without
  going through re-allocation.

* In the similar vein, it's impossible to initialize a percpu_ref
  count with static percpu variables.

* We need and have an explicit destructor anyway for failure paths -
  percpu_ref_cancel_init().

This patch removes the automatic percpu counter freeing in
percpu_ref_kill_rcu() and repurposes percpu_ref_cancel_init() into a
generic destructor now named percpu_ref_exit().  percpu_ref_destroy()
is considered but it gets confusing with percpu_ref_kill() while
"exit" clearly indicates that it's the counterpart of
percpu_ref_init().

All percpu_ref_cancel_init() users are updated to invoke
percpu_ref_exit() instead and explicit percpu_ref_exit() calls are
added to the destruction path of all percpu_ref users.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Li Zefan <lizefan@huawei.com>
2014-06-28 08:10:14 -04:00
Tejun Heo
55c6c814ae percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc()
ioctx_alloc() reaches inside percpu_ref and directly frees
->pcpu_count in its failure path, which is quite gross.  percpu_ref
has been providing a proper interface to do this,
percpu_ref_cancel_init(), for quite some time now.  Let's use that
instead.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Kent Overstreet <kmo@daterainc.com>
2014-06-28 08:10:12 -04:00
J. Bruce Fields
76f47128f9 nfsd: fix rare symlink decoding bug
An NFS operation that creates a new symlink includes the symlink data,
which is xdr-encoded as a length followed by the data plus 0 to 3 bytes
of zero-padding as required to reach a 4-byte boundary.

The vfs, on the other hand, wants null-terminated data.

The simple way to handle this would be by copying the data into a newly
allocated buffer with space for the final null.

The current nfsd_symlink code tries to be more clever by skipping that
step in the (likely) case where the byte following the string is already
0.

But that assumes that the byte following the string is ours to look at.
In fact, it might be the first byte of a page that we can't read, or of
some object that another task might modify.

Worse, the NFSv4 code tries to fix the problem by actually writing to
that byte.

In the NFSv2/v3 cases this actually appears to be safe:

	- nfs3svc_decode_symlinkargs explicitly null-terminates the data
	  (after first checking its length and copying it to a new
	  page).
	- NFSv2 limits symlinks to 1k.  The buffer holding the rpc
	  request is always at least a page, and the link data (and
	  previous fields) have maximum lengths that prevent the request
	  from reaching the end of a page.

In the NFSv4 case the CREATE op is potentially just one part of a long
compound so can end up on the end of a page if you're unlucky.

The minimal fix here is to copy and null-terminate in the NFSv4 case.
The nfsd_symlink() interface here seems too fragile, though.  It should
really either do the copy itself every time or just require a
null-terminated string.

Reported-by: Jeff Layton <jlayton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-27 16:10:46 -04:00
Jan Kara
a93cd4cf86 ext4: Fix hole punching for files with indirect blocks
Hole punching code for files with indirect blocks wrongly computed
number of blocks which need to be cleared when traversing the indirect
block tree. That could result in punching more blocks than actually
requested and thus effectively cause a data loss. For example:

fallocate -n -p 10240000 4096

will punch the range 10240000 - 12632064 instead of the range 1024000 -
10244096. Fix the calculation.

CC: stable@vger.kernel.org
Fixes: 8bad6fc813
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-06-26 12:30:54 -04:00
Jan Kara
77ea2a4ba6 ext4: Fix block zeroing when punching holes in indirect block files
free_holes_block() passed local variable as a block pointer
to ext4_clear_blocks(). Thus ext4_clear_blocks() zeroed out this local
variable instead of proper place in inode / indirect block. We later
zero out proper place in inode / indirect block but don't dirty the
inode / buffer again which can lead to subtle issues (some changes e.g.
to inode can be lost).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-06-26 12:28:57 -04:00
Namjae Jeon
e43bb4e612 ext4: decrement free clusters/inodes counters when block group declared bad
We should decrement free clusters counter when block bitmap is marked
as corrupt and free inodes counter when the allocation bitmap is
marked as corrupt to avoid misunderstanding due to incorrect available
size in statfs result.  User can get immediately ENOSPC error from
write begin without reaching for the writepages.

Cc: Darrick J. Wong<darrick.wong@oracle.com>
Reported-by: Amit Sahrawat <amit.sahrawat83@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
2014-06-26 10:11:53 -04:00
Linus Torvalds
d7933ab727 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
 "Small set of misc cifs/smb3 fixes"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  [CIFS] fix mount failure with broken pathnames when smb3 mount with mapchars option
  cifs: revalidate mapping prior to satisfying read_iter request with cache=loose
  fs/cifs: fix regression in cifs_create_mf_symlink()
2014-06-25 21:47:28 -07:00
Linus Torvalds
ec71feae06 NFS client fixes for Linux 3.16
Highlights include:
 
 - Stable fix for a data corruption case due to incorrect cache validation
 - Fix a couple of false positive cache invalidations
 - Fix NFSv4 security negotiation issues
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTq1GIAAoJEGcL54qWCgDyCfUP/3S7Py5Gocdqvb7FBPpCWtsb
 PJlv1RjC4ngT+BJpBeDSOFEcZerfeQAwguL5kEgIjdyKmsAjVjIF7ThagNQK/0yr
 qpeKh2EtbAipjjXVmul7saG3Ucuv/PggEhqGl9iJK0QyPdmnr30cHGHHt3kCIPGE
 e4AkaCN4ZuXBdDOO4YpKzIl6wQPb0Gjwps1boW4INCvnBvK6Yno26Q6ilDf92gJE
 hisEn0l8l09C6t2jZKP7daCyGForTYYlMxIbmjmQhsMEwnh1kmfpr/xuAQP2bflr
 14OFrNbrZg3p4ucp8g7EzgS1Z5m/Ism0xNKfO4LgNwUobSgbvvvScAC3/LP2HIIk
 RXuRhgb8u6pbWQRqq4XznB+csh6DGR/ui2PhonK4lJDaJxcU3bnFlhTgoC0GSyCa
 Wbbdv+nhXhw5Xi9jsma6PW/CnHJH6sk/8KviRPOpC+RsCg+X41vTHzC4XvWbentw
 aZGkNuWAnBKMyswu08E4+ScFQxToSB6ju4RjOsTTMleC0ewWXD3Y6FL+B5p4crPO
 L05KCLkP+SeRxpakOM3e/x/bkVOa+DBna7foXUZ9snWybYoOmuxOkJgJT7bxrYaA
 /3N0e/WUUgPR/bhdydMJSRo6DchKj+5GRSpx8FB9eMqqp8mNE+I61/Kq0dFEbtPQ
 1IQCFT4w1PEegDpwjb0L
 =o+QR
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fixes from Trond Myklebust:
 "Highlights include:

   - Stable fix for a data corruption case due to incorrect cache
     validation
   - Fix a couple of false positive cache invalidations
   - Fix NFSv4 security negotiation issues"

* tag 'nfs-for-3.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: test SECINFO RPC_AUTH_GSS pseudoflavors for support
  NFS Return -EPERM if no supported or matching SECINFO flavor
  NFS check the return of nfs4_negotiate_security in nfs4_submount
  NFS: Don't mark the data cache as invalid if it has been flushed
  NFS: Clear NFS_INO_REVAL_PAGECACHE when we update the file size
  nfs: Fix cache_validity check in nfs_write_pageuptodate()
2014-06-25 20:06:06 -07:00
T Makphaibulchoke
ec7756ae15 fs/mbcache: replace __builtin_log2() with ilog2()
Fix compiler error with some gcc version(s) that do not
support __builtin_log2() by replacing __builtin_log2() with
ilog2().

Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Maciej W. Rozycki <macro@linux-mips.org>
2014-06-25 22:08:29 -04:00
Andy Adamson
66b0686049 NFSv4: test SECINFO RPC_AUTH_GSS pseudoflavors for support
Fix nfs4_negotiate_security to create an rpc_clnt used to test each SECINFO
returned pseudoflavor. Check credential creation  (and gss_context creation)
which is important for RPC_AUTH_GSS pseudoflavors which can fail for multiple
reasons including mis-configuration.

Don't call nfs4_negotiate in nfs4_submount as it was just called by
nfs4_proc_lookup_mountpoint (nfs4_proc_lookup_common)

Signed-off-by: Andy Adamson <andros@netapp.com>
[Trond: fix corrupt return value from nfs_find_best_sec()]
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24 18:46:58 -04:00
Andy Adamson
8445cd3528 NFS Return -EPERM if no supported or matching SECINFO flavor
Do not return RPC_AUTH_UNIX if SEINFO reply tests fail. This
prevents an infinite loop of NFS4ERR_WRONGSEC for non RPC_AUTH_UNIX mounts.

Without this patch, a mount with no sec= option to a server
that does not include RPC_AUTH_UNIX in the
SECINFO return can be presented with an attemtp to use RPC_AUTH_UNIX
which will result in an NFS4ERR_WRONG_SEC which will prompt the SECINFO
call which will again try RPC_AUTH_UNIX....

Signed-off-by: Andy Adamson <andros@netapp.com>
Tested-By: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24 18:46:58 -04:00
Andy Adamson
57bbe3d7c1 NFS check the return of nfs4_negotiate_security in nfs4_submount
Signed-off-by: Andy Adamson <andros@netapp.com>
Tested-By: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24 18:46:57 -04:00
Trond Myklebust
6edf96097b NFS: Don't mark the data cache as invalid if it has been flushed
Now that we have functions such as nfs_write_pageuptodate() that use
the cache_validity flags to check if the data cache is valid or not,
it is a little more important to keep the flags in sync with the
state of the data cache.
In particular, we'd like to ensure that if the data cache is empty, we
don't start marking it as needing revalidation.

Reported-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24 18:46:57 -04:00
Trond Myklebust
f2467b6f64 NFS: Clear NFS_INO_REVAL_PAGECACHE when we update the file size
In nfs_update_inode(), if the change attribute is seen to change on
the server, then we set NFS_INO_REVAL_PAGECACHE in order to make
sure that we check the file size.
However, if we also update the file size in the same function, we
don't need to check it again. So make sure that we clear the
NFS_INO_REVAL_PAGECACHE that was set earlier.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24 18:46:57 -04:00
Scott Mayhew
18dd78c427 nfs: Fix cache_validity check in nfs_write_pageuptodate()
NFS_INO_INVALID_DATA cannot be ignored, even if we have a delegation.

We're still having some problems with data corruption when multiple
clients are appending to a file and those clients are being granted
write delegations on open.

To reproduce:

Client A:
vi /mnt/`hostname -s`
while :; do echo "XXXXXXXXXXXXXXX" >>/mnt/file; sleep $(( $RANDOM % 5 )); done

Client B:
vi /mnt/`hostname -s`
while :; do echo "YYYYYYYYYYYYYYY" >>/mnt/file; sleep $(( $RANDOM % 5 )); done

What's happening is that in nfs_update_inode() we're recognizing that
the file size has changed and we're setting NFS_INO_INVALID_DATA
accordingly, but then we ignore the cache_validity flags in
nfs_write_pageuptodate() because we have a delegation.  As a result,
in nfs_updatepage() we're extending the write to cover the full page
even though we've not read in the data to begin with.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Cc: <stable@vger.kernel.org> # v3.11+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24 18:46:56 -04:00
Benjamin LaHaise
edfbbf388f aio: fix kernel memory disclosure in io_getevents() introduced in v3.10
A kernel memory disclosure was introduced in aio_read_events_ring() in v3.10
by commit a31ad380be.  The changes made to
aio_read_events_ring() failed to correctly limit the index into
ctx->ring_pages[], allowing an attacked to cause the subsequent kmap() of
an arbitrary page with a copy_to_user() to copy the contents into userspace.
This vulnerability has been assigned CVE-2014-0206.  Thanks to Mateusz and
Petr for disclosing this issue.

This patch applies to v3.12+.  A separate backport is needed for 3.10/3.11.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: stable@vger.kernel.org
2014-06-24 13:46:01 -04:00
Benjamin LaHaise
f8567a3845 aio: fix aio request leak when events are reaped by userspace
The aio cleanups and optimizations by kmo that were merged into the 3.10
tree added a regression for userspace event reaping.  Specifically, the
reference counts are not decremented if the event is reaped in userspace,
leading to the application being unable to submit further aio requests.
This patch applies to 3.12+.  A separate backport is required for 3.10/3.11.
This issue was uncovered as part of CVE-2014-0206.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
2014-06-24 13:32:27 -04:00
Steve French
ce36d9ab3b [CIFS] fix mount failure with broken pathnames when smb3 mount with mapchars option
When we SMB3 mounted with mapchars (to allow reserved characters : \ / > < * ?
via the Unicode Windows to POSIX remap range) empty paths
(eg when we open "" to query the root of the SMB3 directory on mount) were not
null terminated so we sent garbarge as a path name on empty paths which caused
SMB2/SMB2.1/SMB3 mounts to fail when mapchars was specified.  mapchars is
particularly important since Unix Extensions for SMB3 are not supported (yet)

Signed-off-by: Steve French <smfrench@gmail.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: David Disseldorp <ddiss@suse.de>
2014-06-24 08:10:24 -05:00
Xue jiufei
ac4fef4d23 ocfs2/dlm: do not purge lockres that is queued for assert master
When workqueue is delayed, it may occur that a lockres is purged while it
is still queued for master assert.  it may trigger BUG() as follows.

N1                                         N2
dlm_get_lockres()
->dlm_do_master_requery
                                  is the master of lockres,
                                  so queue assert_master work

                                  dlm_thread() start running
                                  and purge the lockres

                                  dlm_assert_master_worker()
                                  send assert master message
                                  to other nodes
receiving the assert_master
message, set master to N2

dlmlock_remote() send create_lock message to N2, but receive DLM_IVLOCKID,
if it is RECOVERY lockres, it triggers the BUG().

Another BUG() is triggered when N3 become the new master and send
assert_master to N1, N1 will trigger the BUG() because owner doesn't
match.  So we should not purge lockres when it is queued for assert
master.

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:45 -07:00
jiangyiwen
b9aaac5a6b ocfs2: do not return DLM_MIGRATE_RESPONSE_MASTERY_REF to avoid endless,loop during umount
The following case may lead to endless loop during umount.

node A         node B               node C       node D
umount volume,
migrate lockres1
to B
                                                 want to lock lockres1,
                                                 send
                                                 MASTER_REQUEST_MSG
                                                 to C
                                    init block mle
               send
               MIGRATE_REQUEST_MSG
               to C
                                    find a block
                                    mle, and then
                                    return
                                    DLM_MIGRATE_RESPONSE_MASTERY_REF
                                    to B
               set C in refmap
                                    umount successfully
               try to umount, endless
               loop occurs when migrate
               lockres1 since C is in
               refmap

So we can fix this endless loop case by only returning
DLM_MIGRATE_RESPONSE_MASTERY_REF if it has a mastery mle when receiving
MIGRATE_REQUEST_MSG.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: jiangyiwen <jiangyiwen@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Xue jiufei <xuejiufei@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:45 -07:00
jiangyiwen
595297a8f9 ocfs2: manually do the iput once ocfs2_add_entry failed in ocfs2_symlink and ocfs2_mknod
When the call to ocfs2_add_entry() failed in ocfs2_symlink() and
ocfs2_mknod(), iput() will not be called during dput(dentry) because no
d_instantiate(), and this will lead to umount hung.

Signed-off-by: jiangyiwen <jiangyiwen@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:45 -07:00
Yiwen Jiang
f7a14f32e7 ocfs2: fix a tiny race when running dirop_fileop_racer
When running dirop_fileop_racer we found a dead lock case.

2 nodes, say Node A and Node B, mount the same ocfs2 volume.  Create
/race/16/1 in the filesystem, and let the inode number of dir 16 is less
than the inode number of dir race.

Node A                            Node B
mv /race/16/1 /race/
                                  right after Node A has got the
                                  EX mode of /race/16/, and tries to
                                  get EX mode of /race
                                  ls /race/16/

In this case, Node A has got the EX mode of /race/16/, and wants to get EX
mode of /race/.  Node B has got the PR mode of /race/, and wants to get
the PR mode of /race/16/.  Since EX and PR are mutually exclusive, dead
lock happens.

This patch fixes this case by locking in ancestor order before trying
inode number order.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:45 -07:00
Xue jiufei
a270c6d3c0 ocfs2/dlm: fix misuse of list_move_tail() in dlm_run_purge_list()
When a lockres in purge list but is still in use, it should be moved to
the tail of purge list.  dlm_thread will continue to check next lockres in
purge list.  However, code list_move_tail(&dlm->purge_list,
&lockres->purge) will do *no* movements, so dlm_thread will purge the same
lockres in this loop again and again.  If it is in use for a long time,
other lockres will not be processed.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:45 -07:00
Wengang Wang
8a8ad1c2f6 ocfs2: refcount: take rw_lock in ocfs2_reflink
This patch tries to fix this crash:

 #5 [ffff88003c1cd690] do_invalid_op at ffffffff810166d5
 #6 [ffff88003c1cd730] invalid_op at ffffffff8159b2de
    [exception RIP: ocfs2_direct_IO_get_blocks+359]
    RIP: ffffffffa05dfa27  RSP: ffff88003c1cd7e8  RFLAGS: 00010202
    RAX: 0000000000000000  RBX: ffff88003c1cdaa8  RCX: 0000000000000000
    RDX: 000000000000000c  RSI: ffff880027a95000  RDI: ffff88003c79b540
    RBP: ffff88003c1cd858   R8: 0000000000000000   R9: ffffffff815f6ba0
    R10: 00000000000001c9  R11: 00000000000001c9  R12: ffff88002d271500
    R13: 0000000000000001  R14: 0000000000000000  R15: 0000000000001000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff88003c1cd860] do_direct_IO at ffffffff811cd31b
 #8 [ffff88003c1cd950] direct_IO_iovec at ffffffff811cde9c
 #9 [ffff88003c1cd9b0] do_blockdev_direct_IO at ffffffff811ce764
#10 [ffff88003c1cdb80] __blockdev_direct_IO at ffffffff811ce7cc
#11 [ffff88003c1cdbb0] ocfs2_direct_IO at ffffffffa05df756 [ocfs2]
#12 [ffff88003c1cdbe0] generic_file_direct_write_iter at ffffffff8112f935
#13 [ffff88003c1cdc40] ocfs2_file_write_iter at ffffffffa0600ccc [ocfs2]
#14 [ffff88003c1cdd50] do_aio_write at ffffffff8119126c
#15 [ffff88003c1cddc0] aio_rw_vect_retry at ffffffff811d9bb4
#16 [ffff88003c1cddf0] aio_run_iocb at ffffffff811db880
#17 [ffff88003c1cde30] io_submit_one at ffffffff811dc238
#18 [ffff88003c1cde80] do_io_submit at ffffffff811dc437
#19 [ffff88003c1cdf70] sys_io_submit at ffffffff811dc530
#20 [ffff88003c1cdf80] system_call_fastpath at ffffffff8159a159

It crashes at
        BUG_ON(create && (ext_flags & OCFS2_EXT_REFCOUNTED));
in ocfs2_direct_IO_get_blocks.

ocfs2_direct_IO_get_blocks is expecting the OCFS2_EXT_REFCOUNTED be removed in
ocfs2_prepare_inode_for_write() if it was there. But no cluster lock is taken
during the time before (or inside) ocfs2_prepare_inode_for_write() and after
ocfs2_direct_IO_get_blocks().

It can happen in this case:

Node A(which crashes)				Node B
------------------------                 ---------------------------
ocfs2_file_aio_write
  ocfs2_prepare_inode_for_write
    ocfs2_inode_lock
    ...
    ocfs2_inode_unlock
  #no refcount found
....					ocfs2_reflink
                                          ocfs2_inode_lock
                                          ...
                                          ocfs2_inode_unlock
                                          #now, refcount flag set on extent

                                        ...
                                        flush change to disk

ocfs2_direct_IO_get_blocks
  ocfs2_get_clusters
    #extent map miss
    #buffer_head miss
    read extents from disk
  found refcount flag on extent
  crash..

Fix:
Take rw_lock in ocfs2_reflink path

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:45 -07:00
Xue jiufei
b253bfd878 ocfs2: revert "ocfs2: fix NULL pointer dereference when dismount and ocfs2rec simultaneously"
75f82eaa50 ("ocfs2: fix NULL pointer dereference when dismount and
ocfs2rec simultaneously") may cause umount hang while shutting down
truncate log.

The situation is as followes:
ocfs2_dismout_volume
-> ocfs2_recovery_exit
  -> free osb->recovery_map
-> ocfs2_truncate_shutdown
  -> lock global bitmap inode
    -> ocfs2_wait_for_recovery
          -> check whether osb->recovery_map->rm_used is zero

Because osb->recovery_map is already freed, rm_used can be any other
values, so it may yield umount hang.

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:45 -07:00
Tariq Saeed
27bf6305cf ocfs2: fix deadlock when two nodes are converting same lock from PR to EX and idletimeout closes conn
Orabug: 18639535

Two node cluster and both nodes hold a lock at PR level and both want to
convert to EX at the same time.  Master node 1 has sent BAST and then
closes the connection due to idletime out.  Node 0 receives BAST, sends
unlock req with cancel flag but gets error -ENOTCONN.  The problem is
this error is ignored in dlm_send_remote_unlock_request() on the
**incorrect** assumption that the master is dead.  See NOTE in comment
why it returns DLM_NORMAL.  Upon getting DLM_NORMAL, node 0 proceeds to
sends convert (without cancel flg) which fails with -ENOTCONN.  waits 5
sec and resends.

This time gets DLM_IVLOCKID from the master since lock not found in
grant, it had been moved to converting queue in response to conv PR->EX
req.  No way out.

Node 1 (master)				Node 0
==============				======

  lock mode PR				PR

  convert PR -> EX
  mv grant -> convert and que BAST
  ...
                     <-------- convert PR -> EX
  convert que looks like this: ((node 1, PR -> EX) (node 0, PR -> EX))
  ...
                        BAST (want PR -> NL)
                     ------------------>
  ...
  idle timout, conn closed
                                ...
                                In response to BAST,
                                sends unlock with cancel convert flag
                                gets -ENOTCONN. Ignores and
                                sends remote convert request
                                gets -ENOTCONN, waits 5 Sec, retries
  ...
  reconnects
                   <----------------- convert req goes through on next try
  does not find lock on grant que
                   status DLM_IVLOCKID
                   ------------------>
  ...

No way out.  Fix is to keep retrying unlock with cancel flag until it
succeeds or the master dies.

Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:45 -07:00
alex chen
5fb1beb069 ocfs2: should add inode into orphan dir after updating entry in ocfs2_rename()
There are two files a and b in dir /mnt/ocfs2.

    node A                           node B

  mv a b
  In ocfs2_rename(), after calling
  ocfs2_orphan_add(), the inode of
  file b will be added into orphan
  dir.

  If ocfs2_update_entry() fails,
  ocfs2_rename return error and mv
  operation fails. But file b still
  exists in the parent dir.

  ocfs2_queue_orphan_scan
   -> ocfs2_queue_recovery_completion
   -> ocfs2_complete_recovery
   -> ocfs2_recover_orphans
  The inode of the file b will be
  put with iput().

  ocfs2_evict_inode
   -> ocfs2_delete_inode
   -> ocfs2_wipe_inode
   -> ocfs2_remove_inode
  OCFS2_VALID_FL in the inode
  i_flags will be cleared.

                                   The file b still can be accessed
                                   on node B.
                                   ls /mnt/ocfs2
                                   When first read the file b with
                                   ocfs2_read_inode_block(). It will
                                   validate the inode using
                                   ocfs2_validate_inode_block().
                                   Because OCFS2_VALID_FL not set in
                                   the inode i_flags, so the file
                                   system will be readonly.

So we should add inode into orphan dir after updating entry in
ocfs2_rename().

Signed-off-by: alex.chen <alex.chen@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:45 -07:00
Jaegeuk Kim
98397ff3cd f2fs: fix not to allocate unnecessary blocks during fallocate
This patch fixes the fallocate bug like below. (See xfstests/255)

In fallocate(fd, 0, 20480),
expand_inode_data processes
	for (index = pg_start; index <= pg_end; index++) {
		f2fs_reserve_block();
		...
	}

So, even though fallocate requests 20480, 5 blocks, f2fs allocates 6 blocks
including pg_end.
So, this patch adds one condition to avoid block allocation.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-06-23 10:05:08 +09:00
Jaegeuk Kim
ead432756a f2fs: recover fallocated data and its i_size together
This patch arranges the f2fs_locks to cover the fallocated data and its i_size.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-06-23 10:05:08 +09:00
Jaegeuk Kim
ccfb30001f f2fs: fix to report newly allocate region as extent
Previous get_block in f2fs didn't report the newly allocated region which has
NEW_ADDR.
For reader, it should not report, but fiemap needs this.
So, this patch introduces two get_block sharing core function.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-06-23 10:05:07 +09:00
Linus Torvalds
2dfded8210 File locking related bugfixes for v3.16 (pile #2)
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTpiwZAAoJEAAOaEEZVoIVplsP/383a9q3eXonbsi+Ea8CGbRl
 tdjjVhM1OY4NZYFAoulILDt3HqPTC6MBnqKlHz+BuMziwd/1+3w8S4E7IEwm/KtM
 ghNYX8ct5Bf1nc5QEdDmwf4PX48QbRwTuT1uIcXaJ+KtxTzI9qN7mnjRN91TUtq4
 WRGvOl0AsGCVq8YqxjztgD3TYbu7AG/72Em+DE9f81PTArAPTo2ySc3gxPuJJAsg
 G1x46Gx46sfqFX2FY4SPsXen+J/67Og67y6eBawxnT2Bp6ZGDuW+jyPRmkhf0yth
 pWAtkUi3XmEe6kk6GHiICsS0Yn0RG4jbz39+Ja+X7jibQVJ8Iz6b+Optw9RNQwYt
 jDWHKFS2AaL/CDejHYOQ1shHcozpRojtIbDLIZ9vTNTQ2r5cdaBvkXMmQzdoktmN
 wQtQ9AzBl8fHOFOQeCAwd/ZfCLIotvLoLds3K/CSqmpsyK2+9IyriQLKKZ2xm6Iu
 8+UUspGQcNVwcMP6YWtI6G+u58/mVanmK6dtpiyXrncZLAfU4H7ETL2IPu8jJTbv
 kTFCOJtXQzNZa5Xqur1hIewOG9/RlvAZAnii0Ghc3nXTWCCeNeI8re3jw4g9KdRv
 33t4sYfJld8LQ1NSMqIDyAs+fvytmGurYt+uhVpb58G/4CqBLNpmmIGIQ6LFLMfs
 75FQnbAezrD0H/JyAHUk
 =AJD3
 -----END PGP SIGNATURE-----

Merge tag 'locks-v3.16-2' of git://git.samba.org/jlayton/linux

Pull file locking fixes from Jeff Layton:
 "File locking related bugfixes

  Nothing too earth-shattering here.  A fix for a potential regression
  due to a patch in pile #1, and the addition of a memory barrier to
  prevent a race condition between break_deleg and generic_add_lease"

* tag 'locks-v3.16-2' of git://git.samba.org/jlayton/linux:
  locks: set fl_owner for leases back to current->files
  locks: add missing memory barrier in break_deleg
2014-06-21 16:40:30 -10:00
Linus Torvalds
e13d100beb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This fixes some lockups in btrfs reported with rc1.  It probably has
  some performance impact because it is backing off our spinning locks
  more often and switching to a blocking lock.  I'll be able to nail
  that down next week, but for now I want to get the lockups taken care
  of.

  Otherwise some more stack reduction and assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix wrong error handle when the device is missing or is not writeable
  Btrfs: fix deadlock when mounting a degraded fs
  Btrfs: use bio_endio_nodec instead of open code
  Btrfs: fix NULL pointer crash when running balance and scrub concurrently
  btrfs: Skip scrubbing removed chunks to avoid -ENOENT.
  Btrfs: fix broken free space cache after the system crashed
  Btrfs: make free space cache write out functions more readable
  Btrfs: remove unused wait queue in struct extent_buffer
  Btrfs: fix deadlocks with trylock on tree nodes
2014-06-21 14:21:43 -10:00
Linus Torvalds
147f1404db Merge branch 'for-3.16' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfixes from Bruce Fields:
 "Fixes for a new regression from the xdr encoding rewrite, and a
  delegation problem we've had for a while (made somewhat more annoying
  by the vfs delegation support added in 3.13)"

* 'for-3.16' of git://linux-nfs.org/~bfields/linux:
  NFSD: fix bug for readdir of pseudofs
  NFSD: Don't hand out delegations for 30 seconds after recalling them.
2014-06-21 14:20:38 -10:00
Miao Xie
8408c716d7 Btrfs: fix wrong error handle when the device is missing or is not writeable
The original bio might be submitted, so we shoud increase bi_remaining to
account for it when we deal with the error that the device is missing or
is not writeable, or we would skip the endio handle.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:56 -07:00
Miao Xie
c55f139640 Btrfs: fix deadlock when mounting a degraded fs
The deadlock happened when we mount degraded filesystem, the reproduced
steps are following:
 # mkfs.btrfs -f -m raid1 -d raid1 <dev0> <dev1>
 # echo 1 > /sys/block/`basename <dev0>`/device/delete
 # mount -o degraded <dev1> <mnt>

The reason was that the counter -- bi_remaining was wrong. If the missing
or unwriteable device was the last device in the mapping array, we would
not submit the original bio, so we shouldn't increase bi_remaining of it
in btrfs_end_bio(), or we would skip the final endio handle.

Fix this problem by adding a flag into btrfs bio structure. If we submit
the original bio, we will set the flag, and we increase bi_remaining counter,
or we don't.

Though there is another way to fix it -- decrease bi_remaining counter of the
original bio when we make sure the original bio is not submitted, this method
need add more check and is easy to make mistake.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:56 -07:00
Miao Xie
e990f16763 Btrfs: use bio_endio_nodec instead of open code
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:55 -07:00
Wang Shilong
298a8f9cf1 Btrfs: fix NULL pointer crash when running balance and scrub concurrently
While running balance, scrub, fsstress concurrently we hit the
following kernel crash:

[56561.448845] BTRFS info (device sde): relocating block group 11005853696 flags 132
[56561.524077] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
[56561.524237] IP: [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
[56561.524297] PGD 9be28067 PUD 7f3dd067 PMD 0
[56561.524325] Oops: 0000 [#1] SMP
[....]
[56561.527237] Call Trace:
[56561.527309]  [<ffffffffa038980e>] scrub_enumerate_chunks+0x24e/0x490 [btrfs]
[56561.527392]  [<ffffffff810abe00>] ? abort_exclusive_wait+0x50/0xb0
[56561.527476]  [<ffffffffa038add4>] btrfs_scrub_dev+0x1a4/0x530 [btrfs]
[56561.527561]  [<ffffffffa0368107>] btrfs_ioctl+0x13f7/0x2a90 [btrfs]
[56561.527639]  [<ffffffff811c82f0>] do_vfs_ioctl+0x2e0/0x4c0
[56561.527712]  [<ffffffff8109c384>] ? vtime_account_user+0x54/0x60
[56561.527788]  [<ffffffff810f768c>] ? __audit_syscall_entry+0x9c/0xf0
[56561.527870]  [<ffffffff811c8551>] SyS_ioctl+0x81/0xa0
[56561.527941]  [<ffffffff815707f7>] tracesys+0xdd/0xe2
[...]
[56561.528304] RIP  [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
[56561.528395]  RSP <ffff88004c0f5be8>
[56561.528454] CR2: 0000000000000078

This is because in btrfs_relocate_chunk(), we will free @bdev directly while
scrub may still hold extent mapping, and may access freed memory.

Fix this problem by wrapping freeing @bdev work into free_extent_map() which
is based on reference count.

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:55 -07:00
Qu Wenruo
ced96edc48 btrfs: Skip scrubbing removed chunks to avoid -ENOENT.
When run scrub with balance, sometimes -ENOENT will be returned, since
in scrub_enumerate_chunks() will search dev_extent in *COMMIT_ROOT*, but
btrfs_lookup_block_group() will search block group in *MEMORY*, so if a
chunk is removed but not committed, -ENOENT will be returned.

However, there is no need to stop scrubbing since other chunks may be
scrubbed without problem.

So this patch changes the behavior to skip removed chunks and continue
to scrub the rest.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:54 -07:00
Miao Xie
e570fd27f2 Btrfs: fix broken free space cache after the system crashed
When we mounted the filesystem after the crash, we got the following
message:
  BTRFS error (device xxx): block group xxxx has wrong amount of free space
  BTRFS error (device xxx): failed to load free space cache for block group xxx

It is because we didn't update the metadata of the allocated space (in extent
tree) until the file data was written into the disk. During this time, there was
no information about the allocated spaces in either the extent tree nor the
free space cache. when we wrote out the free space cache at this time (commit
transaction), those spaces were lost. In fact, only the free space that is
used to store the file data had this problem, the others didn't because
the metadata of them is updated in the same transaction context.

There are many methods which can fix the above problem
- track the allocated space, and write it out when we write out the free
  space cache
- account the size of the allocated space that is used to store the file
  data, if the size is not zero, don't write out the free space cache.

The first one is complex and may make the performance drop down.
This patch chose the second method, we use a per-block-group variant to
account the size of that allocated space. Besides that, we also introduce
a per-block-group read-write semaphore to avoid the race between
the allocation and the free space cache write out.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:54 -07:00
Miao Xie
5349d6c3ff Btrfs: make free space cache write out functions more readable
This patch makes the free space cache write out functions more readable,
and beisdes that, it also reduces the stack space that the function --
__btrfs_write_out_cache uses from 194bytes to 144bytes.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:54 -07:00
Filipe Manana
46fefe41b5 Btrfs: remove unused wait queue in struct extent_buffer
The lock_wq wait queue is not used anywhere, therefore just remove it.
On a x86_64 system, this reduced sizeof(struct extent_buffer) from 320
bytes down to 296 bytes, which means a 4Kb page can now be used for
13 extent buffers instead of 12.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:28 -07:00
Chris Mason
ea4ebde02e Btrfs: fix deadlocks with trylock on tree nodes
The Btrfs tree trylock function is poorly named.  It always takes
the spinlock and backs off if the blocking lock is held.  This
can lead to surprising lockups because people expect it to really be a
trylock.

This commit makes it a pure trylock, both for the spinlock and the
blocking lock.  It also reworks the nested lock handling slightly to
avoid taking the read lock while a spinning write lock might be held.

Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:19:55 -07:00
Jeff Layton
08bc03539d cifs: revalidate mapping prior to satisfying read_iter request with cache=loose
Before satisfying a read with cache=loose, we should always check
that the pagecache is valid before allowing a read to be satisfied
out of it.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-06-19 13:34:04 -05:00
Kinglong Mee
f41c5ad2ff NFSD: fix bug for readdir of pseudofs
Commit 561f0ed498 (nfsd4: allow large readdirs) introduces a bug
about readdir the root of pseudofs.

Call xdr_truncate_encode() revert encoded name when skipping.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-17 16:42:48 -04:00
NeilBrown
6282cd5655 NFSD: Don't hand out delegations for 30 seconds after recalling them.
If nfsd needs to recall a delegation for some reason it implies that there is
contention on the file, so further delegations should not be handed out.

The current code fails to do so, and the result is effectively a
live-lock under some workloads: a client attempting a conflicting
operation on a read-delegated file receives NFS4ERR_DELAY and retries
the operation, but by the time it retries the server may already have
given out another delegation.

We could simply avoid delegations for (say) 30 seconds after any recall, but
this is probably too heavy handed.

We could keep a list of inodes (or inode numbers or filehandles) for recalled
delegations, but that requires memory allocation and searching.

The approach taken here is to use a bloom filter to record the filehandles
which are currently blocked from delegation, and to accept the cost of a few
false positives.

We have 2 bloom filters, each of which is valid for 30 seconds.   When a
delegation is recalled the filehandle is added to one filter and will remain
disabled for between 30 and 60 seconds.

We keep a count of the number of filehandles that have been added, so when
that count is zero we can bypass all other tests.

The bloom filters have 256 bits and 3 hash functions.  This should allow a
couple of dozen blocked  filehandles with minimal false positives.  If many
more filehandles are all blocked at once, behaviour will degrade towards
rejecting all delegations for between 30 and 60 seconds, then resetting and
allowing new delegations.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-17 16:42:47 -04:00
Konstantin Khlebnikov
ebe06187bf epoll: fix use-after-free in eventpoll_release_file
This fixes use-after-free of epi->fllink.next inside list loop macro.
This loop actually releases elements in the body.  The list is
rcu-protected but here we cannot hold rcu_read_lock because we need to
lock mutex inside.

The obvious solution is to use list_for_each_entry_safe().  RCU-ness
isn't essential because nobody can change this list under us, it's final
fput for this file.

The bug was introduced by ae10b2b4eb ("epoll: optimize EPOLL_CTL_DEL
using rcu")

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Stable <stable@vger.kernel.org> # 3.13+
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-16 17:21:59 -10:00
Björn Baumbach
a1d0b84c30 fs/cifs: fix regression in cifs_create_mf_symlink()
commit d81b8a40e2
("CIFS: Cleanup cifs open codepath")
changed disposition to FILE_OPEN.

Signed-off-by: Björn Baumbach <bb@sernet.de>
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Cc: <stable@vger.kernel.org> # v3.14+
Cc: Pavel Shilovsky <piastry@etersoft.ru>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-06-16 13:50:11 -05:00
Jan Kara
c5c7b8ddfb ext4: Fix buffer double free in ext4_alloc_branch()
Error recovery in ext4_alloc_branch() calls ext4_forget() even for
buffer corresponding to indirect block it did not allocate. This leads
to brelse() being called twice for that buffer (once from ext4_forget()
and once from cleanup in ext4_ind_map_blocks()) leading to buffer use
count misaccounting. Eventually (but often much later because there
are other users of the buffer) we will see messages like:
VFS: brelse: Trying to free free buffer

Another manifestation of this problem is an error:
JBD2 unexpected failure: jbd2_journal_revoke: !buffer_revoked(bh);
inconsistent data on disk

The fix is easy - don't forget buffer we did not allocate. Also add an
explanatory comment because the indexing at ext4_alloc_branch() is
somewhat subtle.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-06-15 23:46:28 -04:00
Linus Torvalds
16d52ef7c0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull more btrfs updates from Chris Mason:
 "This has a few fixes since our last pull and a new ioctl for doing
  btree searches from userland.  It's very similar to the existing
  ioctl, but lets us return larger items back down to the app"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix error handling in create_pending_snapshot
  btrfs: fix use of uninit "ret" in end_extent_writepage()
  btrfs: free ulist in qgroup_shared_accounting() error path
  Btrfs: fix qgroups sanity test crash or hang
  btrfs: prevent RCU warning when dereferencing radix tree slot
  Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting
  btrfs: new ioctl TREE_SEARCH_V2
  btrfs: tree_search, search_ioctl: direct copy to userspace
  btrfs: new function read_extent_buffer_to_user
  btrfs: tree_search, copy_to_sk: return needed size on EOVERFLOW
  btrfs: tree_search, copy_to_sk: return EOVERFLOW for too small buffer
  btrfs: tree_search, search_ioctl: accept varying buffer
  btrfs: tree_search: eliminate redundant nr_items check
2014-06-14 19:48:43 -05:00
Linus Torvalds
a311c48038 Merge git://git.kvack.org/~bcrl/aio-next
Pull aio fix and cleanups from Ben LaHaise:
 "This consists of a couple of code cleanups plus a minor bug fix"

* git://git.kvack.org/~bcrl/aio-next:
  aio: cleanup: flatten kill_ioctx()
  aio: report error from io_destroy() when threads race in io_destroy()
  fs/aio.c: Remove ctx parameter in kiocb_cancel
2014-06-14 19:43:27 -05:00
Eric Sandeen
47a306a748 btrfs: fix error handling in create_pending_snapshot
fcebe456 cut and pasted some code to a later point
in create_pending_snapshot(), but didn't switch
to the appropriate error handling for this stage
of the function.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13 09:52:30 -07:00
Eric Sandeen
3e2426bd0e btrfs: fix use of uninit "ret" in end_extent_writepage()
If this condition in end_extent_writepage() is false:

	if (tree->ops && tree->ops->writepage_end_io_hook)

we will then test an uninitialized "ret" at:

	ret = ret < 0 ? ret : -EIO;

The test for ret is for the case where ->writepage_end_io_hook
failed, and we'd choose that ret as the error; but if
there is no ->writepage_end_io_hook, nothing sets ret.

Initializing ret to 0 should be sufficient; if
writepage_end_io_hook wasn't set, (!uptodate) means
non-zero err was passed in, so we choose -EIO in that case.

Signed-of-by: Eric Sandeen <sandeen@redhat.com>

Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13 09:52:28 -07:00
Eric Sandeen
d737278091 btrfs: free ulist in qgroup_shared_accounting() error path
If tmp = ulist_alloc(GFP_NOFS) fails, we return without
freeing the previously allocated qgroups = ulist_alloc(GFP_NOFS)
and cause a memory leak.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13 09:52:26 -07:00
Filipe Manana
b050f9f6dd Btrfs: fix qgroups sanity test crash or hang
Often when running the qgroups sanity test, a crash or a hang happened.
This is because the extent buffer the test uses for the root node doesn't
have an header level explicitly set, making it have a random level value.
This is a problem when it's not zero for the btrfs_search_slot() calls
the test ends up doing, resulting in crashes or hangs such as the following:

[ 6454.127192] Btrfs loaded, debug=on, assert=on, integrity-checker=on
(...)
[ 6454.127760] BTRFS: selftest: Running qgroup tests
[ 6454.127964] BTRFS: selftest: Running test_test_no_shared_qgroup
[ 6454.127966] BTRFS: selftest: Qgroup basic add
[ 6480.152005] BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:5383]
[ 6480.152005] Modules linked in: btrfs(+) xor raid6_pq binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc i2c_piix4 i2c_core pcspkr evbug psmouse serio_raw e1000 [last unloaded: btrfs]
[ 6480.152005] irq event stamp: 188448
[ 6480.152005] hardirqs last  enabled at (188447): [<ffffffff8168ef5c>] restore_args+0x0/0x30
[ 6480.152005] hardirqs last disabled at (188448): [<ffffffff81698e6a>] apic_timer_interrupt+0x6a/0x80
[ 6480.152005] softirqs last  enabled at (188446): [<ffffffff810516cf>] __do_softirq+0x1cf/0x450
[ 6480.152005] softirqs last disabled at (188441): [<ffffffff81051c25>] irq_exit+0xb5/0xc0
[ 6480.152005] CPU: 0 PID: 5383 Comm: modprobe Not tainted 3.15.0-rc8-fdm-btrfs-next-33+ #4
[ 6480.152005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 6480.152005] task: ffff8802146125a0 ti: ffff8800d0d00000 task.ti: ffff8800d0d00000
[ 6480.152005] RIP: 0010:[<ffffffff81349a63>]  [<ffffffff81349a63>] __write_lock_failed+0x13/0x20
[ 6480.152005] RSP: 0018:ffff8800d0d038e8  EFLAGS: 00000287
[ 6480.152005] RAX: 0000000000000000 RBX: ffffffff8168ef5c RCX: 000005deb8525852
[ 6480.152005] RDX: 0000000000000000 RSI: 0000000000001d45 RDI: ffff8802105000b8
[ 6480.152005] RBP: ffff8800d0d038e8 R08: fffffe12710f63db R09: ffffffffa03196fb
[ 6480.152005] R10: ffff8802146125a0 R11: ffff880214612e28 R12: ffff8800d0d03858
[ 6480.152005] R13: 0000000000000000 R14: ffff8800d0d00000 R15: ffff8802146125a0
[ 6480.152005] FS:  00007f14ff804700(0000) GS:ffff880215e00000(0000) knlGS:0000000000000000
[ 6480.152005] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 6480.152005] CR2: 00007fff4df0dac8 CR3: 00000000d1796000 CR4: 00000000000006f0
[ 6480.152005] Stack:
[ 6480.152005]  ffff8800d0d03908 ffffffff810ae967 0000000000000001 ffff8802105000b8
[ 6480.152005]  ffff8800d0d03938 ffffffff8168e57e ffffffffa0319c16 0000000000000007
[ 6480.152005]  ffff880210500000 ffff880210500100 ffff8800d0d039b8 ffffffffa0319c16
[ 6480.152005] Call Trace:
[ 6480.152005]  [<ffffffff810ae967>] do_raw_write_lock+0x47/0xa0
[ 6480.152005]  [<ffffffff8168e57e>] _raw_write_lock+0x5e/0x80
[ 6480.152005]  [<ffffffffa0319c16>] ? btrfs_tree_lock+0x116/0x270 [btrfs]
[ 6480.152005]  [<ffffffffa0319c16>] btrfs_tree_lock+0x116/0x270 [btrfs]
[ 6480.152005]  [<ffffffffa02b2acb>] btrfs_lock_root_node+0x3b/0x50 [btrfs]
[ 6480.152005]  [<ffffffffa02b81a6>] btrfs_search_slot+0x916/0xa20 [btrfs]
[ 6480.152005]  [<ffffffff811a727f>] ? create_object+0x23f/0x300
[ 6480.152005]  [<ffffffffa02b9958>] btrfs_insert_empty_items+0x78/0xd0 [btrfs]
[ 6480.152005]  [<ffffffffa036041a>] insert_normal_tree_ref.constprop.4+0xa2/0x19a [btrfs]
[ 6480.152005]  [<ffffffffa03605c3>] test_no_shared_qgroup+0xb1/0x1ca [btrfs]
[ 6480.152005]  [<ffffffff8108cad6>] ? local_clock+0x16/0x30
[ 6480.152005]  [<ffffffffa035ef8e>] btrfs_test_qgroups+0x1ae/0x1d7 [btrfs]
[ 6480.152005]  [<ffffffffa03a69d2>] ? ftrace_define_fields_btrfs_space_reservation+0xfd/0xfd [btrfs]
[ 6480.152005]  [<ffffffffa03a6a86>] init_btrfs_fs+0xb4/0x153 [btrfs]
[ 6480.152005]  [<ffffffff81000352>] do_one_initcall+0x102/0x150
[ 6480.152005]  [<ffffffff8103d223>] ? set_memory_nx+0x43/0x50
[ 6480.152005]  [<ffffffff81682668>] ? set_section_ro_nx+0x6d/0x74
[ 6480.152005]  [<ffffffff810d91cc>] load_module+0x1cdc/0x2630
(...)

Therefore initialize the extent buffer as an empty leaf (level 0).

Issue easy to reproduce when btrfs is built as a module via:

    $ for ((i = 1; i <= 1000000; i++)); do rmmod btrfs; modprobe btrfs; done

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13 09:52:24 -07:00
Sasha Levin
f1e3c28949 btrfs: prevent RCU warning when dereferencing radix tree slot
Mark the dereference as protected by lock. Not doing so triggers
an RCU warning since the radix tree assumed that RCU is in use.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13 09:52:22 -07:00
Wang Shilong
5fbc7c59fd Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting
Steps to reproduce:

 # mkfs.btrfs -f /dev/sd[b-f] -m raid5 -d raid5
 # mkfs.ext4 /dev/sdc --->corrupt one of btrfs device
 # mount /dev/sdb /mnt -o degraded
 # btrfs scrub start -BRd /mnt

This is because readahead would skip missing device, this is not true
for RAID5/6, because REQ_GET_READ_MIRRORS return 1 for RAID5/6 block
mapping. If expected data locates in missing device, readahead thread
would not call __readahead_hook() which makes event @rc->elems=0
wait forever.

Fix this problem by checking return value of btrfs_map_block(),we
can only skip missing device safely if there are several mirrors.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13 09:52:21 -07:00