Add information about the mapping state of the direct mapping to
/proc/meminfo. I chose /proc/meminfo because that is where all the other
memory statistics are too and it is a generally useful metric even
outside debugging situations. A lot of split kernel pages means the
kernel will run slower.
This way we can see how many large pages are really used for it and how
many are split.
Useful for general insight into the kernel.
v2: Add hotplug locking to 64bit to plug a very obscure theoretical race.
32bit doesn't need it because it doesn't support hotadd for lowmem.
Fix some typos
v3: Rename dpages_cnt
Add CONFIG ifdef for count update as requested by tglx
Expand description
v4: Fix stupid bugs added in v3
Move update_page_count to pageattr.c
Signed-off-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix some issues in pagemap_read noted by Alexey:
- initialize pagemap_walk.mm to "mm" , so the code starts working as
advertised
- initialize ->private to "&pm" so it wouldn't immediately oops in
pagemap_pte_hole()
- unstatic struct pagemap_walk, so two threads won't fsckup each other
(including those started by root, including flipping ->mm when you don't
have permissions)
- pagemap_read() contains two calls to ptrace_may_attach(), second one
looks unneeded.
- avoid possible kmalloc(0) and integer wraparound.
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Personally, I'd just remove the functionality entirely - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't use a static entry, so as to prevent races during concurrent use
of this function.
Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This commit includes a bugfix for the fragile setuid fixup code in the
case that filesystem capabilities are supported (in access()). The effect
of this fix is gated on filesystem capability support because changing
securebits is only supported when filesystem capabilities support is
configured.)
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The url in the help text for ntfs should be updated.
Acked-by: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The misc_mtx should provide all the protection required to keep the daemon
hash table sane during miscdev registration. Since this mutex is causing
gratuitous lockdep warnings, this patch removes it.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Reported-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When write in reiserfs_quota_write() fails, we have to properly release
i_mutex. One error path has been missing the unlock...
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When write in ext4_quota_write() fails, we have to properly release
i_mutex. One error path has been missing the unlock...
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When write in ext3_quota_write() fails, we have to properly release
i_mutex. One error path has been missing the unlock...
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The legacy protocol's open operation doesn't handle an append operation
(it is expected that the client take care of it). We were incorrectly
passing the extended protocol's flag through even in legacy mode. This
was reported in bugzilla report #10689. This patch fixes the problem
by disallowing extended protocol open modes from being passed in legacy
mode and implemented append functionality on the client side by adding
a seek after the open.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
fsync_buffers_list() and sync_dirty_buffer() both issue async writes and
then immediately wait on them. Conceptually, that makes them sync writes
and we should treat them as such so that the IO schedulers can handle
them appropriately.
This patch fixes a write starvation issue that Lin Ming reported, where
xx is stuck for more than 2 minutes because of a large number of
synchronous IO in the system:
INFO: task kjournald:20558 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
kjournald D ffff810010820978 6712 20558 2
ffff81022ddb1d10 0000000000000046 ffff81022e7baa10 ffffffff803ba6f2
ffff81022ecd0000 ffff8101e6dc9160 ffff81022ecd0348 000000008048b6cb
0000000000000086 ffff81022c4e8d30 0000000000000000 ffffffff80247537
Call Trace:
[<ffffffff803ba6f2>] kobject_get+0x12/0x17
[<ffffffff80247537>] getnstimeofday+0x2f/0x83
[<ffffffff8029c1ac>] sync_buffer+0x0/0x3f
[<ffffffff8066d195>] io_schedule+0x5d/0x9f
[<ffffffff8029c1e7>] sync_buffer+0x3b/0x3f
[<ffffffff8066d3f0>] __wait_on_bit+0x40/0x6f
[<ffffffff8029c1ac>] sync_buffer+0x0/0x3f
[<ffffffff8066d48b>] out_of_line_wait_on_bit+0x6c/0x78
[<ffffffff80243909>] wake_bit_function+0x0/0x23
[<ffffffff8029e3ad>] sync_dirty_buffer+0x98/0xcb
[<ffffffff8030056b>] journal_commit_transaction+0x97d/0xcb6
[<ffffffff8023a676>] lock_timer_base+0x26/0x4b
[<ffffffff8030300a>] kjournald+0xc1/0x1fb
[<ffffffff802438db>] autoremove_wake_function+0x0/0x2e
[<ffffffff80302f49>] kjournald+0x0/0x1fb
[<ffffffff802437bb>] kthread+0x47/0x74
[<ffffffff8022de51>] schedule_tail+0x28/0x5d
[<ffffffff8020cac8>] child_rip+0xa/0x12
[<ffffffff80243774>] kthread+0x0/0x74
[<ffffffff8020cabe>] child_rip+0x0/0x12
Lin Ming confirms that this patch fixes the issue. I've run tests with
it for the past week and no ill effects have been observed, so I'm
proposing it for inclusion into 2.6.26.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch fixes bz 450641.
This patch changes the computation for zero_metapath_length(), which it
renames to metapath_branch_start(). When you are extending the metadata
tree, The indirect blocks that point to the new data block must either
diverge from the existing tree either at the inode, or at the first
indirect block. They can diverge at the first indirect block because the
inode has room for 483 pointers while the indirect blocks have room for
509 pointers, so when the tree is grown, there is some free space in the
first indirect block. What metapath_branch_start() now computes is the
height where the first indirect block for the new data block is located.
It can either be 1 (if the indirect block diverges from the inode) or 2
(if it diverges from the first indirect block).
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes bugzilla bug bz448866: gfs2: BUG: unable to
handle kernel paging request at ffff81002690e000.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
In some cases it could happen that some block passed test in
udf_check_anchor_block() even though udf_read_tagged() refused to read it later
(e.g. because checksum was not correct). This patch makes
udf_check_anchor_block() use udf_read_tagged() so that the checking is
stricter.
This fixes the regression (certain disks unmountable) caused by commit
423cf6dc04.
Signed-off-by: Tomas Janousek <tomi@nomi.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Fix a sign issue in xdr_decode_fhstatus3()
Fix incorrect comparison in nfs_validate_mount_data()
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix the following sparse warnings:
fs/dcache.c:2183:19: warning: symbol 'filp_cachep' was not declared. Should it be static?
fs/dcache.c:115:3: warning: context imbalance in 'dentry_iput' - unexpected unlock
fs/dcache.c:188:2: warning: context imbalance in 'dput' - different lock contexts for basic block
fs/dcache.c:400:2: warning: context imbalance in 'prune_one_dentry' - different lock contexts for basic block
fs/dcache.c:431:22: warning: context imbalance in 'prune_dcache' - different lock contexts for basic block
fs/dcache.c:563:2: warning: context imbalance in 'shrink_dcache_sb' - different lock contexts for basic block
fs/dcache.c:1385:6: warning: context imbalance in 'd_delete' - wrong count at exit
fs/dcache.c:1636:2: warning: context imbalance in '__d_unalias' - unexpected unlock
fs/dcache.c:1735:2: warning: context imbalance in 'd_materialise_unique' - different lock contexts for basic block
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The path that __d_path() computes can become slightly inconsistent when it
races with mount operations: it grabs the vfsmount_lock when traversing mount
points but immediately drops it again, only to re-grab it when it reaches the
next mount point. The result is that the filename computed is not always
consisent, and the file may never have had that name. (This is unlikely, but
still possible.)
Fix this by grabbing the vfsmount_lock for the whole duration of
__d_path().
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: John Johansen <jjohansen@suse.de>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fl_insert and fl_remove are not used right now in the kernel. Remove them.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
generic_readlink calls ERR_PTR for negative and positive values
(vfs_readlink returns length of "link"), but it should not
(not an errno) and does not need to.
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Here are some more places where path_{get,put}() can be used instead of
dput()/mntput() pair.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The POSIX.1 draft spec for futimens()/utimensat() says:
Only a process with the effective user ID equal to the
user ID of the file, *or with write access to the file*,
or with appropriate privileges may use futimens() or
utimensat() with a null pointer as the times argument
or with both tv_nsec fields set to the special value
UTIME_NOW.
The important piece here is "with write access to the file", and
this matters for futimens(), which deals with an argument that
is a file descriptor referring to the file whose timestamps are
being updated, The standard is saying that the "writability"
check is based on the file permissions, not the access mode with
which the file is opened. (This behavior is consistent with the
semantics of FreeBSD's futimes().) However, Linux is currently
doing the latter -- futimens(fd, times) is a library
function implemented as
utimensat(fd, NULL, times, 0)
and within the utimensat() implementation we have the code:
f = fget(dfd); // dfd is 'fd'
...
if (f) {
if (!(f->f_mode & FMODE_WRITE))
goto mnt_drop_write_and_out;
The check should instead be based on the file permissions.
Thanks to Miklos for pointing out how to do this check.
Miklos also pointed out a simplification that could be
made to my first version of this patch, since the checks
for the pathname and file descriptor cases can now be
conflated.
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The POSIX.1 draft spec for utimensat() says:
Only a process with the effective user ID equal to the
user ID of the file or with appropriate privileges may use
futimens() or utimensat() with a non-null times argument
that does not have both tv_nsec fields set to UTIME_NOW
and does not have both tv_nsec fields set to UTIME_OMIT.
If this condition is violated, then the error EPERM should result.
However, the current implementation does not generate EPERM if
one tv_nsec field is UTIME_NOW while the other is UTIME_OMIT.
It should give this error for that case.
This patch:
a) Repairs that problem.
b) Removes the now unneeded nsec_special() helper function.
c) Adds some comments to explain the checks that are being
performed.
Thanks to Miklos, who provided comments on the previous iteration
of this patch. As a result, this version is a little simpler and
and its logic is better structured.
Miklos suggested an alternative idea, migrating the
is_owner_or_cap() checks into fs/attr.c:inode_change_ok() via
the use of an ATTR_OWNER_CHECK flag. Maybe we could do that
later, but for now I've gone with this version, which is
IMO simpler, and can be more easily read as being correct.
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The POSIX.1 draft spec for utimensat() says that if a times[n].tv_nsec
field is UTIME_OMIT or UTIME_NOW, then the value in the corresponding
tv_sec field is ignored. See the last sentence of this para, from
the spec:
If the tv_nsec field of a timespec structure has
the special value UTIME_NOW, the file's relevant
timestamp shall be set to the greatest value
supported by the file system that is not greater than
the current time. If the tv_nsec field has the
special value UTIME_OMIT, the file's relevant
timestamp shall not be changed. In either case,
the tv_sec field shall be ignored.
However the current Linux implementation requires the tv_sec value to be
zero (or the EINVAL error results). This requirement should be removed.
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch fixes utimensat() to make its behavior consistent
with that of utime()/utimes() when dealing with files marked
immutable and append-only.
The current utimensat() implementation also returns EPERM if
'times' is non-NULL and the tv_nsec fields are both UTIME_NOW.
For consistency, the
(times != NULL && times[0].tv_nsec == UTIME_NOW &&
times[1].tv_nsec == UTIME_NOW)
case should be treated like the traditional utimes() case where
'times' is NULL. That is, the call should succeed for a file
marked append-only and should give the error EACCES if the file
is marked as immutable.
The simple way to do this is to set 'times' to NULL
if (times[0].tv_nsec == UTIME_NOW && times[1].tv_nsec == UTIME_NOW).
This is also the natural approach, since POSIX.1 semantics consider the
times == {{x, UTIME_NOW}, {y, UTIME_NOW}}
to be exactly equivalent to the case for
times == NULL.
(Thanks to Miklos for pointing this out.)
Patch 3 in this series relies on the simplification provided
by this patch.
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
devcgroup_inode_permission() expects MAY_FOO, not FMODE_FOO; kindly
keep your misdesign consistent if you positively have to inflict it
on the kernel.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christian Borntraeger reported that reinstating cond_resched() with
CONFIG_PREEMPT caused a performance regression on lmbench:
For example select file 500:
23 microseconds
32 microseconds
and that's really because we totally unnecessarily do the cond_resched()
in the innermost loop of select(), which is just silly.
This moves it out from the innermost loop (which only ever loops ove the
bits in a single "unsigned long" anyway), which makes the performance
regression go away.
Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the patch for the group descriptor table corruption during
online resize pointed out by Theodore Tso. The problem was caused by
the fact that the ext4 group descriptor can be either 32 or 64 bytes
long. Only the 64 bytes structure was taken into account.
Signed-off-by: Frederic Bohe <frederic.bohe@bull.net>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use max not min to enforce a lower limit on the max I/O size.
This bug was introduced by "fuse: fix max i/o size calculation" (commit
e5d9a0df07).
Thanks to Brian Wang for noticing.
Reported-by: Brian Wang <ywang221@hotmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Szabolcs Szakacsits <szaka@ntfs-3g.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
ocfs2: Remove ->hangup() from stack glue operations.
ocfs2: Move the call of ocfs2_hb_ctl into the stack glue.
ocfs2: Move the hb_ctl_path sysctl into the stack glue.
The ->hangup() call was only used to execute ocfs2_hb_ctl. Now that
the generic stack glue code handles this, the underlying stack drivers
don't need to know about it.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Take o2hb_stop() out of the o2cb code and make it part of the generic
stack glue as ocfs2_leave_group(). This also allows us to remove the
ocfs2_get_hb_ctl_path() function - everything to do with hb_ctl is now
part of stackglue.c. o2cb no longer needs a ->hangup() function.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2 needs to call out to the hb_ctl program at unmount for all cluster
stacks. The first step is to move the hb_ctl_path sysctl out of the
o2cb code and into the generic stack glue.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In commit d20894a237 ("Remove a.out
interpreter support in ELF loader"), Andi removed support for a.out
interpreters from the ELF loader, which was only ever needed for the
transition from a.out to ELF.
This removes the last traces of that support, in particular the
inclusion of <linux/a.out.h>.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: Peter Korsgaard <jacmet@sunsite.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We only need it for the /sbin/loader hack for OSF/1 executables, and we
don't want to include it otherwise.
While we're at it, remove the redundant '&& CONFIG_ARCH_SUPPORTS_AOUT'
in the ifdef around that code. It's already dependent on __alpha__, and
CONFIG_ARCH_SUPPORTS_AOUT is hard-coded to 'y' there.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: Peter Korsgaard <jacmet@sunsite.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 706047a797, "udf: Fix compilation
warnings when UDF debug is on" inadvertently (I assume) enabled
debugging messages by default for UDF. This patch disables them again.
Signed-off-by: Paul Collins <paul@ondioline.org>
Signed-off-by: Jan Kara <jack@suse.cz>
We were walking right into huge page areas in the pagemap walker, and
calling the pmds pmd_bad() and clearing them.
That leaked huge pages. Bad.
This patch at least works around that for now. It ignores huge pages in
the pagemap walker for the time being, and won't leak those pages.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need this at least for huge page detection for now, because powerpc
needs the vm_area_struct to be able to determine whether a virtual address
is referring to a huge page (its pmd_huge() doesn't work).
It might also come in handy for some of the other users.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
New chmod() allows only acceptable permission, and if not acceptable, it
returns -EPERM. Old one allows even if it can't store permission to on
disk inode. But it seems too strict for users.
E.g. https://bugzilla.redhat.com/show_bug.cgi?id=449080: With new one,
rsync couldn't create the temporary file.
So, this patch allows like old one, but now it doesn't change the
permission if it can't store, and it returns 0.
Also, this patch fixes missing check.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>