Three atomic_t variables cause a lot of bus locking. Because they are all
used in the same places in the code, just use a single spin lock.
Now that the atomic_t variables are gone, we can remove the request size
limitation since the code no longer depends on the limited width of atomic_t
on some platforms.
Test plan:
Compile with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled. Millions of fsx
operations, iozone, OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up tab damage and comments. Replace "file_offset" with more commonly
used "pos".
Test plan:
Compile with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For async iocb's, the NFS direct write path now returns EIOCBQUEUED,
and calls aio_complete when all the requested writes are finished. The
synchronous part of the NFS direct write path behaves exactly as it
was before.
Shared mapped NFS files will have some coherency difficulties when
accessed concurrently with aio+dio. Will need to explore how this
is handled in the local file system case.
Test plan:
aio-stress with "-O". OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pass the iocb argument all the way down to the direct write request
scheduler, and make it available in nfs_direct_write_result.
Test plan:
Compile the kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Millions of fsx-odirect ops. OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Eliminate the persistent use of automatic storage in all parts of the
NFS client's direct write path to pave the way for introducing support
for aio against files opened with the O_DIRECT flag.
Test plan:
Compile the kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Millions of fsx-odirect ops. OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Duplicate infrastructure from direct read path that will allow write
path to generate multiple write requests concurrently. This will
enable us to add support for aio in this path.
Temporarily we will lose the ability to do UNSTABLE writes followed by
a COMMIT in the direct write path. However, all applications I am
aware of that use NFS O_DIRECT currently write in relatively small
chunks, so this should not be inconvenient in any way.
Test plan:
Millions of fsx-odirect ops. OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Factor out the common piece of completing an NFS direct I/O request.
Test plan:
Compile kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Factor out a small common piece of the path that allocate nfs_direct_req
structures.
Test plan:
Compile kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We're about to add asynchrony to the NFS direct write path. Begin by
abstracting out the common pieces in the read path.
The first piece is nfs_direct_read_wait, which works the same whether the
process is waiting for a read or a write.
Test plan:
Compile kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For async iocb's, the NFS direct read path should return EIOCBQUEUED and
call aio_complete when all the requested reads are finished. The
synchronous part of the NFS direct read path behaves exactly as it was
before.
Test plan:
aio-stress with "-O". OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pass the iocb argument all the way down to the direct read request
scheduler, and make it available in nfs_direct_read_result.
Test plan:
Compile the kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Millions of fsx-odirect ops. OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Eliminate the persistent use of automatic storage in all parts of the NFS
client's direct read path to pave the way for introducing support for aio
against files opened with the O_DIRECT flag.
Test plan:
Compile the kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Millions of fsx-odirect ops. OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
size_t is used for holding byte counts, so use it for variables storing rsize.
Note that the write path will be updated as we add support for async
O_DIRECT writes.
Test plan:
Need to verify that existing comparisons against new size_t variables behave
correctly.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Update to latest coding style standards. Remove block comments on
statically defined functions, and place function definitions all on
one line.
Test plan:
Compile kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The NFS client's a_ops->direct_IO method, nfs_direct_IO, is required to
be present to allow NFS files to be opened with O_DIRECT, but is never
called because the NFS client shunts reads and writes to files opened
with O_DIRECT directly to its own routines.
Gut the nfs_direct_IO function. This eliminates the only part of the
NFS client's direct I/O path that requires support for multi-segment
iovs, allowing further simplification in subsequent patches.
Test plan:
Compile the kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled. Millions
of fsx-odirect ops. OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Same callback hierarchy inversion as for the NFS write calls. This patch is
not strictly speaking needed by the O_DIRECT code, but avoids confusing
differences between the asynchronous read and write code.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch inverts the callback hierarchy for NFS write calls.
Instead of having the NFSv2/v3/v4-specific code set up the RPC callback
ops, we allow the original caller to do so. This allows for more
flexibility w.r.t. how to set up and tear down the nfs_write_data
structure while still allowing the NFSv3/v4 code to perform error
handling.
The greater flexibility is needed by the asynchronous O_DIRECT code, which
wants to be able to hold on to the original nfs_write_data structures after
the WRITE RPC call has completed in order to be able to replay them if the
COMMIT call determines that the server has rebooted.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
posix_test_lock() returns a pointer to a struct file_lock which is unprotected
and can be removed while in use by the caller. Move the conflicting lock from
the return to a parameter, and copy the conflicting lock.
In most cases the caller ends up putting the copy of the conflicting lock on
the stack. On i386, sizeof(struct file_lock) appears to be about 100 bytes.
We're assuming that's reasonable.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reuse NFSDBG_DIRCACHE and NFSDBG_LOOKUPCACHE to provide additional
diagnostic messages that trace the operation of the NFS client's
directory behavior. A few new messages are now generated when NFSDBG_VFS
is active, as well, to trace normal VFS activity. This compromise
provides better trace debugging for those who use pre-built kernels,
without adding a lot of extra noise to the standard debug settings.
Test-plan:
Enable NFS trace debugging with flags 1, 2, or 4. You should be able to
see different types of trace messages with each flag setting.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean-up: replace rpc_call() helper with direct call to rpc_call_sync.
This makes NFSv2 and NFSv3 synchronous calls more computationally
efficient, and reduces stack consumption in functions that used to
invoke rpc_call more than once.
Test plan:
Compile kernel with CONFIG_NFS enabled. Connectathon on NFS version 2,
version 3, and version 4 mount points.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add fields to the rpc_procinfo struct that allow the display of a
human-readable name for each procedure in the rpc_iostats output.
Also fix it so that the NFSv4 stats are broken up correctly by
sub-procedure number. NFSv4 uses only two real RPC procedures:
NULL, and COMPOUND.
Test plan:
Mount with NFSv2, NFSv3, and NFSv4, and do "cat /proc/self/mountstats".
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFS client now shows various RPC I/O metrics in /proc/self/mountstats.
Test plan:
Mount/umount while doing "cat /proc/self/mountstats", multiple iterations
of connectathon locking suite. Test with NFS version 2, 3, and 4.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add a field in nfs_server to record a timestamp when a mount succeeds.
Report the number of seconds the file system has been mounted via
nfs_show_stats().
Test plan:
Mount an NFS file system, watch the mountstats reports and compare with
clock time.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make an inode or an nfs_server struct available in the logic that handles
JUKEBOX/DELAY type errors so the NFS client can account for them.
This patch is split out from the main nfs iostat patch to highlight minor
architectural changes required to support this statistic.
Test plan:
None.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Invoke the byte and event counter macros where we want to count bytes and
events.
Clean-up: fix a possible NULL dereference in nfs_lock, and simplify
nfs_file_open.
Test-plan:
fsx and iozone on UP and SMP systems, with and without pre-emption. Watch
for memory overwrite bugs, and performance loss (significantly more CPU
required per op).
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add a per-superblock performance counter facility to the NFS client. This
facility mimics the counters available for block devices and for
networking. Expose these new counters via the new /proc/self/mountstats
interface.
Thanks to Andrew Morton and Trond Myklebust for their review and comments.
Test plan:
fsx and iozone on UP and SMP systems, with and without pre-emption. Watch
for memory overwrite bugs, and performance loss (significantly more CPU
required per op).
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Get rid of "lock" and "posix", and spell out "vers=".
Test plan:
None.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Sometimes it's important to know the exact RPC retransmit settings the
kernel is using for an NFS mount point. Add this facility to the NFS
client's show_options method.
Test plan:
Set various retransmit settings via the mount command, and check that the
settings are reflected in /proc/mounts.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
semaphore to mutex conversion.
the conversion was generated via scripts, and the result was validated
automatically via a script as well.
build and boot tested.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
this converts fs/nfs to kzalloc() usage.
compile tested with make allyesconfig
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs4_open_revalidate: 'res' may be used uninitialized
nfs4_callback_compound: ‘hdr_res.nops’ may be used uninitialized
'op_nr’ may be used uninitialized
encode_getattr_res: ‘savep’ may be used uninitialized
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If not, we cannot guarantee that idmap->idmap_dentry, gss_auth->dentry and
clnt->cl_dentry are valid dentries.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
My previous "const static" vs "static const" cleanup missed a single case,
patch below takes care of it.
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that we flush out writes in the case when someone calls utimes() in
order to set the file times.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I've been reading through fs/nfs/write.c trying to track down a bug
that seems to be related to pages loosing a refcount and getting
freed too early (you interested in detail??) and I spotted a little
bug which the following patch should fix.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, there is no serialisation between NFS asynchronous writebacks
and truncation at the page level due to the fact that nfs_sync_inode()
cannot lock the pages that it is about to write out.
This means that it is possible to be flushing out data (and calling something
like set_page_writeback()) while the page cache is busy evicting the page.
Oops...
Use the hooks provided in try_to_release_page() to ensure that dirty pages
are always written back to storage before we evict them.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The nfs_open_context may live longer than the file descriptor that spawned
it, so it needs to carry a reference to the vfsmount. If not, then
generic_shutdown_super() may end up being called before reads and writes
have been flushed out.
Make a couple of functions static while we're at it...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It turns out that nfs4_proc_get_root() may return raw NFSv4 errors instead of
mapping them to kernel errors. Problem spotted by Neil Horman
<nhorman@tuxdriver.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Based on an original patch by Mike O'Connor and Greg Banks of SGI.
Mike states:
A normal user can panic an NFS client and cause a local DoS with
'judicious'(?) use of O_DIRECT. Any O_DIRECT write to an NFS file where the
user buffer starts with a valid mapped page and contains an unmapped page,
will crash in this way. I haven't followed the code, but O_DIRECT reads with
similar user buffers will probably also crash albeit in different ways.
Details: when nfs_get_user_pages() calls get_user_pages(), it detects and
correctly handles get_user_pages() returning an error, which happens if the
first page covered by the user buffer's address range is unmapped. However,
if the first page is mapped but some subsequent page isn't, get_user_pages()
will return a positive number which is less than the number of pages requested
(this behaviour is sort of analagous to a short write() call and appears to be
intentional). nfs_get_user_pages() doesn't detect this and hands off the
array of pages (whose last few elements are random rubbish from the newly
allocated array memory) to it's caller, whence they go to
nfs_direct_write_seg(), which then totally ignores the nr_pages it's given,
and calculates its own idea of how many pages are in the array from the user
buffer length. Needless to say, when it comes to transmit those uninitialised
page* pointers, we see a crash in the network stack.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Direct backport of 2.4 fix that didn't get propagated to 2.6; original
comment follows:
<quote>
When I specify the NFS port for nfsroot (e.g.,
nfsroot=<dir>,port=2049), the
kernel uses the wrong port. In my case it tries to use 264 (0x108)
instead
of 2049 (0x801).
This patch adds the missing htons().
Eric
</quote>
Patch got applied in 2.4.21-pre6. Author: Eric Lammerts (<eric@lammerts.org>,
AFAICS).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Turn noatime and nodiratime into per-mount instead of per-sb flags.
After all the preparations this is a rather trivial patch. The mount code
needs to treat the two options as per-mount instead of per-superblock, and
touch_atime needs to be changed to check the new MNT_ flags in addition to
the MS_ flags that are kept for filesystems that are always
noatime/nodiratime but not user settable anymore. Besides that core code
only nfs needed an update because it's leaving atime updates to the server
and thus sets the S_NOATIME flag on every inode, but needs to know whether
it's a real noatime mount for an getattr optimization.
While we're at it I've killed the IS_NOATIME/IS_NODIRATIME macros that were
only used by touch_atime.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch converts the inode semaphore to a mutex. I have tested it on
XFS and compiled as much as one can consider on an ia64. Anyway your
luck with it might be different.
Modified-by: Ingo Molnar <mingo@elte.hu>
(finished the conversion)
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It would be helpful if the kernel did not silently stop parsing
nfs options, but instead warned about any he does not recognize. The
attached patch adds one printk to do just that.
It took me a couple of hours to find my configuration mistake.
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch add EXPORT_SYMBOL(filemap_write_and_wait) and use it.
See mm/filemap.c:
And changes the filemap_write_and_wait() and filemap_write_and_wait_range().
Current filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
returns error. However, even if filemap_fdatawrite() returned an
error, it may have submitted the partially data pages to the device.
(e.g. in the case of -ENOSPC)
<quotation>
Andrew Morton writes,
If filemap_fdatawrite() returns an error, this might be due to some
I/O problem: dead disk, unplugged cable, etc. Given the generally
crappy quality of the kernel's handling of such exceptions, there's a
good chance that the filemap_fdatawait() will get stuck in D state
forever.
</quotation>
So, this patch doesn't wait if filemap_fdatawrite() returns the -EIO.
Trond, could you please review the nfs part? Especially I'm not sure,
nfs must use the "filemap_fdatawrite(inode->i_mapping) == 0", or not.
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If someone changes the uid/gid mapping in userland, then we do eventually
want those changes to be propagated to the kernel. Currently the kernel
assumes that it may cache entries forever.
Add an expiration time + garbage collector for idmap entries.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
inode->i_mode contains a lot more than just the mode bits. Make sure that
we mask away this extra stuff in SETATTR calls to the server.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Every ULP that uses the in-kernel RPC client, except the NLM
client, sets cl_chatty. There's no reason why NLM shouldn't set it, so
just get rid of cl_chatty and always be verbose.
Test-plan:
Compile with CONFIG_NFS enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Thanks to Ed Keizer for bug and root cause. He says: "... we could only mount
the top-level Solaris share. We could not mount deeper into the tree.
Investigation showed that Solaris allows UNIX authenticated FSINFO only on the
top level of the share. This is a problem because we share/export our home
directories one level higher than we mount them. I.e. we share the partition
and not the individual home directories. This prevented access to home
directories."
We still may need to try auth_sys for the case where the client doesn't have
appropriate credentials.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Upon return of a write delegation, the server will almost always bump the
change attribute. Ensure that we pick up that change so that we don't
invalidate our data cache unnecessarily.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
According to RFC3530 we're supposed to cache the change attribute
at the time the client receives a write delegation.
If the inode is clean, a CB_GETATTR callback by the server to the
client is supposed to return the cached change attribute.
If, OTOH, the inode is dirty, the client should bump the cached
change attribute by 1.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The SuS states that a call to write() will cause mtime to be updated on
the file. In order to satisfy that requirement, we need to flush out
any cached writes in nfs_getattr().
Speed things up slightly by not committing the writes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Most NFS server implementations allow up to 64KB reads and writes on the
wire. The Solaris NFS server allows up to a megabyte, for instance.
Now the Linux NFS client supports transfer sizes up to 1MB, too. This will
help reduce protocol and context switch overhead on read/write intensive NFS
workloads, and support larger atomic read and write operations on servers
that support them.
Test-plan:
Connectathon and iozone on mount point with wsize=rsize>32768 over TCP.
Tests with NFS over UDP to verify the maximum RPC payload size cap.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
To help NFS users and server developers, make the "inode number mismatch"
message display more useful information.
Test-plan:
None.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_statfs() generates a log message when GETATTR returns an error. This
is usually a useless message. Make it a dprintk.
Test plan:
None
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Red Hat found a problem in the error recovery logic in __init_nfs.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Replace ad hoc write parameter sanity checking in nfs_file_direct_write()
with a call to generic_write_checks(). This should make the proper checks
modulo the O_LARGEFILE flag, and should catch NFSv2-specific limitations by
virtue of i_sb->s_maxbytes.
Test plan:
Posix compliance testing with both NFSv2 and NFSv3.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In RFC3530, the RENEW operation is allowed to use either
the same principal, RPC security flavour and (if RPCSEC_GSS), the same
mechanism and service that was used for SETCLIENTID_CONFIRM
OR
Any principal, RPC security flavour and service combination that
currently has an OPEN file on the server.
Choose the latter since that doesn't require us to keep credentials for
the same principal for the entire duration of the mount.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Convert private implementations in NFSv4 state recovery and delegation
code to use kthreads.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When recovering from a delegation recall or a network partition, we need
to replay open(O_RDWR), open(O_RDONLY) and open(O_WRONLY) separately.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
A closer reading of RFC3530 reveals that OPEN_DOWNGRADE must always
specify a access modes that have been the argument of a previous OPEN
operation.
IOW: doing OPEN(O_RDWR) and then OPEN_DOWNGRADE(O_WRONLY) is forbidden
unless the user called OPEN(O_WRONLY)
In order to fix that, we really need to track the three possible open
states separately.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
OPEN is a stateful operation, so we must ensure that it always
completes. In order to allow users to interrupt the operation,
we need to make the RPC call asynchronous, and then wait on
completion (or cancel).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The NFSv4 model requires us to complete all RPC calls that might
establish state on the server whether or not the user wants to
interrupt it. We may also need to schedule new work (including
new RPC calls) in order to cancel the new state.
The asynchronous RPC model will allow us to ensure that RPC calls
always complete, but in order to allow for "synchronous" RPC, we
want to add the ability to wait for completion.
The waits are, of course, interruptible.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Shrink the RPC task structure. Instead of storing separate pointers
for task->tk_exit and task->tk_release, put them in a structure.
Also pass the user data pointer as a parameter instead of passing it via
task->tk_calldata. This enables us to nest callbacks.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that we always initiate flushing of data before we exit
a single-page ->writepage() call.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
To help in reducing the number of include dependencies, several files were
touched as they were getting needed headers indirectly for stuff they use.
Thanks also to Alan Menegotto for pointing out that net/dccp/proto.c had
linux/dccp.h include twice.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NFS client prevents mandatory lock, but there is a flaw on it; Locks are
possibly left if the mode is changed while locking.
This permits unlocking even if the mandatory lock bits are set.
Signed-off-by: ASANO Masahiro <masano@tnes.nec.co.jp>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ensure we call unmap_mapping_range() and sync dirty pages to disk before
doing an NFS direct write.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
- Missing initialisation of attribute bitmask in _nfs4_proc_write()
- On success, _nfs4_proc_write() must return number of bytes written.
- Missing post_op_update_inode() in _nfs4_proc_write()
- Missing initialisation of attribute bitmask in _nfs4_proc_commit()
- Missing post_op_update_inode() in _nfs4_proc_commit()
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that we use set_page_writeback() in the appropriate places
to help the VM in keeping its page radix_tree in sync.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Steve Dickson writes:
Doing the following:
1. On server:
$ mkdir ~/t
$ echo Hello > ~/t/tmp
2. On client, wait for a string to appear in this file:
$ until grep -q foo t/tmp ; do echo -n . ; sleep 1 ; done
3. On server, create a *new* file with the same name containing that
string:
$ mv ~/t/tmp ~/t/tmp.old; echo foo > ~/t/tmp
will show how the client will never (and I mean never ;-) ) see
the updated file.
The problem is that we do not update nfsi->cache_change_attribute when the
file changes on the server (we only update it when our client makes the
changes). This again means that functions like nfs_check_verifier() will
fail to register when the parent directory has changed and should trigger
a dentry lookup revalidation.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make sure cache_change_attribute is initialized to jiffies
so when the mtime changes on directory, the directory
will be refreshed.
Signed-off by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In cases where the server has gone insane, nfs_update_inode() may end
up calling nfs_invalidate_inode(), which again calls stuff that takes
the inode->i_lock that we're already holding.
In addition, given the sort of things we have in NFS these days that
need to be cleaned up on inode release, I'm not sure we should ever
be calling make_bad_inode().
Fix up spinlock recursion, and limit nfs_invalidate_inode() to clearing
the caches, and marking the inode as being stale.
Thanks to Steve Dickson <SteveD@redhat.com> for spotting this.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When caching locks due to holding a file delegation, we must always
check against local locks before sending anything to the server.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This is the fs/ part of the big kfree cleanup patch.
Remove pointless checks for NULL prior to calling kfree() in fs/.
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix some dprintk's so that NLM, NFS client, and RPC client compile
cleanly if CONFIG_SYSCTL is disabled.
Test plan:
Compile kernel with CONFIG_NFS enabled and CONFIG_SYSCTL disabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that we have a method of dealing with delegation recalls, actually
enable the caching of posix and BSD locks.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Delegations allow us to cache posix and BSD locks, however when the
delegation is recalled, we need to "flush the cache" and send
the cached LOCK requests to the server.
This patch sets up the mechanism for doing so.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I missed this one... Any form of rename will result in a delegation
recall, so it is more efficient to return the one we hold before
trying the rename.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
RFC 3530 states that for OPEN_DOWNGRADE "The share_access and share_deny
bits specified must be exactly equal to the union of the share_access and
share_deny bits specified for some subset of the OPENs in effect for
current openowner on the current file.
Setattr is currently violating the NFSv4 rules for OPEN_DOWNGRADE in that
it may cause a downgrade from OPEN4_SHARE_ACCESS_BOTH to
OPEN4_SHARE_ACCESS_WRITE despite the fact that there exists no open file
with O_WRONLY access mode.
Fix the problem by replacing nfs4_find_state() with a modified version of
nfs_find_open_context().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We must not remove the nfs4_state structure from the inode open lists
before we are in sequence lock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Optimise attribute revalidation when hardlinking. Add post-op attributes
for the directory and the original inode.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
"Optional" means that the close call will not fail if the getattr
at the end of the compound fails.
If it does succeed, try to refresh inode attributes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since the directory attributes change every time we CREATE a file,
we might as well pick up the new directory attributes in the same
compound.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_lookup() used to consult a lookup cache before trying an actual wire
lookup operation. The lookup cache would be invalid, of course, if the
parent directory's mtime had changed, so nfs_lookup performed an inode
revalidation on the parent.
Since nfs_lookup() doesn't use a cache anymore, the revalidation is no
longer necessary. There are cases where it will generate a lot of
unnecessary GETATTR traffic.
See http://bugzilla.linux-nfs.org/show_bug.cgi?id=9
Test-plan:
Use lndir and "rm -rf" and watch for excess GETATTR traffic or application
level errors.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since we almost always call nfs_end_data_update() after we called
nfs_refresh_inode(), we now end up marking the inode metadata
as needing revalidation immediately after having updated it.
This patch rearranges things so that we mark the inode as needing
revalidation _before_ we call nfs_refresh_inode() on those operations
that need it.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Allow nfs_refresh_inode() also to update attributes on the inode if the
RPC call was sent after the last call to nfs_update_inode().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
resp_len is passed in as buffer size to decode routine; make sure it's
set right in case where userspace provides less than a page's worth of
buffer.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Stop handing garbage to userspace in the case where a weird server clears the
acl bit in the getattr return (despite the fact that they've already claimed
acl support.)
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Storing a pointer to the struct rpc_task in the nfs_seqid is broken
since the nfs_seqid may be freed well after the task has been destroyed.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If someone tries to rename a directory onto an empty directory, we
currently fail and return EBUSY.
This patch ensures that we try the rename if both source and target
are directories, and that we fail with a correct error of EISDIR if
the source is not a directory.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the server is in the unconfirmed OPEN state for a given open owner
and receives a second OPEN for the same open owner, it will cancel the
state of the first request and set up an OPEN_CONFIRM for the second.
This can cause a race that is discussed in rfc3530 on page 181.
The following patch allows the client to recover by retrying the
original open request.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make NFSv4 return the fully initialized file pointer with the
stateid that it created in the lookup w/intent.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We no longer need to worry about collisions between close() and the state
recovery code, since the new close will automatically recheck the
file state once it is done waiting on its sequence slot.
Ditto for the nfs4_proc_locku() procedure.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Once the state_owner and lock_owner semaphores get removed, it will be
possible for other OPEN requests to reopen the same file if they have
lower sequence ids than our CLOSE call.
This patch ensures that we recheck the file state once
nfs_wait_on_sequence() has completed waiting.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFSv4 file state-changing functions such as OPEN, CLOSE, LOCK,... are all
labelled with "sequence identifiers" in order to prevent the server from
reordering RPC requests, as this could cause its file state to
become out of sync with the client.
Currently the NFS client code enforces this ordering locally using
semaphores to restrict access to structures until the RPC call is done.
This, of course, only works with synchronous RPC calls, since the
user process must first grab the semaphore.
By dropping semaphores, and instead teaching the RPC engine to hold
the RPC calls until they are ready to be sent, we can extend this
process to work nicely with asynchronous RPC calls too.
This patch adds a new list called "rpc_sequence" that defines the order
of the RPC calls to be sent. We add one such list for each state_owner.
When an RPC call is ready to be sent, it checks if it is top of the
rpc_sequence list. If so, it proceeds. If not, it goes back to sleep,
and loops until it hits top of the list.
Once the RPC call has completed, it can then bump the sequence id counter,
and remove itself from the rpc_sequence list, and then wake up the next
sleeper.
Note that the state_owner sequence ids and lock_owner sequence ids are
all indexed to the same rpc_sequence list, so OPEN, LOCK,... requests
are all ordered w.r.t. each other.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Oopsable since nfs_wait_on_inode() can get called as part of iput_final().
Unnecessary since the caller had better be damned sure that the inode won't
disappear from underneath it anyway.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If the data cache has been marked as potentially invalid by nfs_refresh_inode,
we should invalidate it rather than assume that changes are due to our own
activity.
Also ensure that we always start with a valid cache before declaring it
to be protected by a delegation.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently rpc_mkdir/rpc_rmdir and rpc_mkpipe/mk_unlink have an API that's
a little unfortunate. They take a path relative to the rpc_pipefs root and
thus need to perform a full lookup. If you look at debugfs or usbfs they
always store the dentry for directories they created and thus can pass in
a dentry + single pathname component pair into their equivalents of the
above functions.
And in fact rpc_pipefs actually stores a dentry for all but one component so
this change not only simplifies the core rpc_pipe code but also the callers.
Unfortuntately this code path is only used by the NFS4 idmapper and
AUTH_GSSAPI for which I don't have a test enviroment. Could someone give
it a spin? It's the last bit needed before we can rework the
lookup_hash API
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Each transport implementation can now set unique bind, connect,
reestablishment, and idle timeout values. These are variables,
allowing the values to be modified dynamically. This permits
exponential backoff of any of these values, for instance.
As an example, we implement exponential backoff for the connection
reestablishment timeout.
Test-plan:
Destructive testing (unplugging the network temporarily). Connectathon
with UDP and TCP.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Implement a best practice: don't use exponential backoff when computing
retransmit timeout values on TCP connections, but simply retransmit
at regular intervals.
This also fixes a bug introduced when xprt_reset_majortimeo() was added.
Test-plan:
Enable RPC debugging and watch timeout behavior on a NFS/TCP mount.
Version: Thu, 11 Aug 2005 16:02:19 -0400
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>