Commit graph

813 commits

Author SHA1 Message Date
Frederik Deweerdt
cf13ab8e02 dm: remove md argument from specific_minor
The small patch below:
- Removes the unused md argument from both specific_minor() and next_free_minor()
- Folds kmalloc + memset(0) into a single kzalloc call in alloc_dev()

This has been compile tested on x86.

Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:27:02 +01:00
Adrian Bunk
4fdfe401e9 dm table: remove unused dm_create_error_table
dm_create_error_table() was added in kernel 2.6.18 and never used...

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:27:00 +01:00
Adrian Bunk
e8488d0858 dm table: drop void suspend_targets return
void returning functions returned the return value of another void
returning function...

Spotted by sparse.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:59 +01:00
Mikulas Patocka
7ff14a3615 dm: unplug queues in threads
Remove an avoidable 3ms delay on some dm-raid1 and kcopyd I/O.

It is specified that any submitted bio without BIO_RW_SYNC flag may plug the
queue (i.e. block the requests from being dispatched to the physical device).

The queue is unplugged when the caller calls blk_unplug() function. Usually, the
sequence is that someone calls submit_bh to submit IO on a buffer. The IO plugs
the queue and waits (to be possibly joined with other adjacent bios). Then, when
the caller calls wait_on_buffer(), it unplugs the queue and submits the IOs to
the disk.

This was happenning:

When doing O_SYNC writes, function fsync_buffers_list() submits a list of
bios to dm_raid1, the bios are added to dm_raid1 write queue and kmirrord is
woken up.

fsync_buffers_list() calls wait_on_buffer().  That unplugs the queue, but
there are no bios on the device queue as they are still in the dm_raid1 queue.

wait_on_buffer() starts waiting until the IO is finished.

kmirrord is scheduled, kmirrord takes bios and submits them to the devices.

The submitted bio plugs the harddisk queue but there is no one to unplug it.
(The process that called wait_on_buffer() is already sleeping.)

So there is a 3ms timeout, after which the queues on the harddisks are
unplugged and requests are processed.

This 3ms timeout meant that in certain workloads (e.g. O_SYNC, 8kb writes),
dm-raid1 is 10 times slower than md raid1.

Every time we submit something asynchronously via dm_io, we must unplug the
queue actually to send the request to the device.

This patch adds an unplug call to kmirrord - while processing requests, it keeps
the queue plugged (so that adjacent bios can be merged); when it finishes
processing all the bios, it unplugs the queue to submit the bios.

It also fixes kcopyd which has the same potential problem. All kcopyd requests
are submitted with BIO_RW_SYNC.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-25 13:26:57 +01:00
Mikulas Patocka
a2aebe03be dm raid1: use timer
This patch replaces the schedule() in the main kmirrord thread with a timer.
The schedule() could introduce an unwanted delay when work is ready to be
processed.

The code instead calls wake() when there's work to be done immediately, and
delayed_wake() after a failure to give a short delay before retrying.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:56 +01:00
Alasdair G Kergon
a765e20eeb dm: move include files
Publish the dm-io, dm-log and dm-kcopyd headers in include/linux.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:55 +01:00
Alasdair G Kergon
2d1e580afe dm kcopyd: rename
Rename kcopyd.[ch] to dm-kcopyd.[ch].

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:54 +01:00
Alasdair G Kergon
0da336e5fa dm: expose macros
Make dm.h macros and inlines available in include/linux/device-mapper.h

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:53 +01:00
Mikulas Patocka
945fa4d283 dm kcopyd: remove redundant client counting
Remove client counting code that is no longer needed.

Initialization and destruction is made globally from dm_init and dm_exit and is
not based on client counts. Initialization allocates only one empty slab cache,
so there is no negative impact from performing the initialization always,
regardless of whether some client uses kcopyd or not.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:52 +01:00
Mikulas Patocka
08d8757a4d dm kcopyd: private mempool
Change the global mempool in kcopyd into a per-device mempool to avoid
deadlock possibilities.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:50 +01:00
Mikulas Patocka
8c0cbc2f79 dm kcopyd: per device
Make one kcopyd thread per device.

The original shared kcopyd could deadlock.

Configuration:
2008-04-25 13:26:49 +01:00
Jonathan Brassow
2a23aa1ddb dm log: make module use tracking internal
Remove internal module reference fields from the interface.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:48 +01:00
Alasdair G Kergon
b8206bc3de dm log: move register functions
Reorder a couple of functions in the file so the next patch is readable.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:47 +01:00
Heinz Mauelshagen
416cd17b19 dm log: clean interface
Clean up the dm-log interface to prepare for publishing it in include/linux.

Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:46 +01:00
Heinz Mauelshagen
eb69aca5d3 dm kcopyd: clean interface
Clean up the kcopyd interface to prepare for publishing it in include/linux.

Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:44 +01:00
Heinz Mauelshagen
22a1ceb1e6 dm io: clean interface
Clean up the dm-io interface to prepare for publishing it in include/linux.

Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:43 +01:00
Alasdair G Kergon
e01fd7eeb0 dm io: rename error to error_bits
Rename 'error' to 'error_bits' for clarity.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:41 +01:00
Mikulas Patocka
72727bad54 dm snapshot: store pointer to target instance
Save pointer to dm_target in dm_snapshot structure.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:40 +01:00
Heinz Mauelshagen
769aef30f0 dm log: move dirty region log code into separate module
Move the dirty region log code into a separate module so
other targets can share the code.

Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:39 +01:00
Heinz Mauelshagen
b7fd54a70f dm log: generalise name in messages
Change dm-log.c messages from "mirror log" to "dirty region log" as
a new dm target wants to share this code.

Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:38 +01:00
Robert P. J. Day
c12bfc923e dm raid1: use list_split_init
Use shorter list_splice_init() for brevity.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:36 +01:00
Milan Broz
8ee2767a59 dm snapshot: reduce default memory allocation
Limit the amount of memory allocated per snapshot on systems
with a large page size.  (The larger default chunk size on
these systems compensates for the smaller number of pages reserved.)

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:35 +01:00
Mikulas Patocka
924362629b dm snapshot: fix chunksize sector conversion
If a snapshot has a smaller chunksize than the page size the
conversion to pages currently returns 0 instead of 1, causing:
kernel BUG in mempool_resize.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
2008-04-25 13:26:34 +01:00
Nick Andrew
fdefa4d87e RAID: remove trailing space from printk line
drivers/md/*.[ch] contains only one more printk line with a trailing space.
Remove it.

Signed-off-by: Nick Andrew <nick@nick-andrew.net>
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
2008-04-21 22:42:58 +00:00
Dan Williams
bd2ab67030 md: close a livelock window in handle_parity_checks5
If a failure is detected after a parity check operation has been initiated,
but before it completes handle_parity_checks5 will never quiesce operations on
the stripe.

Explicitly handle this case by "canceling" the parity check, i.e.  clear the
STRIPE_OP_CHECK flags and queue the stripe on the handle list again to refresh
any non-uptodate blocks.

Kernel versions >= 2.6.23 are susceptible.

Cc: <stable@kernel.org>
Cc: NeilBrown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-11 08:06:44 -07:00
Alasdair G Kergon
4cdc1d1fa5 dm io: write error bits form long not int
write_err is an unsigned long used with set_bit() so should not be passed
around as unsigned int.

http://bugzilla.kernel.org/show_bug.cgi?id=10271

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-28 14:45:23 -07:00
Milan Broz
3f1e9070f6 dm crypt: fix ctx pending
Fix regression in dm-crypt introduced in commit
3a7f6c990a ("dm crypt: use async crypto").

If write requests need to be split into pieces, the code must not process them
in parallel because the crypto context cannot be shared.  So there can be
parallel crypto operations on one part of the write, but only one write bio
can be processed at a time.

This is not optimal and the workqueue code needs to be optimized for parallel
processing, but for now it solves the problem without affecting the
performance of synchronous crypto operation (most of current dm-crypt users).

http://bugzilla.kernel.org/show_bug.cgi?id=10242
http://bugzilla.kernel.org/show_bug.cgi?id=10207

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-28 14:45:22 -07:00
Andrew Morton
9ea85ebae1 drivers/md/raid5.c: fix printk warnings
gcc-3.4.5 on sparc64:

drivers/md/raid5.c: In function `raid5_end_read_request':
drivers/md/raid5.c:1147: warning: long long unsigned int format, long unsigned int arg (arg 4)
drivers/md/raid5.c:1164: warning: long long unsigned int format, long unsigned int arg (arg 3)
drivers/md/raid5.c:1170: warning: long long unsigned int format, long unsigned int arg (arg 3)

sector_t is u64, and we don't know what type the architecture uses to
implement u64 (on some it is unsigned long).

Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-19 18:53:37 -07:00
NeilBrown
0e82989d95 md: remove the 'super' sysfs attribute from devices in an 'md' array
Exposing the binary blob which is the md 'super-block' via sysfs doesn't
really fit with the whole sysfs model, and ever since commit
8118a859dc ("sysfs: fix off-by-one error
in fill_read_buffer()") it doesn't actually work at all (as the size of
the blob is often one page).

(akpm: as in, fs/sysfs/file.c:fill_read_buffer() goes BUG)

So just remove it altogether.  It isn't really useful.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-19 18:53:35 -07:00
NeilBrown
7be3dfec47 md: reduce CPU wastage on idle md array with a write-intent bitmap
Recent patch titled
  Reduce CPU wastage on idle md array with a write-intent bitmap.

would sometimes leave the array with dirty bitmap bits that stay dirty.  A
subsequent write would sort things out so it isn't a big problem, but should
be fixed nonetheless.

We need to make sure that when the bitmap becomes not "allclean", the
daemon_sleep really does get set to a sensible value.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-10 18:01:19 -07:00
NeilBrown
52720ae77d md: fix formatting error in /proc/mdstat
If an md array is "auto-read-only", then this appears in /proc/mdstat as

   /dev/md0: active(auto-read-only)

whereas if it is truely readonly, it appears as

   /dev/md0: active (read-only)

The difference being a space.

One program known to parse this file expects the space and gets badly
confused.  It will be fixed, but it would be best if what the kernel generates
is more consistent too.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-10 18:01:19 -07:00
K.Tanaka
a07e6ab41b md: the md RAID10 resync thread could cause a md RAID10 array deadlock
This message describes another issue about md RAID10 found by testing the
2.6.24 md RAID10 using new scsi fault injection framework.

Abstract:

When a scsi error results in disabling a disk during RAID10 recovery, the
resync threads of md RAID10 could stall.

This case, the raid array has already been broken and it may not matter.  But
I think stall is not preferable.  If it occurs, even shutdown or reboot will
fail because of resource busy.

The deadlock mechanism:

The r10bio_s structure has a "remaining" member to keep track of BIOs yet to
be handled when recovering.  The "remaining" counter is incremented when
building a BIO in sync_request() and is decremented when finish a BIO in
end_sync_write().

If building a BIO fails for some reasons in sync_request(), the "remaining"
should be decremented if it has already been incremented.  I found a case
where this decrement is forgotten.  This causes a md_do_sync() deadlock
because md_do_sync() waits for md_done_sync() called by end_sync_write(), but
end_sync_write() never calls md_done_sync() because of the "remaining" counter
mismatch.

For example, this problem would be reproduced in the following case:

Personalities : [raid10]
md0 : active raid10 sdf1[4] sde1[5](F) sdd1[2] sdc1[1] sdb1[6](F)
      3919616 blocks 64K chunks 2 near-copies [4/2] [_UU_]
      [>....................]  recovery =  2.2% (45376/1959808) finish=0.7min speed=45376K/sec

This case, sdf1 is recovering, sdb1 and sde1 are disabled.
An additional error with detaching sdd will cause a deadlock.

md0 : active raid10 sdf1[4] sde1[5](F) sdd1[6](F) sdc1[1] sdb1[7](F)
      3919616 blocks 64K chunks 2 near-copies [4/1] [_U__]
      [=>...................]  recovery =  5.0% (99520/1959808) finish=5.9min speed=5237K/sec

 2739 ?        S<     0:17 [md0_raid10]
28608 ?        D<     0:00 [md0_resync]
28629 pts/1    Ss     0:00 bash
28830 pts/1    R+     0:00 ps ax
31819 ?        D<     0:00 [kjournald]

The resync thread keeps working, but actually it is deadlocked.

Patch:
By this patch, the remaining counter will be decremented if needed.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:18 -08:00
NeilBrown
1c830532f6 md: fix possible raid1/raid10 deadlock on read error during resync
Thanks to K.Tanaka and the scsi fault injection framework, here is a fix for
another possible deadlock in raid1/raid10 error handing.

If a read request returns an error while a resync is happening and a resync
request is pending, the attempt to fix the error will block until the resync
progresses, and the resync will block until the read request completes.  Thus
a deadlock.

This patch fixes the problem.

Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:18 -08:00
Keld Simonsen
8ed3a19563 md: don't attempt read-balancing for raid10 'far' layouts
This patch changes the disk to be read for layout "far > 1" to always be the
disk with the lowest block address.

Thus the chunks to be read will always be (for a fully functioning array) from
the first band of stripes, and the raid will then work as a raid0 consisting
of the first band of stripes.

Some advantages:

The fastest part which is the outer sectors of the disks involved will be
used.  The outer blocks of a disk may be as much as 100 % faster than the
inner blocks.

Average seek time will be smaller, as seeks will always be confined to the
first part of the disks.

Mixed disks with different performance characteristics will work better, as
they will work as raid0, the sequential read rate will be number of disks
involved times the IO rate of the slowest disk.

If a disk is malfunctioning, the first disk which is working, and has the
lowest block address for the logical block will be used.

Signed-off-by: Keld Simonsen <keld@dkuug.dk>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:18 -08:00
NeilBrown
27c529bb8e md: lock access to rdev attributes properly
When we access attributes of an rdev (component device on an md array) through
sysfs, we really need to lock the array against concurrent changes.  We
currently do that when we change an attribute, but not when we read an
attribute.  We need to lock when reading as well else rdev->mddev could become
NULL while we are accessing it.

So add appropriate locking (mddev_lock) to rdev_attr_show.

rdev_size_store requires some extra care as well as it needs to unlock the
mddev while scanning other mddevs for overlapping regions.  We currently
assume that rdev->mddev will still be unchanged after the scan, but that
cannot be certain.  So take a copy of rdev->mddev for use at the end of the
function.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:18 -08:00
NeilBrown
2515619823 md: make sure a reshape is started when device switches to read-write
A resync/reshape/recovery thread will refuse to progress when the array is
marked read-only.  So whenever it mark it not read-only, it is important to
wake up thread resync thread.  There is one place we didn't do this.

The problem manifests if the start_ro module parameters is set, and a raid5
array that is in the middle of a reshape (restripe) is started.  The array
will initially be semi-read-only (meaning it acts like it is readonly until
the first write).  So the reshape will not proceed.

On the first write, the array will become read-write, but the reshape will not
be started, and there is no event which will ever restart that thread.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:18 -08:00
NeilBrown
d0fae18f1b md: clean up irregularity with raid autodetect
When a raid1 array is stopped, all components currently get added to the list
for auto-detection.  However we should really only add components that were
found by autodetection in the first place.  So add a flag to record that
information, and use it.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:18 -08:00
NeilBrown
a1801f858e md: guard against possible bad array geometry in v1 metadata
Make sure the data doesn't start before the end of the superblock when the
superblock is at the start of the device.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:17 -08:00
NeilBrown
8311c29d40 md: reduce CPU wastage on idle md array with a write-intent bitmap
On an md array with a write-intent bitmap, a thread wakes up every few seconds
and scans the bitmap looking for work to do.  If the array is idle, there will
be no work to do, but a lot of scanning is done to discover this.

So cache the fact that the bitmap is completely clean, and avoid scanning the
whole bitmap when the cache is known to be clean.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:17 -08:00
NeilBrown
a35e63efa1 md: fix deadlock in md/raid1 and md/raid10 when handling a read error
When handling a read error, we freeze the array to stop any other IO while
attempting to over-write with correct data.

This is done in the raid1d(raid10d) thread and must wait for all submitted IO
to complete (except for requests that failed and are sitting in the retry
queue - these are counted in ->nr_queue and will stay there during a freeze).

However write requests need attention from raid1d as bitmap updates might be
required.  This can cause a deadlock as raid1 is waiting for requests to
finish that themselves need attention from raid1d.

So we create a new function 'flush_pending_writes' to give that attention, and
call it in freeze_array to be sure that we aren't waiting on raid1d.

Thanks to "K.Tanaka" <k-tanaka@ce.jp.nec.com> for finding and reporting this
problem.

Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:17 -08:00
Adrian Bunk
e03f1a8422 dm-raid1.c: fix NULL dereferences
This patch fixes two NULL dereferences introduced by commit
06386bbfd2 and spotted by the Coverity
checker.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-19 15:52:27 -08:00
Jan Blunck
cf28b4863f d_path: Make d_path() use a struct path
d_path() is used on a <dentry,vfsmount> pair.  Lets use a struct path to
reflect this.

[akpm@linux-foundation.org: fix build in mm/memory.c]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Bryan Wu <bryan.wu@analog.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14 21:17:09 -08:00
Jan Blunck
c32c2f63a9 d_path: Make seq_path() use a struct path argument
seq_path() is always called with a dentry and a vfsmount from a struct path.
Make seq_path() take it directly as an argument.

Signed-off-by: Jan Blunck <jblunck@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14 21:17:08 -08:00
Jan Blunck
1d957f9bf8 Introduce path_put()
* Add path_put() functions for releasing a reference to the dentry and
  vfsmount of a struct path in the right order

* Switch from path_release(nd) to path_put(&nd->path)

* Rename dput_path() to path_put_conditional()

[akpm@linux-foundation.org: fix cifs]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14 21:13:33 -08:00
Jan Blunck
4ac9137858 Embed a struct path into struct nameidata instead of nd->{dentry,mnt}
This is the central patch of a cleanup series. In most cases there is no good
reason why someone would want to use a dentry for itself. This series reflects
that fact and embeds a struct path into nameidata.

Together with the other patches of this series
- it enforced the correct order of getting/releasing the reference count on
  <dentry,vfsmount> pairs
- it prepares the VFS for stacking support since it is essential to have a
  struct path in every place where the stack can be traversed
- it reduces the overall code size:

without patch series:
   text    data     bss     dec     hex filename
5321639  858418  715768 6895825  6938d1 vmlinux

with patch series:
   text    data     bss     dec     hex filename
5320026  858418  715768 6894212  693284 vmlinux

This patch:

Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix cifs]
[akpm@linux-foundation.org: fix smack]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14 21:13:33 -08:00
Al Viro
39ed7adb17 dm-raid1 breakage on 64bit
test_and_set_bit() on address of uint32_t is a Bad Idea(tm)...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-13 08:16:34 -08:00
Jonathan Brassow
af195ac82e dm raid1: report fault status
This patch adds extra information to the mirror status output, so that
it can be determined which device(s) have failed.  For each mirror device,
a character is printed indicating the most severe error encountered.  The
characters are:
 *    A => Alive - No failures
 *    D => Dead - A write failure occurred leaving mirror out-of-sync
 *    S => Sync - A sychronization failure occurred, mirror out-of-sync
 *    R => Read - A read failure occurred, mirror data unaffected
This allows userspace to properly reconfigure the mirror set.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-02-08 02:11:39 +00:00
Jonathan Brassow
06386bbfd2 dm raid1: handle read failures
This patch gives the ability to respond-to/record device failures
that happen during read operations.  It also adds the ability to
read from mirror devices that are not the primary if they are
in-sync.

There are essentially two read paths in mirroring; the direct path
and the queued path.  When a read request is mapped, if the region
is 'in-sync' the direct path is taken; otherwise the queued path
is taken.

If the direct path is taken, we must record bio information so that
if the read fails we can retry it.  We then discover the status of
a direct read through mirror_end_io.  If the read has failed, we will
mark the device from which the read was attempted as failed (so we
don't try to read from it again), restore the bio and try again.

If the queued path is taken, we discover the results of the read
from 'read_callback'.  If the device failed, we will mark the device
as failed and attempt the read again if there is another device
where this region is known to be 'in-sync'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-02-08 02:11:37 +00:00
Jonathan Brassow
b80aa7a0c2 dm raid1: fix EIO after log failure
This patch adds the ability to requeue write I/O to
core device-mapper when there is a log device failure.

If a write to the log produces and error, the pending writes are
put on the "failures" list.  Since the log is marked as failed,
they will stay on the failures list until a suspend happens.

Suspends come in two phases, presuspend and postsuspend.  We must
make sure that all the writes on the failures list are requeued
in the presuspend phase (a requirement of dm core).  This means
that recovery must be complete (because writes may be delayed
behind it) and the failures list must be requeued before we
return from presuspend.

The mechanisms to ensure recovery is complete (or stopped) was
already in place, but needed to be moved from postsuspend to
presuspend.  We rely on 'flush_workqueue' to ensure that the
mirror thread is complete and therefore, has requeued all writes
in the failures list.

Because we are using flush_workqueue, we must ensure that no
additional 'queue_work' calls will produce additional I/O
that we need to requeue (because once we return from
presuspend, we are unable to do anything about it).  'queue_work'
is called in response to the following functions:
- complete_resync_work = NA, recovery is stopped
- rh_dec (mirror_end_io) = NA, only calls 'queue_work' if it
                           is ready to recover the region
                           (recovery is stopped) or it needs
                           to clear the region in the log*
                           **this doesn't get called while
                           suspending**
- rh_recovery_end = NA, recovery is stopped
- rh_recovery_start = NA, recovery is stopped
- write_callback = 1) Writes w/o failures simply call
                   bio_endio -> mirror_end_io -> rh_dec
                   (see rh_dec above)
                   2) Writes with failures are put on
                   the failures list and queue_work is
                   called**
                   ** write_callbacks don't happen
                   during suspend **
- do_failures = NA, 'queue_work' not called if suspending
- add_mirror (initialization) = NA, only done on mirror creation
- queue_bio = NA, 1) delayed I/O scheduled before flush_workqueue
              is called.  2) No more I/Os are being issued.
              3) Re-attempted READs can still be handled.
              (Write completions are handled through rh_dec/
              write_callback - mention above - and do not
              use queue_bio.)

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-02-08 02:11:35 +00:00
Jonathan Brassow
8f0205b798 dm raid1: handle recovery failures
This patch adds the calls to 'fail_mirror' if an error occurs during
mirror recovery (aka resynchronization).  'fail_mirror' is responsible
for recording the type of error by mirror device and ensuring an event
gets raised for the purpose of notifying userspace.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-02-08 02:11:32 +00:00