Move log size validation from mirror target to log constructor.
Removed PAGE_SIZE restriction we no longer think necessary.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Change dm_unregister_target to return void and use BUG() for error
reporting.
dm_unregister_target can only fail because of programming bug in the
target driver. It can't fail because of user's behavior or disk errors.
This patch changes unregister_target to return void and use BUG if
someone tries to unregister non-registered target or unregister target
that is in use.
This patch removes code duplication (testing of error codes in all dm
targets) and reports bugs in just one place, in dm_unregister_target. In
some target drivers, these return codes were ignored, which could lead
to a situation where bugs could be missed.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Always increase the error count when I/O on a leg of a mirror fails.
The error count is used to decide whether to select an alternative
mirror leg. If the target doesn't use the "handle_errors" feature, the
error count is not updated and the bio can get requeued forever by the
read callback.
Fix it by increasing error_count before the handle_errors feature
checking.
Cc: stable@kernel.org
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
We queue work on keventd queue --- so this queue must be flushed in the
destructor. Otherwise, keventd could access mirror_set after it was freed.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
Separate the region hash code from raid1 so it can be shared by forthcoming
targets. Use BUG_ON() for failed async dm_io() calls.
Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Change #include "dm.h" to #include <linux/device-mapper.h> in all targets.
Targets should not need direct access to internal DM structures.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move array_too_big to include/linux/device-mapper.h because it is
used by targets.
Remove the test from dm-raid1 as the number of mirror legs is limited
such that it can never fail. (Even for stripes it seems rather
unlikely.)
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm-raid1 is setting the 'DM_KCOPYD_IGNORE_ERROR' flag unconditionally
when assigning kcopyd work. kcopyd is responsible for copying an
assigned section of disk to one or more other disks. The
'DM_KCOPYD_IGNORE_ERROR' flag affects kcopyd in the following way:
When not set:
kcopyd will immediately stop the copy operation when an error is
encountered.
When set:
kcopyd will try to proceed regardless of errors and try to continue
copying any remaining amount.
Since dm-raid1 tracks regions of the address space that are (or
are not) in sync and it now has the ability to handle these
errors, we can safely enable this optimization. This optimization
is conditional on whether mirror error handling has been enabled.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove an avoidable 3ms delay on some dm-raid1 and kcopyd I/O.
It is specified that any submitted bio without BIO_RW_SYNC flag may plug the
queue (i.e. block the requests from being dispatched to the physical device).
The queue is unplugged when the caller calls blk_unplug() function. Usually, the
sequence is that someone calls submit_bh to submit IO on a buffer. The IO plugs
the queue and waits (to be possibly joined with other adjacent bios). Then, when
the caller calls wait_on_buffer(), it unplugs the queue and submits the IOs to
the disk.
This was happenning:
When doing O_SYNC writes, function fsync_buffers_list() submits a list of
bios to dm_raid1, the bios are added to dm_raid1 write queue and kmirrord is
woken up.
fsync_buffers_list() calls wait_on_buffer(). That unplugs the queue, but
there are no bios on the device queue as they are still in the dm_raid1 queue.
wait_on_buffer() starts waiting until the IO is finished.
kmirrord is scheduled, kmirrord takes bios and submits them to the devices.
The submitted bio plugs the harddisk queue but there is no one to unplug it.
(The process that called wait_on_buffer() is already sleeping.)
So there is a 3ms timeout, after which the queues on the harddisks are
unplugged and requests are processed.
This 3ms timeout meant that in certain workloads (e.g. O_SYNC, 8kb writes),
dm-raid1 is 10 times slower than md raid1.
Every time we submit something asynchronously via dm_io, we must unplug the
queue actually to send the request to the device.
This patch adds an unplug call to kmirrord - while processing requests, it keeps
the queue plugged (so that adjacent bios can be merged); when it finishes
processing all the bios, it unplugs the queue to submit the bios.
It also fixes kcopyd which has the same potential problem. All kcopyd requests
are submitted with BIO_RW_SYNC.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
This patch replaces the schedule() in the main kmirrord thread with a timer.
The schedule() could introduce an unwanted delay when work is ready to be
processed.
The code instead calls wake() when there's work to be done immediately, and
delayed_wake() after a failure to give a short delay before retrying.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Clean up the dm-log interface to prepare for publishing it in include/linux.
Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Clean up the kcopyd interface to prepare for publishing it in include/linux.
Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Clean up the dm-io interface to prepare for publishing it in include/linux.
Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move the dirty region log code into a separate module so
other targets can share the code.
Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
write_err is an unsigned long used with set_bit() so should not be passed
around as unsigned int.
http://bugzilla.kernel.org/show_bug.cgi?id=10271
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes two NULL dereferences introduced by commit
06386bbfd2 and spotted by the Coverity
checker.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
test_and_set_bit() on address of uint32_t is a Bad Idea(tm)...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds extra information to the mirror status output, so that
it can be determined which device(s) have failed. For each mirror device,
a character is printed indicating the most severe error encountered. The
characters are:
* A => Alive - No failures
* D => Dead - A write failure occurred leaving mirror out-of-sync
* S => Sync - A sychronization failure occurred, mirror out-of-sync
* R => Read - A read failure occurred, mirror data unaffected
This allows userspace to properly reconfigure the mirror set.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch gives the ability to respond-to/record device failures
that happen during read operations. It also adds the ability to
read from mirror devices that are not the primary if they are
in-sync.
There are essentially two read paths in mirroring; the direct path
and the queued path. When a read request is mapped, if the region
is 'in-sync' the direct path is taken; otherwise the queued path
is taken.
If the direct path is taken, we must record bio information so that
if the read fails we can retry it. We then discover the status of
a direct read through mirror_end_io. If the read has failed, we will
mark the device from which the read was attempted as failed (so we
don't try to read from it again), restore the bio and try again.
If the queued path is taken, we discover the results of the read
from 'read_callback'. If the device failed, we will mark the device
as failed and attempt the read again if there is another device
where this region is known to be 'in-sync'.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch adds the ability to requeue write I/O to
core device-mapper when there is a log device failure.
If a write to the log produces and error, the pending writes are
put on the "failures" list. Since the log is marked as failed,
they will stay on the failures list until a suspend happens.
Suspends come in two phases, presuspend and postsuspend. We must
make sure that all the writes on the failures list are requeued
in the presuspend phase (a requirement of dm core). This means
that recovery must be complete (because writes may be delayed
behind it) and the failures list must be requeued before we
return from presuspend.
The mechanisms to ensure recovery is complete (or stopped) was
already in place, but needed to be moved from postsuspend to
presuspend. We rely on 'flush_workqueue' to ensure that the
mirror thread is complete and therefore, has requeued all writes
in the failures list.
Because we are using flush_workqueue, we must ensure that no
additional 'queue_work' calls will produce additional I/O
that we need to requeue (because once we return from
presuspend, we are unable to do anything about it). 'queue_work'
is called in response to the following functions:
- complete_resync_work = NA, recovery is stopped
- rh_dec (mirror_end_io) = NA, only calls 'queue_work' if it
is ready to recover the region
(recovery is stopped) or it needs
to clear the region in the log*
**this doesn't get called while
suspending**
- rh_recovery_end = NA, recovery is stopped
- rh_recovery_start = NA, recovery is stopped
- write_callback = 1) Writes w/o failures simply call
bio_endio -> mirror_end_io -> rh_dec
(see rh_dec above)
2) Writes with failures are put on
the failures list and queue_work is
called**
** write_callbacks don't happen
during suspend **
- do_failures = NA, 'queue_work' not called if suspending
- add_mirror (initialization) = NA, only done on mirror creation
- queue_bio = NA, 1) delayed I/O scheduled before flush_workqueue
is called. 2) No more I/Os are being issued.
3) Re-attempted READs can still be handled.
(Write completions are handled through rh_dec/
write_callback - mention above - and do not
use queue_bio.)
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch adds the calls to 'fail_mirror' if an error occurs during
mirror recovery (aka resynchronization). 'fail_mirror' is responsible
for recording the type of error by mirror device and ensuring an event
gets raised for the purpose of notifying userspace.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch gives mirror the ability to handle device failures
during normal write operations.
The 'write_callback' function is called when a write completes.
If all the writes failed or succeeded, we report failure or
success respectively. If some of the writes failed, we call
fail_mirror; which increments the error count for the device, notes
the type of error encountered (DM_RAID1_WRITE_ERROR), and
selects a new primary (if necessary). Note that the primary
device can never change while the mirror is not in-sync (IOW,
while recovery is happening.) This means that the scenario
where a failed write changes the primary and gives
recovery_complete a chance to misread the primary never happens.
The fact that the primary can change has necessitated the change
to the default_mirror field. We need to protect against reading
garbage while the primary changes. We then add the bio to a new
list in the mirror set, 'failures'. For every bio in the 'failures'
list, we call a new function, '__bio_mark_nosync', where we mark
the region 'not-in-sync' in the log and properly set the region
state as, RH_NOSYNC. Userspace must also be notified of the
failure. This is done by 'raising an event' (dm_table_event()).
If fail_mirror is called in process context the event can be raised
right away. If in interrupt context, the event is deferred to the
kmirrord thread - which raises the event if 'event_waiting' is set.
Backwards compatibility is maintained by ignoring errors if
the DM_FEATURES_HANDLE_ERRORS flag is not present.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Store a pointer to the owning mirror_set structure within each mirror
structure for a subsequent patch to use.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
There are now two phases to a suspend in device-mapper -
presuspend and postsuspend. This patch removes the
single 'suspend' in the logging API and replaces it with
'presuspend' and 'postsuspend' functions to align it
better with core device-mapper.
A subsequent patch will make use of 'presuspend'.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Replacing n & (n - 1) for power of 2 check by is_power_of_2(n)
Signed-off-by: vignesh babu <vignesh.babu@wipro.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
As bi_end_io is only called once when the reqeust is complete,
the 'size' argument is now redundant. Remove it.
Now there is no need for bio_endio to subtract the size completed
from bi_size. So don't do that either.
While we are at it, change bi_end_io to return void.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Transform some calls to kmalloc/memset to a single kzalloc (or kcalloc).
Here is a short excerpt of the semantic patch performing
this transformation:
@@
type T2;
expression x;
identifier f,fld;
expression E;
expression E1,E2;
expression e1,e2,e3,y;
statement S;
@@
x =
- kmalloc
+ kzalloc
(E1,E2)
... when != \(x->fld=E;\|y=f(...,x,...);\|f(...,x,...);\|x=E;\|while(...) S\|for(e1;e2;e3) S\)
- memset((T2)x,0,E1);
@@
expression E1,E2,E3;
@@
- kzalloc(E1 * E2,E3)
+ kcalloc(E1,E2,E3)
[akpm@linux-foundation.org: get kcalloc args the right way around]
Signed-off-by: Yoann Padioleau <padator@wanadoo.fr>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Cc: Bryan Wu <bryan.wu@analog.com>
Acked-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Dave Airlie <airlied@linux.ie>
Acked-by: Roland Dreier <rolandd@cisco.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Acked-by: Dmitry Torokhov <dtor@mail.ru>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org>
Acked-by: Pierre Ossman <drzeus-list@drzeus.cx>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Greg KH <greg@kroah.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When writing to a mirror, the log must be updated first. Failure
to update the log could result in the log not properly reflecting
the state of the mirror if the machine should crash.
We change the return type of the rh_flush function to give us
the ability to check if a log write was successful. If the
log write was unsuccessful, we fail the writes to avoid the
case where the log does not properly reflect the state of the
mirror.
A follow-up patch - which is dependent on the ability to
requeue I/O's to core device-mapper - will requeue the I/O's
for retry (allowing the mirror to be reconfigured.)
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Device-mapper mirroring currently takes a best effort approach to
recovery - failures during mirror synchronization are completely ignored.
This means that regions are marked 'in-sync' and 'clean' and removed
from the hash list. Future reads and writes that query the region
will incorrectly interpret the region as in-sync.
This patch handles failures during the recovery process. If a failure
occurs, the region is marked as 'not-in-sync' (aka RH_NOSYNC) and added
to a new list 'failed_recovered_regions'.
Regions on the 'failed_recovered_regions' list are not marked as 'clean'
upon removal from the list. Furthermore, if the DM_RAID1_HANDLE_ERRORS
flag is set, the region is marked as 'not-in-sync'. This action prevents
any future read-balancing from choosing an invalid device because of the
'not-in-sync' status.
If "handle_errors" is not specified when creating a mirror (leaving the
DM_RAID1_HANDLE_ERRORS flag unset), failures will be ignored exactly as they
would be without this patch. This is to preserve backwards compatibility with
user-space tools, such as 'pvmove'. However, since future read-balancing
policies will rely on the correct sync status of a region, a user must choose
"handle_errors" when using read-balancing.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A clear_region function is permitted to block (in practice, rare) but gets
called in rh_update_states() with a spinlock held.
The bits being marked and cleared by the above functions are used
to update the on-disk log, but are never read directly. We can
perform these operations outside the spinlock since the
bits are only changed within one thread viz.
- mark_region in rh_inc()
- clear_region in rh_update_states().
So, we grab the clean_regions list items via list_splice() within the
spinlock and defer clear_region() until we iterate over the list for
deletion - similar to how the recovered_regions list is already handled.
We then move the flush() call down to ensure it encapsulates the changes
which are done by the later calls to clear_region().
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix mirror status line broken in dm-log-report-fault-status.patch:
- space missing between two words
- placeholder ("0") required for compatibility with a subsequent patch
- incorrect offset parameter
Cc: stable@kernel.org
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove explicit module name from messages as the macro now includes it
automatically.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The call to rh_in_sync() in do_reads() should be allowed to block. It is in
the mirror worker thread which already permits blocking operations. This will
be needed to support clustered mirroring which will perform network
operations.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the code as it is, it is possible for oustanding clear region requests
never to get flushed when a mirror is deactivated or suspended. This means
there will always be some resync work required when a mirror is activated,
even though it may very well be in-sync.
Always requesting the flush doesn't hurt us. This is because the log tracks
whether any changes occurred and, if not, no flush is performed.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch ports dm-raid1.c to the new dm-io interface.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds the ability to specify desired features in the mirror
constructor/mapping table.
The first feature of interest is "handle_errors". Currently, mirroring will
ignore any I/O errors from the devices. Subsequent patches will check for
this flag and handle the errors. If flag/feature is not present, mirror will
do nothing - maintaining backwards compatibility.
Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch reports the status of the log device so that userspace can detect
the error and take appropriate action.
Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch replaces the single instance of kmirrord by one instance per mirror
set. This change is required to avoid a deadlock in kmirrord when the
persistent dirty log of a mirror itself resides on a mirror. The single
instance of kmirrord then issues a sync write to the dirty log in write_bits
which gets deferred to kmirrord itself later in the call chain. But kmirrord
never does the deferred work because it is still waiting for the sync
write_bits.
_mirror_sets is removed as it no longer needed, and we always flush the
workqueue before destroying it to ensure all work is complete before
destroying it.
Signed-off-by: Holger Smolinski <smolinski@de.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The complete_resync_work function only provides the ability to change an
out-of-sync region to in-sync. This patch enhances the function to allow us
to change the status from in-sync to out-of-sync as well, something that is
needed when a mirror write to one of the devices or an initial resync on a
given region fails.
Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Update existing targets to use the new symbols for return values from target
map and end_io functions.
There is no effect on behaviour.
Test results:
Done build test without errors.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
All device-mapper targets must complete outstanding I/O before suspending.
The mirror target generates I/O in its recovery phase and fails to wait for
it. It needs to be tracked so we can ensure that it has completed before we
suspend.
[akpm@osdl.org: cleanup]
Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: <dm-devel@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove trailing space from 'dmsetup table' output.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On an nForce4-equipped machine with two SATA disk in raid1 setup using dmraid,
we experienced frequent deadlock of the system under high i/o load. 'cat
/dev/zero > ~/zero' was the most reliable way to reproduce them: Randomly
after a few GB, 'cp' would be left in 'D' state along with kjournald and
kmirrord. The functions cp and kjournald were blocked in did vary, but
kmirrord's wchan always pointed to 'mempool_alloc()'. We've seen this pattern
on 2.6.15 and 2.6.17 kernels. http://lkml.org/lkml/2005/4/20/142 indicates
that this problem has been around even before.
So much for the facts, here's my interpretation: mempool_alloc() first tries
to atomically allocate the requested memory, or falls back to hand out
preallocated chunks from the mempool. If both fail, it puts the calling
process (kmirrord in this case) on a private waitqueue until somebody refills
the pool. Where the only 'somebody' is kmirrord itself, so we have a
deadlock.
I worked around this problem by falling back to a (blocking) kmalloc when
before kmirrord would have ended up on the waitqueue. This defeats part of
the benefits of using the mempool, but at least keeps the system running. And
it could be done with a two-line change. Note that mempool_alloc() clears the
GFP_NOIO flag internally, and only uses it to decide whether to wait or return
an error if immediate allocation fails, so the attached patch doesn't change
behaviour in the non-deadlocking case. Path is against current git
(2.6.18-rc4), but should apply to earlier versions as well. I've tested on
2.6.15, where this patch makes the difference between random lockup and a
stable system.
Signed-off-by: Daniel Kobras <kobras@linux.de>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Tidy device-mapper error messages to include context information
automatically.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
kcopyd should accumulate errors - otherwise I/O failures may be ignored
unintentionally.
And invert 'success' (used in a future patch), using a more intuitive
!(read_err || write_err).
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>