Commit graph

953 commits

Author SHA1 Message Date
Kay Sievers
edfaa7c365 Driver core: convert block from raw kobjects to core devices
This moves the block devices to /sys/class/block. It will create a
flat list of all block devices, with the disks and partitions in one
directory. For compatibility /sys/block is created and contains symlinks
to the disks.

  /sys/class/block
  |-- sda -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda
  |-- sda1 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda1
  |-- sda10 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda10
  |-- sda5 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda5
  |-- sda6 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda6
  |-- sda7 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda7
  |-- sda8 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda8
  |-- sda9 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda9
  `-- sr0 -> ../../devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sr0

  /sys/block/
  |-- sda -> ../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda
  `-- sr0 -> ../devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sr0

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:36 -08:00
Greg Kroah-Hartman
3830c62fef Kobject: change drivers/md/md.c to use kobject_init_and_add
Stop using kobject_register, as this way we can control the sending of
the uevent properly, after everything is properly initialized.

Cc: Neil Brown <neilb@suse.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:29 -08:00
Dan Williams
0f94e87cde md: fix data corruption when a degraded raid5 array is reshaped
We currently do not wait for the block from the missing device to be
computed from parity before copying data to the new stripe layout.

The change in the raid6 code is not techincally needed as we don't delay
data block recovery in the same way for raid6 yet.  But making the change
now is safer long-term.

This bug exists in 2.6.23 and 2.6.24-rc

Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-08 16:10:35 -08:00
Milan Broz
91e1062592 dm crypt: use bio_add_page
Fix possible max_phys_segments violation in cloned dm-crypt bio.

In write operation dm-crypt needs to allocate new bio request
and run crypto operation on this clone. Cloned request has always
the same size, but number of physical segments can be increased
and violate max_phys_segments restriction.

This can lead to data corruption and serious hardware malfunction.
This was observed when using XFS over dm-crypt and at least
two HBA controller drivers (arcmsr, cciss) recently.

Fix it by using bio_add_page() call (which tests for other
restrictions too) instead of constructing own biovec.

All versions of dm-crypt are affected by this bug.

Cc: stable@kernel.org
Cc:  dm-crypt@saout.de
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-12-20 17:32:13 +00:00
Neil Brown
91212507f9 dm: merge max_hw_sector
Make sure dm honours max_hw_sectors of underlying devices

  We still have no firm testing evidence in support of this patch but
  believe it may help to resolve some bug reports.  - agk

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-12-20 17:32:12 +00:00
Alasdair G Kergon
69267a30be dm: trigger change uevent on rename
Insert a missing KOBJ_CHANGE notification when a device is renamed.

Cc: Scott James Remnant <scott@ubuntu.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-12-20 17:32:11 +00:00
Milan Broz
adfe47702c dm crypt: fix write endio
Fix BIO_UPTODATE test for write io.

Cc: stable@kernel.org
Cc: dm-crypt@saout.de
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-12-20 17:32:10 +00:00
Paul Mundt
d1622e8909 dm mpath: hp requires scsi
With CONFIG_SCSI=n __scsi_print_sense() is never linked in.

drivers/built-in.o: In function `hp_sw_end_io':
dm-mpath-hp-sw.c:(.text+0x914f8): undefined reference to `__scsi_print_sense'

Caught with a randconfig on current git.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-12-20 17:32:09 +00:00
Jun'ichi Nomura
512875bd96 dm: table detect io beyond device
This patch fixes a panic on shrinking a DM device if there is
outstanding I/O to the part of the device that is being removed.
(Normally this doesn't happen - a filesystem would be resized first,
for example.)

The bug is that __clone_and_map() assumes dm_table_find_target()
always returns a valid pointer.  It may fail if a bio arrives from the
block layer but its target sector is no longer included in the DM
btree.

This patch appends an empty entry to table->targets[] which will
be returned by a lookup beyond the end of the device.

After calling dm_table_find_target(), __clone_and_map() and target_message()
check for this condition using
dm_target_is_valid().

Sample test script to trigger oops:
2007-12-20 17:32:08 +00:00
Dan Williams
6c55be8b96 raid5: fix unending write sequence
<debug output from Joel's system>
handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
check 5: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800ffcffcc0 written 0000000000000000
check 4: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800fdd4e360 written 0000000000000000
check 3: state 0x1 toread 0000000000000000 read 0000000000000000 write 0000000000000000 written 0000000000000000
check 2: state 0x1 toread 0000000000000000 read 0000000000000000 write 0000000000000000 written 0000000000000000
check 1: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800ff517e40 written 0000000000000000
check 0: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800fd4cae60 written 0000000000000000
locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0
for sector 7629696, rmw=0 rcw=0
</debug>

These blocks were prepared to be written out, but were never handled in
ops_run_biodrain(), so they remain locked forever.  The operations flags
are all clear which means handle_stripe() thinks nothing else needs to be
done.

This state suggests that the STRIPE_OP_PREXOR bit was sampled 'set' when it
should not have been.  This patch cleans up cases where the code looks at
sh->ops.pending when it should be looking at the consistent stack-based
snapshot of the operations flags.

Report from Joel:
	Resync done. Patch fix this bug.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Joel Bertrand <joel.bertrand@systella.fr>
Cc: <stable@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-14 18:45:39 -08:00
Alan D. Brunelle
2ad8b1ef11 Add UNPLUG traces to all appropriate places
Added blk_unplug interface, allowing all invocations of unplugs to result
in a generated blktrace UNPLUG.

Signed-off-by: Alan D. Brunelle <Alan.Brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-11-09 13:41:32 +01:00
Neil Brown
def6ae26a9 md: fix misapplied patch in raid5.c
commit 4ae3f847e4 ("md: raid5: fix
clearing of biofill operations") did not get applied correctly,
presumably due to substantial similarities between handle_stripe5 and
handle_stripe6.

This patch moves the chunk of new code from handle_stripe6 (where it isn't
needed (yet)) to handle_stripe5.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: "Dan Williams" <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-05 15:12:32 -08:00
Vasily Averin
5ec140e600 dm: bounce_pfn limit added
Device mapper uses its own bounce_pfn that may differ from one on underlying
device. In that way dm can build incorrect requests that contain sg elements
greater than underlying device is able to handle.

This is the cause of slab corruption in i2o layer, occurred on i386 arch when
very long direct IO requests are addressed to dm-over-i2o device.

Signed-off-by: Vasily Averin <vvs@sw.ru>
Cc: <stable@kernel.org>
Cc: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-11-02 08:47:25 +01:00
Al Viro
ca5cd877ae x86 merge fallout: uml
Don't undef __i386__/__x86_64__ in uml anymore, make sure that (few) places
that required adjusting the ifdefs got those.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-29 07:41:32 -07:00
Herbert Xu
68e3f5dd4d [CRYPTO] users: Fix up scatterlist conversion errors
This patch fixes the errors made in the users of the crypto layer during
the sg_init_table conversion.  It also adds a few conversions that were
missing altogether.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-27 00:52:07 -07:00
Jens Axboe
642f149031 SG: Change sg_set_page() to take length and offset argument
Most drivers need to set length and offset as well, so may as well fold
those three lines into one.

Add sg_assign_page() for those two locations that only needed to set
the page, where the offset/length is set outside of the function context.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-24 11:20:47 +02:00
Dan Williams
4ae3f847e4 md: raid5: fix clearing of biofill operations
ops_complete_biofill() runs outside of spin_lock(&sh->lock) and clears the
'pending' and 'ack' bits.  Since the test_and_ack_op() macro only checks
against 'complete' it can get an inconsistent snapshot of pending work.

Move the clearing of these bits to handle_stripe5(), under the lock.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Joel Bertrand <joel.bertrand@systella.fr>
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Stable <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-23 08:32:06 -07:00
NeilBrown
85bfb4da8c md: fix an unsigned compare to allow creation of bitmaps with v1.0 metadata
As page->index is unsigned, this all becomes an unsigned comparison,
which almost always returns an error.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Stable <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-23 08:32:06 -07:00
Jens Axboe
45711f1af6 [SG] Update drivers to use sg helpers
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-22 21:19:53 +02:00
Linus Torvalds
c00046c279 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (74 commits)
  fix do_sys_open() prototype
  sysfs: trivial: fix sysfs_create_file kerneldoc spelling mistake
  Documentation: Fix typo in SubmitChecklist.
  Typo: depricated -> deprecated
  Add missing profile=kvm option to Documentation/kernel-parameters.txt
  fix typo about TBI in e1000 comment
  proc.txt: Add /proc/stat field
  small documentation fixes
  Fix compiler warning in smount example program from sharedsubtree.txt
  docs/sysfs: add missing word to sysfs attribute explanation
  documentation/ext3: grammar fixes
  Documentation/java.txt: typo and grammar fixes
  Documentation/filesystems/vfs.txt: typo fix
  include/asm-*/system.h: remove unused set_rmb(), set_wmb() macros
  trivial copy_data_pages() tidy up
  Fix typo in arch/x86/kernel/tsc_32.c
  file link fix for Pegasus USB net driver help
  remove unused return within void return function
  Typo fixes retrun -> return
  x86 hpet.h: remove broken links
  ...
2007-10-19 20:36:17 -07:00
Milan Broz
80fd662683 dm crypt: tidy pending
Add crypt prefix to dec_pending to avoid confusing it in backtraces with
the dm core function of the same name.

No functional change here.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:28 +01:00
Mike Anderson
b15546f942 dm mpath: send uevents
This patch adds calls to dm_path_event for a failed path and a reinstated
path.

Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:27 +01:00
Mike Anderson
7a8c3d3b92 dm: uevent generate events
This patch adds support for the dm_path_event dm_send_event functions which
create and send udev events.

Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:26 +01:00
Mike Anderson
51e5b2bd34 dm: add uevent to core
This patch adds a uevent skeleton to device-mapper.

Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:24 +01:00
Mike Anderson
96a1f7dba6 dm: export name and uuid
This patch adds a function to obtain a copy of a mapped device's name and uuid.

Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:23 +01:00
Jonathan Brassow
aa5617c553 dm raid1: add mirror_set to struct mirror
Store a pointer to the owning mirror_set structure within each mirror
structure for a subsequent patch to use.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:22 +01:00
Jonathan Brassow
6b3df0d7a5 dm log: split suspend
There are now two phases to a suspend in device-mapper -
presuspend and postsuspend.  This patch removes the
single 'suspend' in the logging API and replaces it with
'presuspend' and 'postsuspend' functions to align it
better with core device-mapper.

A subsequent patch will make use of 'presuspend'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:21 +01:00
Dave Wysochanski
fe97e2aa05 dm mpath: hp retry if not ready
This patch adds retries to the hp hardware handler, and utilizes the
MP_RETRY flag of dm-multipath.  For now in the hp handler, if we get a
pg_init completed with a check condition we just assume we can retry the
pg_init command.  We make this assumption because of incomplete data on
specific check condition code of the HP hardware, and because testing
has shown the HP path initialization command to be idempotent.
The number of times we retry is settable via the "pg_init_retries"
multipath map feature.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:20 +01:00
Dave Wysochanski
16ebbf3584 dm mpath: add hp handler
This patch adds the most basic dm-multipath hardware support for the
HP active/passive arrays.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:19 +01:00
Dave Wysochanski
c9e45581ad dm mpath: add retry pg init
This patch allows a failed path group initialisation command to be retried.

It adds a generic MP_RETRY flag and a "pg_init_retries" feature to
device-mapper multipath which limits the number of retries.

1. A hw handler sends a path initialization command to the storage and
the command completes with an error code indicating the command
should be retried.

2. The hardware handler calls dm_pg_init_complete() with MP_RETRY
set in err_flags to ask the dm multipath core to retry.

3. If the retry limit has not been exceeded, pg_init() is retried.
Otherwise fail_path() is called.

If you are using the userspace multipath-tools or device-mapper-multipath
package, you can set pg_init_retries in the 'device' section of your
/etc/multipath.conf file. For example:

features                "2 pg_init_retries 7"

The number of PG retries attempted is reported in the 'dmsetup status' output.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:18 +01:00
Milan Broz
636d5786c4 dm crypt: tidy labels
Replace numbers with names in labels in error paths, to avoid confusion
when new one get added between existing ones.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:17 +01:00
Milan Broz
d469f84197 dm crypt: tidy whitespace
Clean up, convert some spaces to tabs.

No functional change here.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:15 +01:00
Milan Broz
cabf08e4d3 dm crypt: add post processing queue
Add post-processing queue (per crypt device) for read operations.

Current implementation uses only one queue for all operations
and this can lead to starvation caused by many requests waiting
for memory allocation. But the needed memory-releasing operation
is queued after these requests (in the same queue).

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:14 +01:00
Milan Broz
9934a8bea2 dm crypt: use per device singlethread workqueues
Use a separate single-threaded workqueue for each crypt device
instead of one global workqueue.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:13 +01:00
Alasdair G Kergon
d336416ff1 dm mpath: emc fix an error message
Correct an error message, reported by Michael Wood <michael@frogfoot.com>.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:12 +01:00
Alasdair G Kergon
051814c69f dm: bio_list macro renaming
Remove BIO_LIST and DEFINE_BIO_LIST macros that gain us nothing
since contents are initialised to NULL.

Cc: Jan Engelhardt <jengelh@linux01.gwdg.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:11 +01:00
Jesper Juhl
bb56acf840 dm io:ctl remove vmalloc void cast
In drivers/md/dm-ioctl.c::copy_params() there's a call to vmalloc()
where we currently cast the return value, but that's pretty pointless
given that vmalloc() returns "void *".

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:10 +01:00
Milan Broz
9e4e5f87eb dm: tidy bio_io_error usage
Use bio_io_error() in only two places and tidy the code,
preparing for later patches.

There is no functional change in this patch.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:09 +01:00
Matthias Kaehlcke
def5b5b26e kcopyd use mutex instead of semaphore
Kcopyd uses a semaphore as mutex.  Use the mutex API instead of the (binary)
semaphore,

Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-10-20 02:01:08 +01:00
Dmitry Monakhov
094262db9e dm: use kzalloc
Convert kmalloc() + memset() to kzalloc().

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:07 +01:00
vignesh babu
6f3c3f0afa dm: use is_power_of_2
Replacing n & (n - 1) for power of 2 check by is_power_of_2(n)

Signed-off-by: vignesh babu <vignesh.babu@wipro.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:06 +01:00
Jun'ichi Nomura
ae9da83f6d dm: fix thaw_bdev
This patch fixes a bd_mount_sem counter corruption bug in device-mapper.

thaw_bdev() should be called only when freeze_bdev() was called for the
device.
Otherwise, thaw_bdev() will up bd_mount_sem and corrupt the semaphore counter.
struct block_device with the corrupted semaphore may remain in slab cache
and be reused later.

Attached patch will fix it by calling unlock_fs() instead.
unlock_fs() will determine whether it should call thaw_bdev()
by checking the device is frozen or not.

Easy reproducer is:
  #!/bin/sh
  while [ 1 ]; do
     dmsetup --notable create a
     dmsetup --nolockfs suspend a
     dmsetup remove a
  done

It's not easy to see the effect of corrupted semaphore.
So I have tested with putting printk below in bdev_alloc_inode():
        if (atomic_read(&ei->bdev.bd_mount_sem.count) != 1)
                printk(KERN_DEBUG "Incorrect semaphore count = %d (%p)\n",
                        atomic_read(&ei->bdev.bd_mount_sem.count),
                        &ei->bdev);

Without the patch, I saw something like:
 Incorrect semaphore count = 17 (f2ab91c0)

With the patch, the message didn't appear.

The bug was introduced in 2.6.16 with this bug fix:

commit d9dde59ba0
Date:   Fri Feb 24 13:04:24 2006 -0800

    [PATCH] dm: missing bdput/thaw_bdev at removal

    Need to unfreeze and release bdev otherwise the bdev inode with
    inconsistent state is reused later and cause problem.

and backported to 2.6.15.5.

It occurs only in free_dev(), which is called only when the dm device is
removed.  The buggy code is executed only if md->suspended_bdev is
non-NULL and that can happen only when the device was suspended without
noflush.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
2007-10-20 02:01:05 +01:00
Milan Broz
79662d1ea3 dm delay: fix status
Fix missing space in dm-delay target status output
if separate read and write delay are configured.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:04 +01:00
Dmitry Monakhov
2e64a0f928 dm delay: fix ctr error paths
Add missing 'dm_put_device' to dm-delay target constructor.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:02 +01:00
Dmitry Monakhov
a72cf737e0 dm raid1: fix leakage
Add missing 'dm_io_client_destroy' to alloc_context error path.
Reorganize mirror constructor error path in order to prevent
workqueue leakage.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:01 +01:00
Dmitry Monakhov
815f9e3270 dm crypt: missing kfree in ctr error path
Insert missing kfree() in crypt_iv_essiv_ctr() error path.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:01 +01:00
Dmitry Monakhov
55b42c5ae9 dm crypt: drop device ref in ctr error path
Add a missing 'dm_put_device' in an error path in crypt target constructor.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:00:59 +01:00
Milan Broz
027d50f92e dm io:ctl use constant struct size
Make size of dm_ioctl struct always 312 bytes on all supported
architectures.

This change retains compatibility with already-compiled code because
it uses an embedded offset to locate the payload that follows the
structure.

On 64-bit architectures there is no change at all; on 32-bit
we are increasing the size of dm-ioctl from 308 to 312 bytes.

Currently with 32-bit userspace / 64-bit kernel on x86_64
some ioctls (including rename, message) are incorrectly rejected
by the comparison against 'param + 1'.  This breaks userspace
lvrename and multipath 'fail_if_no_path' changes, for example.

(BTW Device-mapper uses its own versioning and ignores the ioctl
size bits.  Only the generic ioctl compat code on mixed arches
checks them, and that will continue to accept both sizes for now,
but we intend to list 308 as deprecated and eventually remove it.)

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Guido Guenther <agx@sigxcpu.org>
Cc: Kevin Corry <kevcorry@us.ibm.com>
Cc: stable@kernel.org
2007-10-20 02:00:58 +01:00
Bryn M. Reeves
c7ac86de6a dm mpath: rdac fix init race
Re-order the initialisation of dm-rdac to avoid registering the hw
handler before the workqueue has been initialised. Closes a race
that would potentially give an oops.

Signed-off-by: Bryn M. Reeves <breeves@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:00:57 +01:00
Jan Engelhardt
96de0e252c Convert files to UTF-8 and some cleanups
* Convert files to UTF-8.

  * Also correct some people's names
    (one example is Eißfeldt, which was found in a source file.
    Given that the author used an ß at all in a source file
    indicates that the real name has in fact a 'ß' and not an 'ss',
    which is commonly used as a substitute for 'ß' when limited to
    7bit.)

  * Correct town names (Goettingen -> Göttingen)

  * Update Eberhard Mönkeberg's address (http://lkml.org/lkml/2007/1/8/313)

Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
2007-10-19 23:21:04 +02:00
Robert P. J. Day
3a4fa0a25d Fix misspellings of "system", "controller", "interrupt" and "necessary".
Fix the various misspellings of "system", controller", "interrupt" and
"[un]necessary".

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
2007-10-19 23:10:43 +02:00
Pavel Emelyanov
ba25f9dcc4 Use helpers to obtain task pid in printks
The task_struct->pid member is going to be deprecated, so start
using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
the kernel.

The first thing to start with is the pid, printed to dmesg - in
this case we may safely use task_pid_nr(). Besides, printks produce
more (much more) than a half of all the explicit pid usage.

[akpm@linux-foundation.org: git-drm went and changed lots of stuff]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:43 -07:00
NeilBrown
cf7a44168d md: make sure read errors are auto-corrected during a 'check' resync in raid1
Whenever a read error is found, we should attempt to overwrite with correct
data to 'fix' it.

However when do a 'check' pass (which compares data blocks that are
successfully read, but doesn't normally overwrite) we don't do that.  We
should.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:03 -07:00
Iustin Pop
d7f3d291a0 md: expose the degraded status of an assembled array through sysfs
The 'degraded' attribute is useful to quickly determine if the array is
degraded, instead of parsing 'mdadm -D' output or relying on the other
techniques (number of working devices against number of defined devices,
etc.).  The md code already keeps track of this attribute, so it's useful to
export it.

Signed-off-by: Iustin Pop <iusty@k1024.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:03 -07:00
NeilBrown
2b12ab6d33 md: 'sync_action' in sysfs returns wrong value for readonly arrays
When an array is started read-only, MD_RECOVERY_NEEDED can be set but no
recovery will be running.  This causes 'sync_action' to report the wrong
value.

We could remove the test for MD_RECOVERY_NEEDED, but doing so would leave a
small gap after requesting a sync action, where 'sync_action' would still
report the old value.

So make sure that for a read-only array, 'sync_action' always returns 'idle'.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:03 -07:00
NeilBrown
8299d7f7c0 md: fix a bug in some never-used code.
http://bugzilla.kernel.org/show_bug.cgi?id=3277

There is a seq_printf here that isn't being passed a 'seq'.  Howeve as the
code is inside #ifdef MD_DEBUG, nobody noticed.

Also remove some extra spaces.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:03 -07:00
Michael J. Evans
4d936ec1fd md: software Raid autodetect dev list not array
In current release kernels the md module (Software RAID) uses a static
array (dev_t[128]) to store partition/device info temporarily for
autostart.

I discovered this (and that the devices are added as disks/partitions are
discovered at boot) while I was debugging why only one of my MD arrays would
come up whole, while all the others were short a disk.

I eventually discovered that it was enumerating through all of 9 of my 11 hds
(2 had only 4 partitions apiece) while the other 9 have 15 partitions (I
wanted 64 per drive...).  The last partition of the 8th drive in my 9 drive
raid 5 sets wasn't added, thus making the final md array short both a parity
and data disk, and it was started later, elsewhere.

This patch replaces that static array with a list.

[akpm@linux-foundation.org: removed unused var]
Signed-off-by: Michael J. Evans <mjevans1983@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:03 -07:00
Neil Brown
644bd2f048 Fix memory leak in dm-crypt
dm-crypt used the ->bi_size member in the bio endio handling to
free the appropriate pages, but it frees all of it from both call
paths. With the ->bi_end_io() changes, ->bi_size was always 0 since
we don't do partial completes. This caused dm-crypt to leak memory.

Fix this by removing the size argument from crypt_free_buffer_pages().

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 13:48:46 +02:00
Jens Axboe
fd5d806266 block: convert blkdev_issue_flush() to use empty barriers
Then we can get rid of ->issue_flush_fn() and all the driver private
implementations of that.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:05:02 +02:00
Geert Uytterhoeven
0dddfd46f0 dm: emc_endio returns void
emc_endio returns void:
  linux/drivers/md/dm-emc.c: In function 'emc_endio':
  linux/drivers/md/dm-emc.c:58: warning: 'return' with a value, in function returning void

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-13 09:41:03 -07:00
Greg Kroah-Hartman
19c38de88a kobjects: fix up improper use of the kobject name field
A number of different drivers incorrect access the kobject name field
directly.  This is not correct as the name might not be in the array.
Use the proper accessor function instead.
2007-10-12 14:51:02 -07:00
NeilBrown
6712ecf8f6 Drop 'size' argument from bio_endio and bi_end_io
As bi_end_io is only called once when the reqeust is complete,
the 'size' argument is now redundant.  Remove it.

Now there is no need for bio_endio to subtract the size completed
from bi_size.  So don't do that either.

While we are at it, change bi_end_io to return void.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:57 +02:00
NeilBrown
66846572bf Stop exporting blk_rq_bio_prep
blk_rq_bio_prep is exported for use in exactly
one place.  That place can benefit from using
the new blk_rq_append_bio instead.
So
  - change dm-emc to call blk_rq_append_bio
  - stop exporting blk_rq_bio_prep, and
  - initialise rq_disk in blk_rq_bio_prep,
       as dm-emc needs it.

Signed-off-by: Neil Brown <neilb@suse.de>

diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:56 +02:00
Dan Williams
e4d84909dd raid5: fix 2 bugs in ops_complete_biofill
1/ ops_complete_biofill tried to avoid calling handle_stripe since all the
state necessary to return read completions is available.  However the
process of determining whether more read requests are pending requires
locking the stripe (to block add_stripe_bio from updating dev->toead).
ops_complete_biofill can run in tasklet context, so rather than upgrading
all the stripe locks from spin_lock to spin_lock_bh this patch just
unconditionally reschedules handle_stripe after completing the read
request.

2/ ops_complete_biofill needlessly qualified processing R5_Wantfill with
dev->toread.  The result being that the 'biofill' pending bit is cleared
before handling the pending read-completions on dev->read.  R5_Wantfill can
be unconditionally handled because the 'biofill' pending bit prevents new
R5_Wantfill requests from being seen by ops_run_biofill and
ops_complete_biofill.

Found-by: Yuri Tikhonov <yur@emcraft.com>
[neilb@suse.de: simpler fix for bug 1 than moving code]
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2007-09-24 13:23:35 -07:00
aherrman@arcor.de
2123a09f3f Fix kernel buuild with (CONFIG_COMPAT && ! CONFIG_BLOCK)
Commit 02a5e0acb3 ("BLOCK: Hide the
contents of linux/bio.h if CONFIG_BLOCK=n") broke the kernel build for
the CONFIG_COMPAT && !CONFIG_BLOCK case:

    CC      fs/compat_ioctl.o
  In file included from include/linux/raid/md_k.h:19,
                   from include/linux/raid/md.h:54,
                   from fs/compat_ioctl.c:25:
  include/linux/raid/../../../drivers/md/dm-bio-list.h: In bio_list_:
  include/linux/raid/../../../drivers/md/dm-bio-list.h:40: error: dereferencing pointer to incomplete type
  include/linux/raid/../../../drivers/md/dm-bio-list.h: In bio_list_:
  include/linux/raid/../../../drivers/md/dm-bio-list.h:48: error: dereferencing pointer to incomplete type
  include/linux/raid/../../../drivers/md/dm-bio-list.h:51: error: dereferencing pointer to incomplete type
  include/linux/raid/../../../drivers/md/dm-bio-list.h: In bio_list_:
  include/linux/raid/../../../drivers/md/dm-bio-list.h:64: error: dereferencing pointer to incomplete type
  include/linux/raid/../../../drivers/md/dm-bio-list.h: In bio_list_merge_:
  include/linux/raid/../../../drivers/md/dm-bio-list.h:78: error: dereferencing pointer to incomplete type
  include/linux/raid/../../../drivers/md/dm-bio-list.h: In bio_list_:
  include/linux/raid/../../../drivers/md/dm-bio-list.h:90: error: dereferencing pointer to incomplete type
  include/linux/raid/../../../drivers/md/dm-bio-list.h:94: error: dereferencing pointer to incomplete type
  make[1]: *** [fs/compat_ioctl.o] Error 1
  make: *** [fs] Error 2

Signed-off-by: Andreas Herrmann <aherrman@arcor.de>
Acked-By: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-14 13:56:47 -07:00
NeilBrown
a2e0855182 md: fix some bugs with growing raid5/raid6 arrays.
The recent changed to raid5 to allow offload of parity calculation etc
introduced some bugs in the code for growing (i.e.  adding a disk to) raid5
and raid6.  This fixes them

Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-11 17:21:19 -07:00
Andrew Vasquez
f99ba18a96 dm-mpath-rdac: don't stomp on a requests transfer bit
Without this, we get qla2xxx complaining about "ISP System Error".

What's happening here is the firmware is detecting a Xfer-ready from the
storage when in fact the data-direction for a mode-select should be a
write (DATA_OUT).

The following patch fixes the problem (typo). Verified by Brian, as
well.

Signed-off-by: Andrew Vasquez <andrew.vasquez@qlogic.com>
Verified-by: Brian De Wolf <bldewolf@csupomona.edu>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-27 16:15:44 -07:00
Randy Dunlap
6e106b0d97 DM_MULTIPATH_RDAC: "scsi_normalize_sense" undefined
DM_MULTIPATH_RDAC uses SCSI API(s) and is for a SCSI device,
so add SCSI to its depends on to prevent build errors.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
[ Tested and Verified by Chandra Seetharaman ]
Acked-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-24 16:10:39 -07:00
NeilBrown
a88aa7865b md: correctly update sysfs when a raid1 is reshaped
When a raid1 array is reshaped (number of drives changed), the list of devices
is compacted, so that slots for missing devices are filled with working
devices from later slots.  This requires the "rd%d" symlinks in sysfs to be
updated.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-22 19:52:46 -07:00
NeilBrown
918f02383f md: make sure a re-add after a restart honours bitmap when resyncing
Commit 1757128438 was slightly bad.  If an array
has a write-intent bitmap, and you remove a drive, then readd it, only the
changed parts should be resynced.  However after the above commit, this only
works if the array has not been shut down and restarted.

This is because it sets 'fullsync' at little more often than it should.  This
patch is more careful.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-22 19:52:46 -07:00
Alan D. Brunelle
c7149d6bce Fix remap handling by blktrace
This patch provides more information concerning REMAP operations on block
IOs. The additional information provides clearer details at the user level,
and supports post-processing analysis in btt.

o  Adds in partition remaps on the same device.
o  Fixed up the remap information in DM to be in the right order
o  Sent up mapped-from and mapped-to device information

Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-08-11 22:34:48 +02:00
Arne Redlich
f6f953aa99 md: handle writes to broken raid10 arrays gracefully
When writing to a broken array, raid10 currently happily emits empty bio
lists.  IOW, the master bio will never be completed, sending writers to
UNINTERRUPTIBLE_SLEEP forever.

Signed-off-by: Arne Redlich <agr@powerkom-dd.de>
Acked-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-31 15:39:38 -07:00
Maik Hampel
14e713446a md: raid10: fix use-after-free of bio
In case of read errors raid10d tries to print a nice error message,
unfortunately using data from an already put bio.

Signed-off-by: Maik Hampel <m.hampel@gmx.de>
Acked-By: NeilBrown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-31 15:39:38 -07:00
Jens Axboe
165125e1e4 [BLOCK] Get rid of request_queue_t typedef
Some of the code has been gradually transitioned to using the proper
struct request_queue, but there's lots left. So do a full sweet of
the kernel and get rid of this typedef and replace its uses with
the proper type.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-24 09:28:11 +02:00
Milan Broz
80b16c192e dm io: fix panic on large request
Flush workqueue before releasing bioset and mopools in dm-crypt.  There can
be finished but not yet released request.

Call chain causing oops:
  run workqueue
    dec_pending
      bio_endio(...);
      	<remove device request - remove mempool>
      mempool_free(io, cc->io_pool);

This usually happens when cryptsetup create temporary
luks mapping in the beggining of crypt device activation.

When dm-core calls destructor crypt_dtr, no new request
are possible.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Cc: Chuck Ebbert <cebbert@redhat.com>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Cc: Christophe Saout <christophe@saout.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-21 17:49:14 -07:00
Dan Williams
eb0645a8b1 async_tx: fix kmap_atomic usage in async_memcpy
Andrew Morton:
	[async_memcpy] is very wrong if both ASYNC_TX_KMAP_DST and
	ASYNC_TX_KMAP_SRC can ever be set.  We'll end up using the same kmap
	slot for both src add dest and we get either corrupted data or a BUG.

Evgeniy Polyakov:
	Btw, shouldn't it always be kmap_atomic() even if flag is not set.
	That pages are usual one returned by alloc_page().

So fix the usage of kmap_atomic and kill the ASYNC_TX_KMAP_DST and
ASYNC_TX_KMAP_SRC flags.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-20 08:44:19 -07:00
Paul Mundt
20c2df83d2 mm: Remove slab destructors from kmem_cache_create().
Slab destructors were no longer supported after Christoph's
c59def9f22 change. They've been
BUGs for both slab and slub, and slob never supported them
either.

This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-07-20 10:11:58 +09:00
Yoann Padioleau
dd00cc486a some kmalloc/memset ->kzalloc (tree wide)
Transform some calls to kmalloc/memset to a single kzalloc (or kcalloc).

Here is a short excerpt of the semantic patch performing
this transformation:

@@
type T2;
expression x;
identifier f,fld;
expression E;
expression E1,E2;
expression e1,e2,e3,y;
statement S;
@@

 x =
- kmalloc
+ kzalloc
  (E1,E2)
  ...  when != \(x->fld=E;\|y=f(...,x,...);\|f(...,x,...);\|x=E;\|while(...) S\|for(e1;e2;e3) S\)
- memset((T2)x,0,E1);

@@
expression E1,E2,E3;
@@

- kzalloc(E1 * E2,E3)
+ kcalloc(E1,E2,E3)

[akpm@linux-foundation.org: get kcalloc args the right way around]
Signed-off-by: Yoann Padioleau <padator@wanadoo.fr>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Cc: Bryan Wu <bryan.wu@analog.com>
Acked-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Dave Airlie <airlied@linux.ie>
Acked-by: Roland Dreier <rolandd@cisco.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Acked-by: Dmitry Torokhov <dtor@mail.ru>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org>
Acked-by: Pierre Ossman <drzeus-list@drzeus.cx>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Greg KH <greg@kroah.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:50 -07:00
Jesper Juhl
851a8a7fd4 dm: fix memory leak in dm_create_persistent() when starting metadata update thread fails
If, in dm_create_persistent(), the call to create_singlethread_workqueue()
fails then we'll return without freeing the memory allocated to 'ps', thus
leaking sizeof(struct pstore) bytes.  This patch fixes the leak.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-18 08:38:22 -07:00
NeilBrown
4ad1366376 md: change bitmap_unplug and others to void functions
bitmap_unplug only ever returns 0, so it may as well be void.  Two callers try
to print a message if it returns non-zero, but that message is already printed
by bitmap_file_kick.

write_page returns an error which is not consistently checked.  It always
causes BITMAP_WRITE_ERROR to be set on an error, and that can more
conveniently be checked.

When the return of write_page is checked, an error causes bitmap_file_kick to
be called - so move that call into write_page - and protect against recursive
calls into bitmap_file_kick.

bitmap_update_sb returns an error that is never checked.

So make these 'void' and be consistent about checking the bit.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:15 -07:00
NeilBrown
f0d76d70bc md: check that internal bitmap does not overlap other data
We current completely trust user-space to set up metadata describing an
consistant array.  In particlar, that the metadata, data, and bitmap do not
overlap.

But userspace can be buggy, and it is better to report an error than corrupt
data.  So put in some appropriate checks.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:15 -07:00
NeilBrown
713f6ab18b md: improve the is_mddev_idle test fix
Don't use 'unsigned' variable to track sync vs non-sync IO, as the only thing
we want to do with them is a signed comparison, and fix up the comment which
had become quite wrong.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:15 -07:00
NeilBrown
df968c4e8d md: improve message about invalid superblock during autodetect
People try to use raid auto-detect with version-1 superblocks (which is not
supported) and get confused when they are told they have an invalid
superblock.

So be more explicit, and say it it is not a valid v0.90 superblock.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:15 -07:00
Jan Engelhardt
afd44034ac Use menuconfig objects II - MD
Change Kconfig objects from "menu, config" into "menuconfig" so
that the user can disable the whole feature without having to
enter the menu first.

Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:15 -07:00
Akinobu Mita
00d59405cf unregister_blkdev() delete redundant messages in callers
No need to warn unregister_blkdev() failure by the callers.  (The previous
patch makes unregister_blkdev() print error message in error case)

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:03 -07:00
Rafael J. Wysocki
8314418629 Freezer: make kernel threads nonfreezable by default
Currently, the freezer treats all tasks as freezable, except for the kernel
threads that explicitly set the PF_NOFREEZE flag for themselves.  This
approach is problematic, since it requires every kernel thread to either
set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't
care for the freezing of tasks at all.

It seems better to only require the kernel threads that want to or need to
be frozen to use some freezer-related code and to remove any
freezer-related code from the other (nonfreezable) kernel threads, which is
done in this patch.

The patch causes all kernel threads to be nonfreezable by default (ie.  to
have PF_NOFREEZE set by default) and introduces the set_freezable()
function that should be called by the freezable kernel threads in order to
unset PF_NOFREEZE.  It also makes all of the currently freezable kernel
threads call set_freezable(), so it shouldn't cause any (intentional)
change of behaviour to appear.  Additionally, it updates documentation to
describe the freezing of tasks more accurately.

[akpm@linux-foundation.org: build fixes]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:02 -07:00
Linus Torvalds
e030dbf91a Merge branch 'ioat-md-accel-for-linus' of git://lost.foo-projects.org/~dwillia2/git/iop
* 'ioat-md-accel-for-linus' of git://lost.foo-projects.org/~dwillia2/git/iop: (28 commits)
  ioatdma: add the unisys "i/oat" pci vendor/device id
  ARM: Add drivers/dma to arch/arm/Kconfig
  iop3xx: surface the iop3xx DMA and AAU units to the iop-adma driver
  iop13xx: surface the iop13xx adma units to the iop-adma driver
  dmaengine: driver for the iop32x, iop33x, and iop13xx raid engines
  md: remove raid5 compute_block and compute_parity5
  md: handle_stripe5 - request io processing in raid5_run_ops
  md: handle_stripe5 - add request/completion logic for async expand ops
  md: handle_stripe5 - add request/completion logic for async read ops
  md: handle_stripe5 - add request/completion logic for async check ops
  md: handle_stripe5 - add request/completion logic for async compute ops
  md: handle_stripe5 - add request/completion logic for async write ops
  md: common infrastructure for running operations with raid5_run_ops
  md: raid5_run_ops - run stripe operations outside sh->lock
  raid5: replace custom debug PRINTKs with standard pr_debug
  raid5: refactor handle_stripe5 and handle_stripe6 (v3)
  async_tx: add the async_tx api
  xor: make 'xor_blocks' a library routine for use with async_tx
  dmaengine: make clients responsible for managing channels
  dmaengine: refactor dmaengine around dma_async_tx_descriptor
  ...
2007-07-13 10:52:27 -07:00
Dan Williams
f6dff381af md: remove raid5 compute_block and compute_parity5
replaced by raid5_run_ops

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:18 -07:00
Dan Williams
830ea01673 md: handle_stripe5 - request io processing in raid5_run_ops
I/O submission requests were already handled outside of the stripe lock in
handle_stripe.  Now that handle_stripe is only tasked with finding work,
this logic belongs in raid5_run_ops.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:17 -07:00
Dan Williams
f0a50d3754 md: handle_stripe5 - add request/completion logic for async expand ops
When a stripe is being expanded bulk copying takes place to move the data
from the old stripe to the new.  Since raid5_run_ops only operates on one
stripe at a time these bulk copies are handled in-line under the stripe
lock.  In the dma offload case we poll for the completion of the operation.

After the data has been copied into the new stripe the parity needs to be
recalculated across the new disks.  We reuse the existing postxor
functionality to carry out this calculation.  By setting STRIPE_OP_POSTXOR
without setting STRIPE_OP_BIODRAIN the completion path in handle stripe
can differentiate expand operations from normal write operations.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:17 -07:00
Dan Williams
b5e98d65d3 md: handle_stripe5 - add request/completion logic for async read ops
When a read bio is attached to the stripe and the corresponding block is
marked R5_UPTODATE, then a read (biofill) operation is scheduled to copy
the data from the stripe cache to the bio buffer.  handle_stripe flags the
blocks to be operated on with the R5_Wantfill flag.  If new read requests
arrive while raid5_run_ops is running they will not be handled until
handle_stripe is scheduled to run again.

Changelog:
* cleanup to_read and to_fill accounting
* do not fail reads that have reached the cache

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:17 -07:00
Dan Williams
e89f89629b md: handle_stripe5 - add request/completion logic for async check ops
Check operations are scheduled when the array is being resynced or an
explicit 'check/repair' command was sent to the array.  Previously check
operations would destroy the parity block in the cache such that even if
parity turned out to be correct the parity block would be marked
!R5_UPTODATE at the completion of the check.  When the operation can be
carried out by a dma engine the assumption is that it can check parity as a
read-only operation.  If raid5_run_ops notices that the check was handled
by hardware it will preserve the R5_UPTODATE status of the parity disk.

When a check operation determines that the parity needs to be repaired we
reuse the existing compute block infrastructure to carry out the operation.
Repair operations imply an immediate write back of the data, so to
differentiate a repair from a normal compute operation the
STRIPE_OP_MOD_REPAIR_PD flag is added.

Changelog:
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:17 -07:00
Dan Williams
f38e12199a md: handle_stripe5 - add request/completion logic for async compute ops
handle_stripe will compute a block when a backing disk has failed, or when
it determines it can save a disk read by computing the block from all the
other up-to-date blocks.

Previously a block would be computed under the lock and subsequent logic in
handle_stripe could use the newly up-to-date block.  With the raid5_run_ops
implementation the compute operation is carried out a later time outside
the lock.  To preserve the old functionality we take advantage of the
dependency chain feature of async_tx to flag the block as R5_Wantcompute
and then let other parts of handle_stripe operate on the block as if it
were up-to-date.  raid5_run_ops guarantees that the block will be ready
before it is used in another operation.

However, this only works in cases where the compute and the dependent
operation are scheduled at the same time.  If a previous call to
handle_stripe sets the R5_Wantcompute flag there is no facility to pass the
async_tx dependency chain across successive calls to raid5_run_ops.  The
req_compute variable protects against this case.

Changelog:
* remove the req_compute BUG_ON

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:17 -07:00
Dan Williams
e33129d841 md: handle_stripe5 - add request/completion logic for async write ops
After handle_stripe5 decides whether it wants to perform a
read-modify-write, or a reconstruct write it calls
handle_write_operations5.  A read-modify-write operation will perform an
xor subtraction of the blocks marked with the R5_Wantprexor flag, copy the
new data into the stripe (biodrain) and perform a postxor operation across
all up-to-date blocks to generate the new parity.  A reconstruct write is run
when all blocks are already up-to-date in the cache so all that is needed
is a biodrain and postxor.

On the completion path STRIPE_OP_PREXOR will be set if the operation was a
read-modify-write.  The STRIPE_OP_BIODRAIN flag is used in the completion
path to differentiate write-initiated postxor operations versus
expansion-initiated postxor operations.  Completion of a write triggers i/o
to the drives.

Changelog:
* make the 'rcw' parameter to handle_write_operations5 a simple flag, Neil Brown
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:16 -07:00
Dan Williams
d84e0f10d3 md: common infrastructure for running operations with raid5_run_ops
All the handle_stripe operations that are to be transitioned to use
raid5_run_ops need a method to coherently gather work under the stripe-lock
and hand that work off to raid5_run_ops.  The 'get_stripe_work' routine
runs under the lock to read all the bits in sh->ops.pending that do not
have the corresponding bit set in sh->ops.ack.  This modified 'pending'
bitmap is then passed to raid5_run_ops for processing.

The transition from 'ack' to 'completion' does not need similar protection
as the existing release_stripe infrastructure will guarantee that
handle_stripe will run again after a completion bit is set, and
handle_stripe can tolerate a sh->ops.completed bit being set while the lock
is held.

A call to async_tx_issue_pending_all() is added to raid5d to kick the
offload engines once all pending stripe operations work has been submitted.
This enables batching of the submission and completion of operations.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:16 -07:00
Dan Williams
91c0092484 md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following
attack plan:

1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api

The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.

To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests.  In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing.  The following flags outline the
requests that handle_stripe can make of raid5_run_ops:

STRIPE_OP_BIOFILL
 - copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
 - generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
 - subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
 - copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
 - recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
 - verify that the parity is correct
STRIPE_OP_IO
 - submit i/o to the member disks (note this was already performed outside
   the stripe lock, but it made sense to add it as an operation type

The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
   operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
   sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
   new operations that were previously blocked

Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.

Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
  ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
  handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
  was absorbed (i.e. it is now implicit) by the async_tx api
* use return_io in ops_complete_biofill

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:15 -07:00
Dan Williams
45b4233caa raid5: replace custom debug PRINTKs with standard pr_debug
Replaces PRINTK with pr_debug, and kills the RAID5_DEBUG definition in
favor of the global DEBUG definition.  To get local debug messages just add
'#define DEBUG' to the top of the file.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:15 -07:00
Dan Williams
a445685647 raid5: refactor handle_stripe5 and handle_stripe6 (v3)
handle_stripe5 and handle_stripe6 have very deep logic paths handling the
various states of a stripe_head.  By introducing the 'stripe_head_state'
and 'r6_state' objects, large portions of the logic can be moved to
sub-routines.

'struct stripe_head_state' consumes all of the automatic variables that previously
stood alone in handle_stripe5,6.  'struct r6_state' contains the handle_stripe6
specific variables like p_failed and q_failed.

One of the nice side effects of the 'stripe_head_state' change is that it
allows for further reductions in code duplication between raid5 and raid6.
The following new routines are shared between raid5 and raid6:

	handle_completed_write_requests
	handle_requests_to_failed_array
	handle_stripe_expansion

Changes:
* v2: fixed 'conf->raid_disk-1' for the raid6 'handle_stripe_expansion' path
* v3: removed the unused 'dirty' field from struct stripe_head_state
* v3: coalesced open coded bi_end_io routines into return_io()

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:15 -07:00
Dan Williams
9bc89cd82d async_tx: add the async_tx api
The async_tx api provides methods for describing a chain of asynchronous
bulk memory transfers/transforms with support for inter-transactional
dependencies.  It is implemented as a dmaengine client that smooths over
the details of different hardware offload engine implementations.  Code
that is written to the api can optimize for asynchronous operation and the
api will fit the chain of operations to the available offload resources. 
 
	I imagine that any piece of ADMA hardware would register with the
	'async_*' subsystem, and a call to async_X would be routed as
	appropriate, or be run in-line. - Neil Brown

async_tx exploits the capabilities of struct dma_async_tx_descriptor to
provide an api of the following general format:

struct dma_async_tx_descriptor *
async_<operation>(..., struct dma_async_tx_descriptor *depend_tx,
			dma_async_tx_callback cb_fn, void *cb_param)
{
	struct dma_chan *chan = async_tx_find_channel(depend_tx, <operation>);
	struct dma_device *device = chan ? chan->device : NULL;
	int int_en = cb_fn ? 1 : 0;
	struct dma_async_tx_descriptor *tx = device ?
		device->device_prep_dma_<operation>(chan, len, int_en) : NULL;

	if (tx) { /* run <operation> asynchronously */
		...
		tx->tx_set_dest(addr, tx, index);
		...
		tx->tx_set_src(addr, tx, index);
		...
		async_tx_submit(chan, tx, flags, depend_tx, cb_fn, cb_param);
	} else { /* run <operation> synchronously */
		...
		<operation>
		...
		async_tx_sync_epilog(flags, depend_tx, cb_fn, cb_param);
	}

	return tx;
}

async_tx_find_channel() returns a capable channel from its pool.  The
channel pool is organized as a per-cpu array of channel pointers.  The
async_tx_rebalance() routine is tasked with managing these arrays.  In the
uniprocessor case async_tx_rebalance() tries to spread responsibility
evenly over channels of similar capabilities.  For example if there are two
copy+xor channels, one will handle copy operations and the other will
handle xor.  In the SMP case async_tx_rebalance() attempts to spread the
operations evenly over the cpus, e.g. cpu0 gets copy channel0 and xor
channel0 while cpu1 gets copy channel 1 and xor channel 1.  When a
dependency is specified async_tx_find_channel defaults to keeping the
operation on the same channel.  A xor->copy->xor chain will stay on one
channel if it supports both operation types, otherwise the transaction will
transition between a copy and a xor resource.

Currently the raid5 implementation in the MD raid456 driver has been
converted to the async_tx api.  A driver for the offload engines on the
Intel Xscale series of I/O processors, iop-adma, is provided in a later
commit.  With the iop-adma driver and async_tx, raid456 is able to offload
copy, xor, and xor-zero-sum operations to hardware engines.
 
On iop342 tiobench showed higher throughput for sequential writes (20 - 30%
improvement) and sequential reads to a degraded array (40 - 55%
improvement).  For the other cases performance was roughly equal, +/- a few
percentage points.  On a x86-smp platform the performance of the async_tx
implementation (in synchronous mode) was also +/- a few percentage points
of the original implementation.  According to 'top' on iop342 CPU
utilization drops from ~50% to ~15% during a 'resync' while the speed
according to /proc/mdstat doubles from ~25 MB/s to ~50 MB/s.
 
The tiobench command line used for testing was: tiobench --size 2048
--block 4096 --block 131072 --dir /mnt/raid --numruns 5
* iop342 had 1GB of memory available

Details:
* if CONFIG_DMA_ENGINE=n the asynchronous path is compiled away by making
  async_tx_find_channel a static inline routine that always returns NULL
* when a callback is specified for a given transaction an interrupt will
  fire at operation completion time and the callback will occur in a
  tasklet.  if the the channel does not support interrupts then a live
  polling wait will be performed
* the api is written as a dmaengine client that requests all available
  channels
* In support of dependencies the api implicitly schedules channel-switch
  interrupts.  The interrupt triggers the cleanup tasklet which causes
  pending operations to be scheduled on the next channel
* Xor engines treat an xor destination address differently than a software
  xor routine.  To the software routine the destination address is an implied
  source, whereas engines treat it as a write-only destination.  This patch
  modifies the xor_blocks routine to take a an explicit destination address
  to mirror the hardware.

Changelog:
* fixed a leftover debug print
* don't allow callbacks in async_interrupt_cond
* fixed xor_block changes
* fixed usage of ASYNC_TX_XOR_DROP_DEST
* drop dma mapping methods, suggested by Chris Leech
* printk warning fixups from Andrew Morton
* don't use inline in C files, Adrian Bunk
* select the API when MD is enabled
* BUG_ON xor source counts <= 1
* implicitly handle hardware concerns like channel switching and
  interrupts, Neil Brown
* remove the per operation type list, and distribute operation capabilities
  evenly amongst the available channels
* simplify async_tx_find_channel to optimize the fast path
* introduce the channel_table_initialized flag to prevent early calls to
  the api
* reorganize the code to mimic crypto
* include mm.h as not all archs include it in dma-mapping.h
* make the Kconfig options non-user visible, Adrian Bunk
* move async_tx under crypto since it is meant as 'core' functionality, and
  the two may share algorithms in the future
* move large inline functions into c files
* checkpatch.pl fixes
* gpl v2 only correction

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-By: NeilBrown <neilb@suse.de>
2007-07-13 08:06:14 -07:00
Dan Williams
685784aaf3 xor: make 'xor_blocks' a library routine for use with async_tx
The async_tx api tries to use a dma engine for an operation, but will fall
back to an optimized software routine otherwise.  Xor support is
implemented using the raid5 xor routines.  For organizational purposes this
routine is moved to a common area.

The following fixes are also made:
* rename xor_block => xor_blocks, suggested by Adrian Bunk
* ensure that xor.o initializes before md.o in the built-in case
* checkpatch.pl fixes
* mark calibrate_xor_blocks __init, Adrian Bunk

Cc: Adrian Bunk <bunk@stusta.de>
Cc: NeilBrown <neilb@suse.de>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2007-07-13 08:06:14 -07:00
Chandra Seetharaman
dd172d72ad dm mpath: rdac
This patch supports LSI/Engenio devices in RDAC mode. Like dm-emc
it requires userspace support. In your multipath.conf file you must have:

path_checker            rdac
hardware_handler        "1 rdac"
prio_callout		"/sbin/mpath_prio_tpc /dev/%n"

And you also then must have a updated multipath tools release which
has rdac support.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-12 15:01:23 -07:00
Jonathan Brassow
fc1ff9588a dm raid1: handle log failure
When writing to a mirror, the log must be updated first.  Failure
to update the log could result in the log not properly reflecting
the state of the mirror if the machine should crash.

We change the return type of the rh_flush function to give us
the ability to check if a log write was successful.  If the
log write was unsuccessful, we fail the writes to avoid the
case where the log does not properly reflect the state of the
mirror.

A follow-up patch - which is dependent on the ability to
requeue I/O's to core device-mapper - will requeue the I/O's
for retry (allowing the mirror to be reconfigured.)

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-12 15:01:08 -07:00
Jonathan Brassow
f44db678ed dm raid1: handle resync failures
Device-mapper mirroring currently takes a best effort approach to
recovery - failures during mirror synchronization are completely ignored.
This means that regions are marked 'in-sync' and 'clean' and removed
from the hash list.  Future reads and writes that query the region
will incorrectly interpret the region as in-sync.

This patch handles failures during the recovery process.  If a failure
occurs, the region is marked as 'not-in-sync' (aka RH_NOSYNC) and added
to a new list 'failed_recovered_regions'.

Regions on the 'failed_recovered_regions' list are not marked as 'clean'
upon removal from the list.  Furthermore, if the DM_RAID1_HANDLE_ERRORS
flag is set, the region is marked as 'not-in-sync'.  This action prevents
any future read-balancing from choosing an invalid device because of the
'not-in-sync' status.

If "handle_errors" is not specified when creating a mirror (leaving the
DM_RAID1_HANDLE_ERRORS flag unset), failures will be ignored exactly as they
would be without this patch.  This is to preserve backwards compatibility with
user-space tools, such as 'pvmove'.  However, since future read-balancing
policies will rely on the correct sync status of a region, a user must choose
"handle_errors" when using read-balancing.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-12 15:01:08 -07:00
Jonathan Brassow
d0d444c7d4 dm: add ratelimit logging macros
Add ratelimit extension to dm logging macros.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-12 15:01:08 -07:00
Stefan Bader
07a83c47cf dm: disable barriers
This patch causes device-mapper to reject any barrier requests.  This is done
since most of the targets won't handle this correctly anyway.  So until the
situation improves it is better to reject these requests at the first place.
Since barrier requests won't get to the targets, the checks there can be
removed.

Cc: stable@kernel.org
Signed-off-by: Stefan Bader <shbader@de.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-12 15:01:08 -07:00
Jonathan Brassow
943317efdb dm raid1: clear region outside spinlock
A clear_region function is permitted to block (in practice, rare) but gets
called in rh_update_states() with a spinlock held.

The bits being marked and cleared by the above functions are used
to update the on-disk log, but are never read directly.  We can
perform these operations outside the spinlock since the
bits are only changed within one thread viz.
   - mark_region in rh_inc()
   - clear_region in rh_update_states().

So, we grab the clean_regions list items via list_splice() within the
spinlock and defer clear_region() until we iterate over the list for
deletion - similar to how the recovered_regions list is already handled.
We then move the flush() call down to ensure it encapsulates the changes
which are done by the later calls to clear_region().

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-12 15:01:08 -07:00
Milan Broz
0764147b11 dm snapshot: permit invalid activation
Allow invalid snapshots to be activated instead of failing.

This allows userspace to reinstate any given snapshot state - for
example after an unscheduled reboot - and clean up the invalid snapshot
at its leisure.

Cc: stable@kernel.org
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-12 15:01:08 -07:00
Milan Broz
fcac03abd3 dm snapshot: fix invalidation deadlock
Process persistent exception store metadata IOs in a separate thread.

A snapshot may become invalid while inside generic_make_request().
A synchronous write is then needed to update the metadata while still
inside that function.  Since the introduction of
md-dm-reduce-stack-usage-with-stacked-block-devices.patch this has to
be performed by a separate thread to avoid deadlock.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-12 15:01:08 -07:00
Jun'ichi Nomura
596f138eed dm io: fix panic on large request
bio_alloc_bioset() will return NULL if 'num_vecs' is too large.
Use bio_get_nr_vecs() to get estimation of maximum number.

Cc: stable@kernel.org
Signed-off-by: "Jun'ichi Nomura" <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-12 15:01:08 -07:00
Milan Broz
c95bc206da dm raid1: fix status
Fix mirror status line broken in dm-log-report-fault-status.patch:
  - space missing between two words
  - placeholder ("0") required for compatibility with a subsequent patch
  - incorrect offset parameter

Cc: stable@kernel.org
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-12 15:01:08 -07:00
Alasdair G Kergon
0cd3312434 dm: remove duplicate module name from error msgs
Remove explicit module name from messages as the macro now includes it
automatically.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-12 15:01:08 -07:00
Alasdair G Kergon
ac818646d4 dm delay: cleanup
Use setup_timer().
Replace semaphore with mutex.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-12 15:01:08 -07:00
Alasdair G Kergon
028867ac28 dm: use kmem_cache macro
Use new KMEM_CACHE() macro and make the newly-exposed structure names more
meaningful.  Also remove some superfluous casts and inlines (let a modern
compiler be the judge).

Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-12 15:01:08 -07:00
Alasdair G Kergon
79e15ae424 dm: bio_list prefetch removal
Remove dubious prefetch from bio_list_for_each() macro.

Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-12 15:01:08 -07:00
Mike Accetta
ed45666271 md: fix bug in error handling during raid1 repair
If raid1/repair (which reads all block and fixes any differences it finds)
hits a read error, it doesn't reset the bio for writing before writing
correct data back, so the read error isn't fixed, and the device probably
gets a zero-length write which it might complain about.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-16 13:16:15 -07:00
NeilBrown
af03b8e4e8 md: fix two raid10 bugs
1/ When resyncing a degraded raid10 which has more than 2 copies of each block,
  garbage can get synced on top of good data.

2/ We round the wrong way in part of the device size calculation, which
  can cause confusion.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-16 13:16:15 -07:00
NeilBrown
a778b73ff7 md: fix bug with linear hot-add and elsewhere
Adding a drive to a linear array seems to have stopped working, due to changes
elsewhere in md, and insufficient ongoing testing...

So the patch to make linear hot-add work in the first place introduced a
subtle bug elsewhere that interracts poorly with older version of mdadm.

This fixes it all up.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:14 -07:00
NeilBrown
ab6085c795 md: don't write more than is required of the last page of a bitmap
It is possible that real data or metadata follows the bitmap without full page
alignment.

So limit the last write to be only the required number of bytes, rounded up to
the hard sector size of the device.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:14 -07:00
NeilBrown
787f17feb2 md: avoid overflow in raid0 calculation with large components
If a raid0 has a component device larger than 4TB, and is accessed on a 32bit
machines, then as 'chunk' is unsigned long,

   chunk << chunksize_bits

can overflow (this can be as high as the size of the device in KB).  chunk
itself will not overflow (without triggering a BUG).

So change 'chunk' to be 'sector_t, and get rid of the 'BUG' as it becomes
impossible to hit.

Cc: "Jeff Zheng" <Jeff.Zheng@endace.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:14 -07:00
NeilBrown
435b71be20 md: improve the is_mddev_idle test
During a 'resync' or similar activity, md checks if the devices in the
array are otherwise active and winds back resync activity when they are.
This test in done in is_mddev_idle, and it is somewhat fragile - it
sometimes thinks there is non-sync io when there isn't.

The test compares the total sectors of io (disk_stat_read) with the sectors
of resync io (disk->sync_io).  This has problems because total sectors gets
updated when a request completes, while resync io gets updated when the
request is submitted.  The time difference can cause large differenced
between the two which do not actually imply non-resync activity.  The test
currently allows for some fuzz (+/- 4096) but there are some cases when it
is not enough.

The test currently looks for any (non-fuzz) difference, either positive or
negative.  This clearly is not needed.  Any non-sync activity will cause
the total sectors to grow faster than the sync_io count (never slower) so
we only need to look for a positive differences.

If we do this then the amount of in-flight sync io will never cause the
appearance of non-sync IO.  Once enough non-sync IO to worry about starts
happening, resync will be slowed down and the measurements will thus be
more precise (as there is less in-flight) and control of resync will still
be suitably responsive.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:37 -07:00
NeilBrown
dd00a99e7a md: avoid a possibility that a read error can wrongly propagate through md/raid1 to a filesystem.
When a raid1 has only one working drive, we want read error to propagate up
to the filesystem as there is no point failing the last drive in an array.

Currently the code perform this check is racy.  If a write and a read a
both submitted to a device on a 2-drive raid1, and the write fails followed
by the read failing, the read will see that there is only one working drive
and will pass the failure up, even though the one working drive is actually
the *other* one.

So, tighten up the locking.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-10 09:26:53 -07:00
Linus Torvalds
44ce6294d0 Revert "md: improve partition detection in md array"
This reverts commit 5b479c91da.

Quoth Neil Brown:

  "It causes an oops when auto-detecting raid arrays, and it doesn't
   seem easy to fix.

   The array may not be 'open' when do_md_run is called, so
   bdev->bd_disk might be NULL, so bd_set_size can oops.

   This whole approach of opening an md device before it has been
   assembled just seems to get more and more painful.  I think I'm going
   to have to come up with something clever to provide both backward
   comparability with usage expectation, and sane integration into the
   rest of the kernel."

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 18:51:36 -07:00
NeilBrown
5b479c91da md: improve partition detection in md array
md currently uses ->media_changed to make sure rescan_partitions
is call on md array after they are assembled.

However that doesn't happen until the array is opened, which is later
than some people would like.

So use blkdev_ioctl to do the rescan immediately that the
array has been assembled.

This means we can remove all the ->change infrastructure as it was only used
to trigger a partition rescan.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:57 -07:00
NeilBrown
08a02ecd28 md: allow reshape_position for md arrays to be set via sysfs
"reshape_position" records how much progress has been made on a "reshape"
(adding drives, changing layout or chunksize).

When it is set, the number of drives, layout and chunksize can have
two possible values, an old an a new.

So allow these different values to be visible, and allow both old and new to
be set: Set the old ones first, then the reshape_position, then the new
values.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:57 -07:00
NeilBrown
42b9bebe3f md: remove the slash from the name of a kmem_cache used by raid5
SLUB doesn't like slashes as it wants to use the cache name as the name of a
directory (or symlink) in sysfs.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:57 -07:00
NeilBrown
4d167f0937 md: stop using csum_partial for checksum calculation in md
If CONFIG_NET is not selected, csum_partial is not exported, so md.ko cannot
use it.  We shouldn't really be using csum_partial anyway as it is an
internal-to-networking interface.

So replace it with C code to do the same thing.  Speed is not crucial here, so
something simple and correct is best.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:57 -07:00
NeilBrown
e11e93facc md: move test for whether level supports bitmap to correct place
We need to check for internal-consistency of superblock in load_super.
validate_super is for inter-device consistency.

With the test in the wrong place, a badly created array will confuse md rather
an produce sensible errors.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:57 -07:00
Martin Peschke
c3f94b40e1 md: cleanup: use seq_release_private() where appropriate
We can save some lines of code by using seq_release_private().

Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:57 -07:00
Ahmed S. Darwish
50511da3da drivers/md.c: Use ARRAY_SIZE macro when appropriate
Use ARRAY_SIZE macro already defined in kernel.h

Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:57 -07:00
Jonathan Brassow
ba8b45cea5 dm log: fix resume failed log device
This patch removes the possibility of having uninitialized log state if the
log device has failed.

When a mirror resumes operation, it calls 'resume' on the logging module.  If
disk based logging is being used, the log device is read to fill in the log
state.  If the log device has failed, we cannot simply return, because this
would leave the in-memory log state uninitialized.  Instead, we assume all
regions are out-of-sync and reset the log state.  Failure to do this could
result in the logging code reporting a region as in-sync, even though it
isn't; which could result in a corrupted mirror.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:48 -07:00
Jonathan Brassow
b997b82d26 dm raid1: switch rh_in_sync to blocking in do_reads
The call to rh_in_sync() in do_reads() should be allowed to block.  It is in
the mirror worker thread which already permits blocking operations.  This will
be needed to support clustered mirroring which will perform network
operations.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:48 -07:00
Jonathan Brassow
f5353cd7c9 dm raid1: fix to commit pending clear region requests
With the code as it is, it is possible for oustanding clear region requests
never to get flushed when a mirror is deactivated or suspended.  This means
there will always be some resync work required when a mirror is activated,
even though it may very well be in-sync.

Always requesting the flush doesn't hurt us.  This is because the log tracks
whether any changes occurred and, if not, no flush is performed.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:48 -07:00
Heinz Mauelshagen
26b9f22870 dm: delay target
New device-mapper target that can delay I/O (for testing).  Reads can be
separated from writes, redirected to different underlying devices and delayed
by differing amounts of time.

Signed-off-by: Heinz Mauelshagen <mauelshagen@redhat.com>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:47 -07:00
Heinz Mauelshagen
0ba699347e dm: bio list helpers
More bio_list helper functions for new targets (including dm-delay and
dm-loop) to manipulate lists of bios.

Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Bryn Reeves <breeves@redhat.com>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:47 -07:00
Milan Broz
bf17ce3a60 dm io: remove old interface
Remove old dm-io interface.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:47 -07:00
Milan Broz
88be163abb dm raid1: update dm io interface
This patch ports dm-raid1.c to the new dm-io interface.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:47 -07:00
Heinz Mauelshagen
5d234d1e03 dm log: update dm io interface
This patch ports dm-log.c to the new dm-io interface in order to make it
scalable to have a large number of persistent dirty logs active in parallel.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Cc: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:47 -07:00
Heinz Mauelshagen
6cca1e7af5 dm exception store: update dm io interface
This patch ports dm-exception-store.c to the new, scalable dm_io() interface.

It replaces dm_io_get()/dm_io_put() by
dm_io_client_create()/dm_io_client_destroy() calls and
dm_io_sync_vm() by dm_io() to achive this.

Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Cc: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:47 -07:00
Milan Broz
373a392bd7 dm kcopyd: update dm io interface
This patch ports kcopyd.c to the new, scalable dm_io() interface.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:47 -07:00
Heinz Mauelshagen
c8b03afe3d dm io: new interface
Add a new API to dm-io.c that uses a private mempool and bio_set for each
client.

The new functions to use are dm_io_client_create(), dm_io_client_destroy(),
dm_io_client_resize() and dm_io().

Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Milan Broz <mbroz@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:47 -07:00
Heinz Mauelshagen
891ce20701 dm io: prepare for new interface
Introduce struct dm_io_client to prepare for per-client mempools and bio_sets.

Temporary functions bios() and io_pool() choose between the per-client
structures and the global ones so the old and new interfaces can co-exist.

Make error_bits optional.

Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Milan Broz <mbroz@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:47 -07:00
Heinz Mauelshagen
c897feb3dc dm io: delay dec_count
Delay decrementing the 'struct io' reference count until after the bio has
been freed so that a bio destructor function may reference it.  Required by a
later patch.

Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Milan Broz <mbroz@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:47 -07:00
Jonathan E Brassow
a8e6afa236 dm raid1: add handle_errors feature flag
This patch adds the ability to specify desired features in the mirror
constructor/mapping table.

The first feature of interest is "handle_errors".  Currently, mirroring will
ignore any I/O errors from the devices.  Subsequent patches will check for
this flag and handle the errors.  If flag/feature is not present, mirror will
do nothing - maintaining backwards compatibility.

Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:47 -07:00
Jonathan E Brassow
315dcc226f dm log: report fault status
This patch reports the status of the log device so that userspace can detect
the error and take appropriate action.

Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:47 -07:00
Jonathan E Brassow
01d03a660e dm log: fault detection
This patch gives the disk logging code the ability to store the fact that an
error occured on the log device.  In addition, an event is raised when an
error is encountered during I/O to the log device.

Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:47 -07:00
Mike Anderson
2cd54d9bed dm: allow offline devices
Allow check_device_area to succeed if a device has an i_size of zero.  This
addresses an issue seen on DASD devices setting up a multipath table for paths
in online and offline state.

Signed-off-by: Mike Anderson <andmike@us.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:47 -07:00
Edward Goggin
79eb885c96 dm mpath: log device name
Make the mapped device structure accessible to hardware handlers so error
messages can include the device name.

Signed-off-by: Edward Goggin <egoggin@emc.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:46 -07:00
Ludwig Nussel
46b477306a dm crypt: add null iv
Add a new IV generation method 'null' to read old filesystem images created
with SuSE's loop_fish2 module.

Signed-off-by: Ludwig Nussel <ludwig.nussel@suse.de>
Acked-By: Christophe Saout <christophe@saout.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:46 -07:00
Olaf Kirch
f97380bcad dm crypt: use smaller bvecs in clones
Allocate smaller clones

With the previous dm-crypt fixes, there is no need for the clone bios to have
the same bvec size as the original - we just need to make them big enough for
the remaining number of pages.  The only requirement is that we clear the
"out" index in convert_context, so that crypt_convert starts storing data at
the right position within the clone bio.

Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:46 -07:00
Olaf Kirch
2f9941b6c5 dm crypt: fix remove first_clone
Get rid of first_clone in dm-crypt

This gets rid of first_clone, which is not really needed.  Apparently, cloned
bios used to share their bvec some time way in the past - this is no longer
the case.  Contrarily, this even hurts us if we try to create a clone off
first_clone after it has completed, and crypt_endio has destroyed its bvec.

Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:46 -07:00
Olaf Kirch
98221eb757 dm crypt: fix avoid cloned bio ref after free
Do not access the bio after generic_make_request

We should never access a bio after generic_make_request - there's no guarantee
it still exists.

Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:46 -07:00
Olaf Kirch
027581f351 dm crypt: fix call to clone_init
Call clone_init early

We need to call clone_init as early as possible - at least before call
bio_put(clone) in any error path.  Otherwise, the destructor will try to
dereference bi_private, which may still be NULL.

Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:46 -07:00
Milan Broz
9c89f8be1a dm crypt: disable barriers
Disable barriers in dm-crypt because of current workqueue processing can
reorder requests.

This must be addresed later but for now disabling barriers is needed to
prevent data corruption.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:46 -07:00
Holger Smolinski
6ad36fe2b4 dm raid1: one kmirrord per mirror
This patch replaces the single instance of kmirrord by one instance per mirror
set.  This change is required to avoid a deadlock in kmirrord when the
persistent dirty log of a mirror itself resides on a mirror.  The single
instance of kmirrord then issues a sync write to the dirty log in write_bits
which gets deferred to kmirrord itself later in the call chain.  But kmirrord
never does the deferred work because it is still waiting for the sync
write_bits.

_mirror_sets is removed as it no longer needed, and we always flush the
workqueue before destroying it to ensure all work is complete before
destroying it.

Signed-off-by: Holger Smolinski <smolinski@de.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:46 -07:00
Mark Fasheh
ef51c97623 Remove do_sync_file_range()
Remove do_sync_file_range() and convert callers to just use
do_sync_mapping_range().

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:04 -07:00
Peter Zijlstra
f98393a64c mm: remove destroy_dirty_buffers from invalidate_bdev()
Remove the destroy_dirty_buffers argument from invalidate_bdev(), it hasn't
been used in 6 years (so akpm says).

find * -name \*.[ch] | xargs grep -l invalidate_bdev |
while read file; do
	quilt add $file;
	sed -ie 's/invalidate_bdev(\([^,]*\),[^)]*)/invalidate_bdev(\1)/g' $file;
done

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:55 -07:00
Jens Axboe
5972511b77 [BLOCK] Don't pin lots of memory in mempools
Currently we scale the mempool sizes depending on memory installed
in the machine, except for the bio pool itself which sits at a fixed
256 entry pre-allocation.

There's really no point in "optimizing" this OOM path, we just need
enough preallocated to make progress. A single unit is enough, lets
scale it down to 2 just to be on the safe side.

This patch saves ~150kb of pinned kernel memory on a 32-bit box.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30 09:08:17 +02:00
Neil Brown
505fa2c4a2 [PATCH] md: fix calculation for size of filemap_attr array in md/bitmap
If 'num_pages' were ever 1 more than a multiple of 8 (32bit platforms)
or of 16 (64 bit platforms).  filemap_attr would be allocated one
'unsigned long' shorter than required.  We need a round-up in there.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-12 15:31:42 -07:00
NeilBrown
5792a2856a [PATCH] md: avoid a deadlock when removing a device from an md array via sysfs
A device can be removed from an md array via e.g.
  echo remove > /sys/block/md3/md/dev-sde/state

This will try to remove the 'dev-sde' subtree which will deadlock
since
  commit e7b0d26a86

With this patch we run the kobject_del via schedule_work so as to
avoid the deadlock.

Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-04 21:12:47 -07:00
NeilBrown
5e55e2f5fc [PATCH] md: convert compile time warnings into runtime warnings
...  still not sure why we need this ....

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-27 09:05:15 -07:00
NeilBrown
041ae52e26 [PATCH] md: clear the congested_fn when stopping a raid5
If this mddev and queue got reused for another array that doesn't register a
congested_fn, this function would get called incorretly.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-27 09:05:14 -07:00
NeilBrown
3d37890baa [PATCH] md: allow raid4 arrays to be reshaped
All that is missing the the function pointers in raid4_pers.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-27 09:05:14 -07:00
Andy Isaacson
bed31ed9e1 [PATCH] fix read past end of array in md/linear.c
When iterating through an array, one must be careful to test one's index
variable rather than another similarly-named variable.

The loop will read off the end of conf->disks[] in the following
(pathological) case:

  % dd bs=1 seek=840716287 if=/dev/zero of=d1 count=1
  % for i in 2 3 4; do dd if=/dev/zero of=d$i bs=1k count=$(($i+150)); done
  % ./vmlinux ubd0=root ubd1=d1 ubd2=d2 ubd3=d3 ubd4=d4
  # mdadm -C /dev/md0 --level=linear --raid-devices=4 /dev/ubd[1234]

adding some printks, I saw this:

  [42949374.960000] hash_spacing = 821120
  [42949374.960000] cnt          = 4
  [42949374.960000] min_spacing  = 801
  [42949374.960000] j=0 size=820928 sz=820928
  [42949374.960000] i=0 sz=820928 hash_spacing=820928
  [42949374.960000] j=1 size=64 sz=64
  [42949374.960000] j=2 size=64 sz=128
  [42949374.960000] j=3 size=64 sz=192
  [42949374.960000] j=4 size=1515870810 sz=1515871002

Cc: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Neil Brown <neilb@cse.unsw.edu.au>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16 19:25:03 -07:00
NeilBrown
6d3baf2eb8 [PATCH] md: fix for raid6 reshape
Recent patch for raid6 reshape had a change missing that showed up in
subsequent review.

Many places in the raid5 code used "conf->raid_disks-1" to mean "number of
data disks".  With raid6 that had to be changed to "conf->raid_disk -
conf->max_degraded" or similar.  One place was missed.

This bug means that if a raid6 reshape were aborted in the middle the
recorded position would be wrong.  On restart it would either fail (as the
position wasn't on an appropriate boundary) or would leave a section of the
array unreshaped, causing data corruption.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-05 07:57:53 -08:00
NeilBrown
f416885ef4 [PATCH] md: add support for reshape of a raid6
i.e. one or more drives can be added and the array will re-stripe
while on-line.

Most of the interesting work was already done for raid5.  This just extends it
to raid6.

mdadm newer than 2.6 is needed for complete safety, however any version of
mdadm which support raid5 reshape will do a good enough job in almost all
cases (an 'echo repair > /sys/block/mdX/md/sync_action' is recommended after a
reshape that was aborted and had to be restarted with an such a version of
mdadm).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:36 -08:00
NeilBrown
b4c4c7b809 [PATCH] md: restart a (raid5) reshape that has been aborted due to a read/write error
An error always aborts any resync/recovery/reshape on the understanding that
it will immediately be restarted if that still makes sense.  However a reshape
currently doesn't get restarted.  With this patch it does.

To avoid restarting when it is not possible to do work, we call into the
personality to check that a reshape is ok, and strengthen raid5_check_reshape
to fail if there are too many failed devices.

We also break some code out into a separate function: remove_and_add_spares as
the indent level for that code was getting crazy.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:36 -08:00
NeilBrown
d1b5380c7f [PATCH] md: clean out unplug and other queue function on md shutdown
The mddev and queue might be used for another array which does not set these,
so they need to be cleared.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:36 -08:00
NeilBrown
7dd5e7c3db [PATCH] md: move warning about creating a raid array on partitions of the one device
md tries to warn the user if they e.g.  create a raid1 using two partitions of
the same device, as this does not provide true redundancy.

However it also warns if a raid0 is created like this, and there is nothing
wrong with that.

At the place where the warning is currently printer, we don't necessarily know
what level the array will be, so move the warning from the point where the
device is added to the point where the array is started.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:36 -08:00
H. Peter Anvin
a723406c4a [PATCH] md: RAID6: clean up CPUID and FPU enter/exit code
- Use kernel_fpu_begin() and kernel_fpu_end()
- Use boot_cpu_has() for feature testing even in userspace

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:36 -08:00
NeilBrown
64a742bc61 [PATCH] md: fix raid10 recovery problem.
There are two errors that can lead to recovery problems with raid10
when used in 'far' more (not the default).

Due to a '>' instead of '>=' the wrong block is located which would result in
garbage being written to some random location, quite possible outside the
range of the device, causing the newly reconstructed device to fail.

The device size calculation had some rounding errors (it didn't round when it
should) and so recovery would go a few blocks too far which would again cause
a write to a random block address and probably a device error.

The code for working with device sizes was fairly confused and spread out, so
this has been tided up a bit.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:36 -08:00
Eric W. Biederman
0b4d414714 [PATCH] sysctl: remove insert_at_head from register_sysctl
The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name.  Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:09:59 -08:00
Eric W. Biederman
ff1d28efc5 [PATCH] sysctl: md: remove unnecessary insert_at_head flag
The sysctls used by the md driver are have unique binary numbers so remove the
insert_at_head flag as it serves no useful purpose.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:09:55 -08:00
Arjan van de Ven
fa027c2a0a [PATCH] mark struct file_operations const 4
Many struct file_operations in the kernel can be "const".  Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data.  In addition it'll catch accidental writes at compile time to
these shared resources.

[akpm@sdl.org: dvb fix]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:45 -08:00
Andrew Morton
fc0ecff698 [PATCH] remove invalidate_inode_pages()
Convert all calls to invalidate_inode_pages() into open-coded calls to
invalidate_mapping_pages().

Leave the invalidate_inode_pages() wrapper in place for now, marked as
deprecated.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:31 -08:00
Neil Brown
da6e1a32fb [PATCH] md: avoid possible BUG_ON in md bitmap handling
md/bitmap tracks how many active write requests are pending on blocks
associated with each bit in the bitmap, so that it knows when it can clear
the bit (when count hits zero).

The counter has 14 bits of space, so if there are ever more than 16383, we
cannot cope.

Currently the code just calles BUG_ON as "all" drivers have request queue
limits much smaller than this.

However is seems that some don't.  Apparently some multipath configurations
can allow more than 16383 concurrent write requests.

So, in this unlikely situation, instead of calling BUG_ON we now wait
for the count to drop down a bit.  This requires a new wait_queue_head,
some waiting code, and a wakeup call.

Tested by limiting the counter to 20 instead of 16383 (writes go a lot slower
in that case...).

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-09 09:25:47 -08:00
Neil Brown
387bb17374 [PATCH] md: fix various bugs with aligned reads in RAID5
It is possible for raid5 to be sent a bio that is too big for an underlying
device.  So if it is a READ that we pass stright down to a device, it will
fail and confuse RAID5.

So in 'chunk_aligned_read' we check that the bio fits within the parameters
for the target device and if it doesn't fit, fall back on reading through
the stripe cache and making lots of one-page requests.

Note that this is the earliest time we can check against the device because
earlier we don't have a lock on the device, so it could change underneath
us.

Also, the code for handling a retry through the cache when a read fails has
not been tested and was badly broken.  This patch fixes that code.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: "Kai" <epimetreus@fastmail.fm>
Cc: <stable@suse.de>
Cc: <org@suse.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-09 09:25:46 -08:00
NeilBrown
c20086de93 [PATCH] md: remove unnecessary printk when raid5 gets an unaligned read.
raid5_mergeable_bvec tries to ensure that raid5 never sees a read request
that does not fit within just one chunk.  However as we must always accept
a single-page read, that is not always possible.

So when "in_chunk_boundary" fails, it might be unusual, but it is not a
problem and printing a message every time is a bad idea.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26 13:51:00 -08:00
NeilBrown
2a2275d630 [PATCH] md: fix potential memalloc deadlock in md
If a GFP_KERNEL allocation is attempted in md while the mddev_lock is held,
it is possible for a deadlock to eventuate.

This happens if the array was marked 'clean', and the memalloc triggers a
write-out to the md device.

For the writeout to succeed, the array must be marked 'dirty', and that
requires getting the mddev_lock.

So, before attempting a GFP_KERNEL allocation while holding the lock, make
sure the array is marked 'dirty' (unless it is currently read-only).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26 13:51:00 -08:00
Jun'ichi Nomura
bfa152fa5e [PATCH] dm-multipath: fix stall on noflush suspend/resume
Allow noflush suspend/resume of device-mapper device only for the case
where the device size is unchanged.

Otherwise, dm-multipath devices can stall when resumed if noflush was used
when suspending them, all paths have failed and queue_if_no_path is set.

Explanation:
 1. Something is doing fsync() on the block dev,
    holding inode->i_sem
 2. The fsync write is blocked by all-paths-down and queue_if_no_path
 3. Someone requests to suspend the dm device with noflush.
    Pending writes are left in queue.
 4. In the middle of dm_resume(), __bind() tries to get
    inode->i_sem to do __set_size() and waits forever.

'noflush suspend' is a new device-mapper feature introduced in
early 2.6.20. So I hope the fix being included before 2.6.20 is
released.

Example of reproducer:
 1. Create a multipath device by dmsetup
 2. Fail all paths during mkfs
 3. Do dmsetup suspend --noflush and load new map with healthy paths
 4. Do dmsetup resume

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26 13:51:00 -08:00
NeilBrown
f49d5e62d9 [PATCH] md: avoid reading past the end of a bitmap file
In most cases we check the size of the bitmap file before reading data from
it.  However when reading the superblock, we always read the first PAGE_SIZE
bytes, which might not always be appropriate.  So limit that read to the size
of the file if appropriate.

Also, we get the count of available bytes wrong in one place, so that too can
read past the end of the file.

Cc: "yang yin" <yinyang801120@gmail.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26 13:50:59 -08:00
NeilBrown
1031be7a5f [PATCH] md: make sure the events count in an md array never returns to zero
Now that we sometimes step the array events count backwards (when
transitioning dirty->clean where nothing else interesting has happened - so
that we don't need to write to spares all the time), it is possible for the
event count to return to zero, which is potentially confusing and triggers and
MD_BUG.

We could possibly remove the MD_BUG, but is just as easy, and probably safer,
to make sure we never return to zero.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26 13:50:59 -08:00
NeilBrown
3eda22d19b [PATCH] md: make 'repair' actually work for raid1
When 'repair' finds a block that is different one the various parts of the
mirror.  it is meant to write a chosen good version to the others.  However it
currently writes out the original data to each.  The memcpy to make all the
data the same is missing.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26 13:50:59 -08:00
Lars Ellenberg
e3881a6816 [PATCH] md: pass down BIO_RW_SYNC in raid{1,10}
md raidX make_request functions strip off the BIO_RW_SYNC flag, thus
introducing additional latency.

Fixing this in raid1 and raid10 seems to be straightforward enough.

For our particular usage case in DRBD, passing this flag improved some
initialization time from ~5 minutes to ~5 seconds.

Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Lars Ellenberg <lars@linbit.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-11 18:18:21 -08:00
NeilBrown
3f9d7b0d81 [PATCH] md: fix a few problems with the interface (sysfs and ioctl) to md
While developing more functionality in mdadm I found some bugs in md...

- When we remove a device from an inactive array (write 'remove' to
  the 'state' sysfs file - see 'state_store') would should not
  update the superblock information - as we may not have
  read and processed it all properly yet.

- initialise all raid_disk entries to '-1' else the 'slot sysfs file
  will claim '0' for all devices in an array before the array is
  started.

- all '\n' not to be present at the end of words written to
  sysfs files

- when we use SET_ARRAY_INFO to set the md metadata version,
  set the flag to say that there is persistant metadata.

- allow GET_BITMAP_FILE to be called on an array that hasn't
  been started yet.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:51 -08:00
NeilBrown
802ba064c4 [PATCH] md: Don't assume that READ==0 and WRITE==1 - use the names explicitly
Thanks Jens for alerting me to this.

Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: <raziebe@gmail.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:48 -08:00
Herbert Xu
3263263f70 [CRYPTO] dm-crypt: Select CRYPTO_CBC
As CBC is the default chaining method for cryptoloop, we should select
it from cryptoloop to ease the transition.  Spotted by Rene Herman.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:18:57 -08:00
NeilBrown
1757128438 [PATCH] md: assorted md and raid1 one-liners
Fix few bugs that meant that:
  - superblocks weren't alway written at exactly the right time (this
    could show up if the array was not written to - writting to the array
    causes lots of superblock updates and so hides these errors).

  - restarting device recovery after a clean shutdown (version-1 metadata
    only) didn't work as intended (or at all).

1/ Ensure superblock is updated when a new device is added.
2/ Remove an inappropriate test on MD_RECOVERY_SYNC in md_do_sync.
   The body of this if takes one of two branches depending on whether
   MD_RECOVERY_SYNC is set, so testing it in the clause of the if
   is wrong.
3/ Flag superblock for updating after a resync/recovery finishes.
4/ If we find the neeed to restart a recovery in the middle (version-1
   metadata only) make sure a full recovery (not just as guided by
   bitmaps) does get done.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:21 -08:00
NeilBrown
c2b00852fb [PATCH] md: return a non-zero error to bi_end_io as appropriate in raid5
Currently raid5 depends on clearing the BIO_UPTODATE flag to signal an error
to higher levels.  While this should be sufficient, it is safer to explicitly
set the error code as well - less room for confusion.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:21 -08:00
NeilBrown
b8c6b64556 [PATCH] md: remove some old ifdefed-out code from raid5.c
There are some vestiges of old code that was used for bypassing the stripe
cache on reads in raid5.c.  This was never updated after the change from
buffer_heads to bios, but was left as a reminder.

That functionality has nowe been implemented in a completely different way, so
the old code can go.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:21 -08:00
Jeff Garzik
fdee8ae449 [PATCH] MD: conditionalize some code
The autorun code is only used if this module is built into the static
kernel image.  Adjust #ifdefs accordingly.

Signed-off-by: Jeff Garzik <jeff@garzik.org>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:21 -08:00
NeilBrown
b875e531fc [PATCH] md: fix innocuous bug in raid6 stripe_to_pdidx
stripe_to_pdidx finds the index of the parity disk for a given stripe.  It
assumes raid5 in that it uses "disks-1" to determine the number of data disks.

This is incorrect for raid6 but fortunately the two usages cancel each other
out.  The only way that 'data_disks' affects the calculation of pd_idx in
raid5_compute_sector is when it is divided into the sector number.  But as
that sector number is calculated by multiplying in the wrong value of
'data_disks' the division produces the right value.

So it is innocuous but needs to be fixed.

Also change the calculation of raid_disks in compute_blocknr to make it
more obviously correct (it seems at first to always use disks-1 too).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:21 -08:00
Raz Ben-Jehuda(caro)
5248861511 [PATCH] md: enable bypassing cache for reads
Call the chunk_aligned_read where appropriate.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:20 -08:00
Raz Ben-Jehuda(caro)
46031f9a38 [PATCH] md: allow reads that have bypassed the cache to be retried on failure
If a bypass-the-cache read fails, we simply try again through the cache.  If
it fails again it will trigger normal recovery precedures.

update 1:

From: NeilBrown <neilb@suse.de>

1/
  chunk_aligned_read and retry_aligned_read assume that
      data_disks == raid_disks - 1
  which is not true for raid6.
  So when an aligned read request bypasses the cache, we can get the wrong data.

2/ The cloned bio is being used-after-free in raid5_align_endio
   (to test BIO_UPTODATE).

3/ We forgot to add rdev->data_offset when submitting
   a bio for aligned-read

4/ clone_bio calls blk_recount_segments and then we change bi_bdev,
   so we need to invalidate the segment counts.

5/ We don't de-reference the rdev when the read completes.
   This means we need to record the rdev to so it is still
   available in the end_io routine.  Fortunately
   bi_next in the original bio is unused at this point so
   we can stuff it in there.

6/ We leak a cloned bio if the target rdev is not usable.

From: NeilBrown <neilb@suse.de>

update 2:

1/ When aligned requests fail (read error) they need to be retried
   via the normal method (stripe cache).  As we cannot be sure that
   we can process a single read in one go (we may not be able to
   allocate all the stripes needed) we store a bio-being-retried
   and a list of bioes-that-still-need-to-be-retried.
   When find a bio that needs to be retried, we should add it to
   the list, not to single-bio...

2/ We were never incrementing 'scnt' when resubmitting failed
   aligned requests.

[akpm@osdl.org: build fix]
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:20 -08:00
Raz Ben-Jehuda(caro)
f679623f50 [PATCH] md: handle bypassing the read cache (assuming nothing fails)
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:20 -08:00
Raz Ben-Jehuda(caro)
23032a0eb9 [PATCH] md: define raid5_mergeable_bvec
This will encourage read request to be on only one device, so we will often be
able to bypass the cache for read requests.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:20 -08:00
NeilBrown
0d4ca600fc [PATCH] md: tidy up device-change notification when an md array is stopped
An md array can be stopped leaving all the setting still in place, or it can
torn down and destroyed.  set_capacity and other change notifications only
happen in the latter case, but should happen in both.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:20 -08:00
Adrian Bunk
c642f9e03b [PATCH] make drivers/md/dm-snap.c:ksnapd static
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:09 -08:00
Jonathan E Brassow
88b20a1a71 [PATCH] dm: raid1: reset sync_search on resume
Reset sync_search on resume.  The effect is to retry syncing all out-of-sync
regions when a mirror is resumed, including ones that previously failed.

Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:09 -08:00
Jonathan E Brassow
f3ee6b2f62 [PATCH] dm: log: rename complete_resync_work
The complete_resync_work function only provides the ability to change an
out-of-sync region to in-sync.  This patch enhances the function to allow us
to change the status from in-sync to out-of-sync as well, something that is
needed when a mirror write to one of the devices or an initial resync on a
given region fails.

Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:09 -08:00
Milan Broz
31c93a0c29 [PATCH] dm: snapshot: abstract memory release
Move the code that releases memory used by a snapshot into a separate function.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:09 -08:00
Kiyoshi Ueda
45e157206c [PATCH] dm: mpath: use noflush suspending
Implement the pushback feature for the multipath target.

The pushback request is used when:
  1) there are no valid paths;
  2) queue_if_no_path was set;
  3) a suspend is being issued with the DMF_NOFLUSH_SUSPENDING flag.
     Otherwise bios are returned to applications with -EIO.

To check whether queue_if_no_path is specified or not, you need to check
both queue_if_no_path and saved_queue_if_no_path, because presuspend saves
the original queue_if_no_path value to saved_queue_if_no_path.

The check for 1 already exists in both map_io() and do_end_io().
So this patch adds __must_push_back() to check 2 and 3.

Test results:
See the test results in the preceding patch.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:09 -08:00
Kiyoshi Ueda
2e93ccc193 [PATCH] dm: suspend: add noflush pushback
In device-mapper I/O is sometimes queued within targets for later processing.
For example the multipath target can be configured to store I/O when no paths
are available instead of returning it -EIO.

This patch allows the device-mapper core to instruct a target to transfer the
contents of any such in-target queue back into the core.  This frees up the
resources used by the target so the core can replace that target with an
alternative one and then resend the I/O to it.  Without this patch the only
way to change the target in such circumstances involves returning the I/O with
an error back to the filesystem/application.  In the multipath case, this
patch will let us add new paths for existing I/O to try after all the existing
paths have failed.

    DMF_NOFLUSH_SUSPENDING
    ----------------------

If the DM_NOFLUSH_FLAG ioctl option is specified at suspend time, the
DMF_NOFLUSH_SUSPENDING flag is set in md->flags during dm_suspend().  It
is always cleared before dm_suspend() returns.

The flag must be visible while the target is flushing pending I/Os so it
is set before presuspend where the flush starts and unset after the wait
for md->pending where the flush ends.

Target drivers can check this flag by calling dm_noflush_suspending().

    DM_MAPIO_REQUEUE / DM_ENDIO_REQUEUE
    -----------------------------------

A target's map() function can now return DM_MAPIO_REQUEUE to request the
device mapper core queue the bio.

Similarly, a target's end_io() function can return DM_ENDIO_REQUEUE to request
the same.  This has been labelled 'pushback'.

The __map_bio() and clone_endio() functions in the core treat these return
values as errors and call dec_pending() to end the I/O.

    dec_pending
    -----------

dec_pending() saves the pushback request in struct dm_io->error.  Once all
the split clones have ended, dec_pending() will put the original bio on
the md->pushback list.  Note that this supercedes any I/O errors.

It is possible for the suspend with DM_NOFLUSH_FLAG to be aborted while
in progress (e.g. by user interrupt).  dec_pending() checks for this and
returns -EIO if it happened.

    pushdback list and pushback_lock
    --------------------------------

The bio is queued on md->pushback temporarily in dec_pending(), and after
all pending I/Os return, md->pushback is merged into md->deferred in
dm_suspend() for re-issuing at resume time.

md->pushback_lock protects md->pushback.
The lock should be held with irq disabled because dec_pending() can be
called from interrupt context.

Queueing bios to md->pushback in dec_pending() must be done atomically
with the check for DMF_NOFLUSH_SUSPENDING flag.  So md->pushback_lock is
held when checking the flag.  Otherwise dec_pending() may queue a bio to
md->pushback after the interrupted dm_suspend() flushes md->pushback.
Then the bio would be left in md->pushback.

Flag setting in dm_suspend() can be done without md->pushback_lock because
the flag is checked only after presuspend and the set value is already
made visible via the target's presuspend function.

The flag can be checked without md->pushback_lock (e.g. the first part of
the dec_pending() or target drivers), because the flag is checked again
with md->pushback_lock held when the bio is really queued to md->pushback
as described above.  So even if the flag is cleared after the lockless
checkings, the bio isn't left in md->pushback but returned to applications
with -EIO.

    Other notes on the current patch
    --------------------------------

- md->pushback is added to the struct mapped_device instead of using
  md->deferred directly because md->io_lock which protects md->deferred is
  rw_semaphore and can't be used in interrupt context like dec_pending(),
  and md->io_lock protects the DMF_BLOCK_IO flag of md->flags too.

- Don't issue lock_fs() in dm_suspend() if the DM_NOFLUSH_FLAG
  ioctl option is specified, because I/Os generated by lock_fs() would be
  pushed back and never return if there were no valid devices.

- If an error occurs in dm_suspend() after the DMF_NOFLUSH_SUSPENDING
  flag is set, md->pushback must be flushed because I/Os may be queued to
  the list already.  (flush_and_out label in dm_suspend())

    Test results
    ------------

I have tested using multipath target with the next patch.

The following tests are for regression/compatibility:
  - I/Os succeed when valid paths exist;
  - I/Os fail when there are no valid paths and queue_if_no_path is not
    set;
  - I/Os are queued in the multipath target when there are no valid paths and
    queue_if_no_path is set;
  - The queued I/Os above fail when suspend is issued without the
    DM_NOFLUSH_FLAG ioctl option.  I/Os spanning 2 multipath targets also
    fail.

The following tests are for the normal code path of new pushback feature:
  - Queued I/Os in the multipath target are flushed from the target
    but don't return when suspend is issued with the DM_NOFLUSH_FLAG
    ioctl option;
  - The I/Os above are queued in the multipath target again when
    resume is issued without path recovery;
  - The I/Os above succeed when resume is issued after path recovery
    or table load;
  - Queued I/Os in the multipath target succeed when resume is issued
    with the DM_NOFLUSH_FLAG ioctl option after table load. I/Os
    spanning 2 multipath targets also succeed.

The following tests are for the error paths of the new pushback feature:
  - When the bdget_disk() fails in dm_suspend(), the
    DMF_NOFLUSH_SUSPENDING flag is cleared and I/Os already queued to the
    pushback list are flushed properly.
  - When suspend with the DM_NOFLUSH_FLAG ioctl option is interrupted,
      o I/Os which had already been queued to the pushback list
        at the time don't return, and are re-issued at resume time;
      o I/Os which hadn't been returned at the time return with EIO.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:09 -08:00
Kiyoshi Ueda
81fdb096db [PATCH] dm: ioctl: add noflush suspend
Provide a dm ioctl option to request noflush suspending.  (See next patch for
what this is for.) As the interface is extended, the version number is
incremented.

Other than accepting the new option through the interface, There is no change
to existing behaviour.

Test results:
Confirmed the option is given from user-space correctly.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:09 -08:00
Kiyoshi Ueda
d2a7ad29a8 [PATCH] dm: map and endio symbolic return codes
Update existing targets to use the new symbols for return values from target
map and end_io functions.

There is no effect on behaviour.

Test results:
Done build test without errors.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:09 -08:00
Kiyoshi Ueda
45cbcd7983 [PATCH] dm: map and endio return code clarification
Tighten the use of return values from the target map and end_io functions.
Values of 2 and above are now explictly reserved for future use.  There are no
existing targets using such values.

The patch has no effect on existing behaviour.

o Reserve return values of 2 and above from target map functions.
  Any positive value currently indicates "mapping complete", but all
  existing drivers use the value 1.  We now make that a requirement
  so we can assign new meaning to higher values in future.

  The new definition of return values from target map functions is:
      < 0 : error
      = 0 : The target will handle the io (DM_MAPIO_SUBMITTED).
      = 1 : Mapping completed (DM_MAPIO_REMAPPED).
      > 1 : Reserved (undefined).  Previously this was the same as '= 1'.

o Reserve return values of 2 and above from target end_io functions
  for similar reasons.
  DM_ENDIO_INCOMPLETE is introduced for a return value of 1.

Test results:

  I have tested by using the multipath target.

  I/Os succeed when valid paths exist.

  I/Os are queued in the multipath target when there are no valid paths and
queue_if_no_path is set.

  I/Os fail when there are no valid paths and queue_if_no_path is not set.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:09 -08:00
Kiyoshi Ueda
a3d77d35be [PATCH] dm: suspend: parameter change
Change the interface of dm_suspend() so that we can pass several options
without increasing the number of parameters.  The existing 'do_lockfs' integer
parameter is replaced by a flag DM_SUSPEND_LOCKFS_FLAG.

There is no functional change to the code.

Test results:
I have tested 'dmsetup suspend' command with/without the '--nolockfs'
option and confirmed the do_lockfs value is correctly set.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:09 -08:00
Kiyoshi Ueda
7485936463 [PATCH] dm: tidy core formatting
Remove unnecessary spaces in dm.c.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:08 -08:00
Heinz Mauelshagen
f00b16ad66 [PATCH] dm io: fix bi_max_vecs
The existing code allocates an extra slot in bi_io_vec[] and uses it to store
the region number.

This patch hides the extra slot from bio_add_page() so the region number can't
get overwritten.

Also remove a hard-coded SECTOR_SHIFT and fix a typo in a comment.

Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Milan Broz <mbroz@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:08 -08:00
David Howells
f0d1b0b30d [PATCH] LOG2: Implement a general integer log2 facility in the kernel
This facility provides three entry points:

	ilog2()		Log base 2 of unsigned long
	ilog2_u32()	Log base 2 of u32
	ilog2_u64()	Log base 2 of u64

These facilities can either be used inside functions on dynamic data:

	int do_something(long q)
	{
		...;
		y = ilog2(x)
		...;
	}

Or can be used to statically initialise global variables with constant values:

	unsigned n = ilog2(27);

When performing static initialisation, the compiler will report "error:
initializer element is not constant" if asked to take a log of zero or of
something not reducible to a constant.  They treat negative numbers as
unsigned.

When not dealing with a constant, they fall back to using fls() which permits
them to use arch-specific log calculation instructions - such as BSR on
x86/x86_64 or SCAN on FRV - if available.

[akpm@osdl.org: MMC fix]
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David Howells <dhowells@redhat.com>
Cc: Wojtek Kaniewski <wojtekka@toxygen.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:51 -08:00
Josef Sipek
c649bb9c55 [PATCH] struct path: convert md
Signed-off-by: Josef Sipek <jsipek@fsl.cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:47 -08:00
Josef "Jeff" Sipek
c922d5f7f5 [PATCH] struct path: rename DM's struct path
Rename DM's struct path to struct dm_path to prevent name collision between it
and struct path from fs/namei.c.

Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Cc: <reiserfs-dev@namesys.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:40 -08:00
NeilBrown
d63a5a74de [PATCH] lockdep: avoid lockdep warning in md
md_open takes ->reconfig_mutex which causes lockdep to complain.  This
(normally) doesn't have deadlock potential as the possible conflict is with a
reconfig_mutex in a different device.

I say "normally" because if a loop were created in the array->member hierarchy
a deadlock could happen.  However that causes bigger problems than a deadlock
and should be fixed independently.

So we flag the lock in md_open as a nested lock.  This requires defining
mutex_lock_interruptible_nested.

Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:39 -08:00
Peter Zijlstra
2e7b651df1 [PATCH] remove the old bd_mutex lockdep annotation
Remove the old complex and crufty bd_mutex annotation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jason Baron <jbaron@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:38 -08:00
Linus Torvalds
2685b267bc Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (48 commits)
  [NETFILTER]: Fix non-ANSI func. decl.
  [TG3]: Identify Serdes devices more clearly.
  [TG3]: Use msleep.
  [TG3]: Use netif_msg_*.
  [TG3]: Allow partial speed advertisement.
  [TG3]: Add TG3_FLG2_IS_NIC flag.
  [TG3]: Add 5787F device ID.
  [TG3]: Fix Phy loopback.
  [WANROUTER]: Kill kmalloc debugging code.
  [TCP] inet_twdr_hangman: Delete unnecessary memory barrier().
  [NET]: Memory barrier cleanups
  [IPSEC]: Fix inetpeer leak in ipv4 xfrm dst entries.
  audit: disable ipsec auditing when CONFIG_AUDITSYSCALL=n
  audit: Add auditing to ipsec
  [IRDA] irlan: Fix compile warning when CONFIG_PROC_FS=n
  [IrDA]: Incorrect TTP header reservation
  [IrDA]: PXA FIR code device model conversion
  [GENETLINK]: Fix misplaced command flags.
  [NETLIK]: Add a pointer to the Generic Netlink wiki page.
  [IPV6] RAW: Don't release unlocked sock.
  ...
2006-12-07 09:05:15 -08:00
Nigel Cunningham
7dfb71030f [PATCH] Add include/linux/freezer.h and move definitions from sched.h
Move process freezing functions from include/linux/sched.h to freezer.h, so
that modifications to the freezer or the kernel configuration don't require
recompiling just about everything.

[akpm@osdl.org: fix ueagle driver]
Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Christoph Lameter
e18b890bb0 [PATCH] slab: remove kmem_cache_t
Replace all uses of kmem_cache_t with struct kmem_cache.

The patch was generated using the following script:

	#!/bin/sh
	#
	# Replace one string by another in all the kernel sources.
	#

	set -e

	for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
		quilt add $file
		sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
		mv /tmp/$$ $file
		quilt refresh
	done

The script was run like this

	sh replace kmem_cache_t "struct kmem_cache"

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:25 -08:00
Herbert Xu
79066ad32b [CRYPTO] dm-crypt: Make iv_gen_private a union
Rather than stuffing integers into pointers with casts, let's use
a union.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-06 18:39:01 -08:00
Herbert Xu
45789328e5 [BLOCK] dm-crypt: Align IV to u64 for essiv
This patch makes the IV u64-aligned since essiv does a u64 store to it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2006-12-06 18:38:48 -08:00
Rik Snel
48527fa7cf [BLOCK] dm-crypt: benbi IV, big endian narrow block count for LRW-32-AES
LRW-32-AES needs a certain IV. This IV should be provided dm-crypt.
The block cipher mode could, in principle generate the correct IV from
the plain IV, but I think that it is cleaner to supply the right IV
directly.

The sector -> narrow block calculation uses a shift for performance reasons.
This shift is computed in .ctr and stored in cc->iv_gen_private (as a void *).

Signed-off-by: Rik Snel <rsnel@cube.dyndns.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2006-12-06 18:38:47 -08:00
David Howells
c4028958b6 WorkStruct: make allyesconfig
Fix up for make allyesconfig.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-11-22 14:57:56 +00:00
Rafael J. Wysocki
4b438a23fb [PATCH] md: do not freeze md threads for suspend
If there's a swap file on a software RAID, it should be possible to use this
file for saving the swsusp's suspend image.  Also, this file should be
available to the memory management subsystem when memory is being freed before
the suspend image is created.

For the above reasons it seems that md_threads should not be frozen during the
suspend and the appended patch makes this happen, but then there is the
question if they don't cause any data to be written to disks after the suspend
image has been created, provided that all filesystems are frozen at that time.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-08 18:29:24 -08:00
NeilBrown
0692c6b1cf [PATCH] md: fix sizing problem with raid5-reshape and CONFIG_LBD=n
I forgot to has the size-in-blocks to (loff_t) before shifting up to a
size-in-bytes.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-08 18:29:24 -08:00
NeilBrown
2f47130361 [PATCH] md: change ONLINE/OFFLINE events to a single CHANGE event
It turns out that CHANGE is preferred to ONLINE/OFFLINE for various reasons
(not least of which being that udev understands it already).

So remove the recently added KOBJ_OFFLINE (no-one is likely to care anyway)
and change the ONLINE to a CHANGE event

Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-08 18:29:23 -08:00
Jonathan E Brassow
33184048dc [PATCH] dm: raid1: fix waiting for io on suspend
All device-mapper targets must complete outstanding I/O before suspending.
The mirror target generates I/O in its recovery phase and fails to wait for
it.  It needs to be tracked so we can ensure that it has completed before we
suspend.

[akpm@osdl.org: cleanup]
Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: <dm-devel@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-08 18:29:23 -08:00
Jonathan E Brassow
5d55fdf949 [PATCH] dm: multipath: fix rr_add_path order
When adding paths to the round-robin path selector, their order gets inverted,
which is not desirable.

Fix by replacing list_add() with list_add_tail().

Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: <dm-devel@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-08 18:29:23 -08:00
Alasdair G Kergon
d287483d6d [PATCH] dm: suspend: fix error path
If the device is already suspended, just return the error and skip the code
that would incorrectly wipe md->suspended_bdev.

(This isn't currently a problem because existing code avoids calling this
function if the device is already suspended.)

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: <dm-devel@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-08 18:29:23 -08:00
Alasdair G Kergon
bfc5ecdf48 [PATCH] dm: fix find_device race
There is a race between dev_create() and find_device().

If the mdptr has not yet been stored against a device, find_device() needs to
behave as though no device was found.  It already returns NULL, but there is a
dm_put() missing: it must drop the reference dm_get_md() took.

The bug was introduced by dm-fix-mapped-device-ref-counting.patch.

It manifests itself if another dm ioctl attempts to reference a newly-created
device while the device creation ioctl is still running.  The consequence is
that the device cannot be removed until the machine is rebooted.  Certain udev
configurations can lead to this happening.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: <dm-devel@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-08 18:29:23 -08:00
NeilBrown
7870db4c7f [PATCH] md: send online/offline uevents when an md array starts/stops
This allows udev to do something intelligent when an array becomes
available.

Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-03 12:27:55 -08:00
Christophe Saout
37af6560f7 [PATCH] Fix dmsetup table output change
Fix dm-crypt after the block cipher API changes to correctly return the
backwards compatible cipher-chainmode[-ivmode] format for "dmsetup
table".

Signed-off-by: Christophe Saout <christophe@saout.de>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

diff linux-2.6.19-rc3.orig/drivers/md/dm-crypt.c linux-2.6.19-rc3/drivers/md/dm-crypt.c
2006-10-30 12:02:57 -08:00
Randy Dunlap
969b755aad [PATCH] md: fix printk format warnings, seen on powerpc64:
drivers/md/raid1.c:1479: warning: long long unsigned int format, long unsigned int arg (arg 4)
drivers/md/raid10.c:1475: warning: long long unsigned int format, long unsigned int arg (arg 4)

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:52 -07:00
NeilBrown
750a8f3e8f [PATCH] md: fix up maintenance of ->degraded in multipath
A recent fix which made sure ->degraded was initialised properly exposed a
second bug - ->degraded wasn't been updated when drives failed or were
hot-added.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:51 -07:00
NeilBrown
01ab5662f5 [PATCH] md: simplify checking of available size when resizing an array
When "mdadm --grow --size=xxx" is used to resize an array (use more or less of
each device), we check the new siza against the available space in each
device.

We already have that number recorded in rdev->size, so calculating it is
pointless (and wrong in one obscure case).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:51 -07:00
NeilBrown
2b6e845986 [PATCH] md: fix bug where spares don't always get rebuilt properly when they become live
If save_raid_disk is >= 0, then the device could be a device that is already
in sync that is being re-added.  So we need to default this value to -1.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:51 -07:00
NeilBrown
4f2e639af4 [PATCH] md: endian annotations for the bitmap superblock
And a couple of bug fixes found by sparse.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-21 13:35:05 -07:00
NeilBrown
1c05b4bc22 [PATCH] md: endian annotation for v1 superblock access
Includes a couple of bugfixes found by sparse.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-21 13:35:05 -07:00
NeilBrown
2e333e8986 [PATCH] md: fix calculation of ->degraded for multipath and raid10
Two less-used md personalities have bugs in the calculation of ->degraded (the
extent to which the array is degraded).

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-21 13:35:05 -07:00
Andrew Morton
3fcfab16c5 [PATCH] separate bdi congestion functions from queue congestion functions
Separate out the concept of "queue congestion" from "backing-dev congestion".
Congestion is a backing-dev concept, not a queue concept.

The blk_* congestion functions are retained, as wrappers around the core
backing-dev congestion functions.

This proper layering is needed so that NFS can cleanly use the congestion
functions, and so that CONFIG_BLOCK=n actually links.

Cc: "Thomas Maier" <balagi@justmail.de>
Cc: "Jens Axboe" <jens.axboe@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: David Howells <dhowells@redhat.com>
Cc: Peter Osterlund <petero2@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20 10:26:35 -07:00
Akinobu Mita
e24650c2e7 [PATCH] md: fix /proc/mdstat refcounting
I have seen mdadm oops after successfully unloading md module.

This patch privents from unloading md module while
mdadm is polling /proc/mdstat.

Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Akinbou Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17 08:18:43 -07:00
Alexey Dobriyan
5f6e3c8365 [PATCH] md: use BUILD_BUG_ON
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:26 -07:00
NeilBrown
5842730de1 [PATCH] md: fix bug where new drives added to an md array sometimes don't sync properly
This fixes a bug introduced in 2.6.18.

If a drive is added to a raid1 using older tools (mdadm-1.x or raidtools)
then it will be included in the array without any resync happening.

It has been submitted for 2.6.18.1.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-06 08:53:41 -07:00
Eric Sesterhenn
52e5f9d1cf BUG_ON cleanup for drivers/md/
This changes two if() BUG(); usages to BUG_ON(); so people
can disable it safely.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-10-03 23:33:23 +02:00
NeilBrown
3a0f5bbb1a [PATCH] md: add error reporting to superblock write failure
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:19 -07:00
Paul Clements
a638b2dc95 [PATCH] md: use ffz instead of find_first_set to convert multiplier to shift
find_first_set doesn't find the least-significant bit on bigendian machines,
so it is really wrong to use it.

ffs is closer, but takes an 'int' and we have a 'unsigned long'.  So use
ffz(~X) to convert a chunksize into a chunkshift.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:18 -07:00
NeilBrown
14f50b49fd [PATCH] md: remove 'experimental' classification from raid5 reshape
I have had enough success reports not to believe that this is safe for 2.6.19.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:18 -07:00
NeilBrown
e8703fe1f5 [PATCH] md: remove MAX_MD_DEVS which is an arbitrary limit
Once upon a time we needed to fixed limit to the number of md devices,
probably because we preallocated some array.  This need no longer exists, but
we still have an arbitrary limit.

So remove MAX_MD_DEVS and allow as many devices as we can fit into the 'minor'
part of a device number.

Also remove some useless noise at init time (which reports MAX_MD_DEVS) and
remove MD_THREAD_NAME_MAX which hasn't been used for a while.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:18 -07:00
NeilBrown
61df9d91e9 [PATCH] md: make messages about resync/recovery etc more specific
It is possible to request a 'check' of an md/raid array where the whole array
is read and consistancies are reported.

This uses the same mechanisms as 'resync' and so reports in the kernel logs
that a resync is being started.  This understandably confuses/worries people.

Also the text in /proc/mdstat suggests a 'resync' is happen when it is just a
check.

This patch changes those messages to be more specific about what is happening.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:18 -07:00
NeilBrown
f022b2fddd [PATCH] md: add a ->congested_fn function for raid5/6
This is very different from other raid levels and all requests go through a
'stripe cache', and it has congestion management already.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:18 -07:00
NeilBrown
0d12922823 [PATCH] md: define ->congested_fn for raid1, raid10, and multipath
raid1, raid10 and multipath don't report their 'congested' status through
bdi_*_congested, but should.

This patch adds the appropriate functions which just check the 'congested'
status of all active members (with appropriate locking).

raid1 read_balance should be modified to prefer devices where
bdi_read_congested returns false.  Then we could use the '&' branch rather
than the '|' branch.  However that should would need some benchmarking first
to make sure it is actually a good idea.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:18 -07:00
NeilBrown
26be34dc3a [PATCH] md: define backing_dev_info.congested_fn for raid0 and linear
Each backing_dev needs to be able to report whether it is congested, either by
modulating BDI_*_congested in ->state, or by defining a ->congested_fn.
md/raid did neither of these.  This patch add a congested_fn which simply
checks all component devices to see if they are congested.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:18 -07:00
NeilBrown
c04be0aa82 [PATCH] md: Improve locking around error handling
The error handling routines don't use proper locking, and so two concurrent
errors could trigger a problem.

So:
  - use test-and-set and test-and-clear to synchonise
    the In_sync bits with the ->degraded count
  - use the spinlock to protect updates to the
    degraded count (could use an atomic_t but that
    would be a bigger change in code, and isn't
    really justified)
  - remove un-necessary locking in raid5

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:18 -07:00
NeilBrown
11ce99e625 [PATCH] md: Remove working_disks from raid1 state data
It is equivalent to conf->raid_disks - conf->mddev->degraded.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:17 -07:00
NeilBrown
867868fb55 [PATCH] md: Factor out part of raid1d into a separate function
raid1d has toooo many nested block, so take the fix_read_error functionality
out into a separate function.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:17 -07:00
Coywolf Qi Hunt
2d2063ceae [PATCH] md: remove unnecessary variable x in stripe_to_pdidx()
Signed-off-by: Coywolf Qi Hunt <qiyong@freeforge.net>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:17 -07:00
Paul Clements
9b1d1dac18 [PATCH] md: new sysfs interface for setting bits in the write-intent-bitmap
Add a new sysfs interface that allows the bitmap of an array to be dirtied.
The interface is write-only, and is used as follows:

echo "1000" > /sys/block/md2/md/bitmap

(dirty the bit for chunk 1000 [offset 0] in the in-memory and on-disk
bitmaps of array md2)

echo "1000-2000" > /sys/block/md1/md/bitmap

(dirty the bits for chunks 1000-2000 in md1's bitmap)

This is useful, for example, in cluster environments where you may need to
combine two disjoint bitmaps into one (following a server failure, after a
secondary server has taken over the array).  By combining the bitmaps on
the two servers, a full resync can be avoided (This was discussed on the
list back on March 18, 2005, "[PATCH 1/2] md bitmap bug fixes" thread).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:17 -07:00
NeilBrown
76186dd8b7 [PATCH] md: remove 'working_disks' from raid10 state
It isn't needed as mddev->degraded contains equivalent info.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:17 -07:00
NeilBrown
02c2de8cc8 [PATCH] md: remove the working_disks and failed_disks from raid5 state data.
They are not needed.  conf->failed_disks is the same as mddev->degraded and
conf->working_disks is conf->raid_disks - mddev->degraded.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:17 -07:00
NeilBrown
850b2b420c [PATCH] md: replace magic numbers in sb_dirty with well defined bit flags
Instead of magic numbers (0,1,2,3) in sb_dirty, we have
some flags instead:
MD_CHANGE_DEVS
   Some device state has changed requiring superblock update
   on all devices.
MD_CHANGE_CLEAN
   The array has transitions from 'clean' to 'dirty' or back,
   requiring a superblock update on active devices, but possibly
   not on spares
MD_CHANGE_PENDING
   A superblock update is underway.

We wait for an update to complete by waiting for all flags to be clear.  A
flag can be set at any time, even during an update, without risk that the
change will be lost.

Stop exporting md_update_sb - isn't needed.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:17 -07:00
NeilBrown
6814d5368d [PATCH] md: factor out part of raid10d into a separate function.
raid10d has toooo many nested block, so take the fix_read_error functionality
out into a separate function.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:17 -07:00
Adrian Bunk
fbedac04fa [PATCH] md: the scheduled removal of the START_ARRAY ioctl for md
This patch contains the scheduled removal of the START_ARRAY ioctl for md.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:16 -07:00
Bryn Reeves
999d816851 [PATCH] dm table: add target flush
This patch adds support for a per-target dm_flush_fn method.  This is needed
to allow dm-loop to invalidate page cache mappings in response to BLKFLSBUF
ioctl commands.

Signed-off-by: Bryn Reeves <breeves@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:16 -07:00
Bryn Reeves
3cb4021453 [PATCH] dm: extract device limit setting
Separate the setting of device I/O limits from dm_get_device().  dm-loop will
use this.

Signed-off-by: Bryn Reeves <breeves@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:16 -07:00
Stefan Bader
9faf400f7e [PATCH] dm: use private biosets
I found a problem within device-mapper that occurs in low-mem situations.  It
was found using a mirror target but I think in theory it would hit any setup
that stacks device-mapper devices (like LVM on top of multipath).

Since device-mapper core uses the common fs_bioset in clone_bio(), and a
private, but still global, bio_set in split_bvec() it is possible that the
filesystem and the first level target successfully get bios but the lower
level target doesn't because there is no more memory and the pool was drained
by upper layers.  So the remapping will be stuck forever.  To solve this
device-mapper core needs to use a private bio_set for each device.

Signed-off-by: Stefan Bader <Stefan.Bader@de.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:16 -07:00
Milan Broz
6a24c71843 [PATCH] dm crypt: use private biosets
In the low memory situation dm-crypt needs to use a private mempool of bios to
avoid blocking.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:16 -07:00
Milan Broz
23541d2d28 [PATCH] dm crypt: move io to workqueue
This patch is designed to help dm-crypt comply with the
new constraints imposed by the following patch in -mm:
  md-dm-reduce-stack-usage-with-stacked-block-devices.patch

Under low memory the existing implementation relies upon waiting for I/O
submitted recursively to generic_make_request() completing before the original
generic_make_request() call can return.

This patch moves the I/O submission to a workqueue so the original
generic_make_request() can return immediately.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:16 -07:00
Milan Broz
93e605c237 [PATCH] dm crypt: restructure write processing
Restructure the dm-crypt write processing in preparation for workqueue changes
in the next patches.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:15 -07:00
Milan Broz
8b00445716 [PATCH] dm crypt: restructure for workqueue change
Restructure part of the dm-crypt code in preparation for workqueue changes.

Use 'base_bio' or 'clone' variable names consistently throughout.  No
functional changes are included in this patch.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:15 -07:00
Milan Broz
e48d4bbf96 [PATCH] dm crypt: add key msg
Add the facility to wipe the encryption key from memory (for example while a
laptop is suspended) and reinstate it later (when the laptop gets resumed).

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:15 -07:00
Milan Broz
8757b7764f [PATCH] dm table: add target preresume
This patch adds a target preresume hook.

It is called before the targets are resumed and if it returns an error the
resume gets cancelled.

The crypt target will use this to indicate that it is unable to process I/O
because no encryption key has been supplied.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:15 -07:00
Bryn Reeves
cc1092019c [PATCH] dm: add debug macro
Add CONFIG_DM_DEBUG and DMDEBUG() macro.

Signed-off-by: Bryn Reeves <breeves@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:15 -07:00
Hannes Reinecke
8560ed6fa8 [PATCH] dm: add uevent change event on resume
Device-mapper devices are not accessible until a 'resume' ioctl has been
issued.  For userspace to find out when this happens we need to generate an
uevent for udev to take appropriate action.

As discussed at OLS we should send 'change' events for 'resume'.  We can think
of no useful purpose served by also having 'suspend' events.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:15 -07:00
Micha³ Miros³aw
e69fae561f [PATCH] dm mpath: use kzalloc
Use kzalloc() instead of kmalloc() + memset().

Signed-off-by: Micha³ Miros³aw <mirq-linux@rere.qmqm.pl>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:15 -07:00
Micha³ Miros³aw
28f16c2039 [PATCH] dm mpath: tidy ctr
After initialising m->ti, there's no need to pass it in subsequent calls to
static functions used for parsing parameters.

Signed-off-by: Micha³ Miros³aw <mirq-linux@rere.qmqm.pl>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:15 -07:00
Jonathan Brassow
e52b8f6dbe [PATCH] dm mirror: remove trailing space from table
Remove trailing space from 'dmsetup table' output.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:15 -07:00
Alasdair G Kergon
695368ac33 [PATCH] dm snapshot: fix freeing pending exception
If a snapshot became invalid while there are outstanding pending_exceptions,
when pending_complete() processes each one it forgets to remove the
corresponding exception from its exception table before freeing it.

Fix this by moving the 'out:' label up one statement so that
remove_exception() is always called.  Then __invalidate_exception() no longer
needs to call it and its 'pe' argument become superfluous.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:15 -07:00
Alasdair G Kergon
4b832e8de2 [PATCH] dm snapshot: tidy pe ref counting
Rename sibling_count to ref_count and introduce get and put functions.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:14 -07:00
Alasdair G Kergon
ca3a931fd3 [PATCH] dm snapshot: add workqueue
Add a workqueue so that I/O can be queued up to be flushed from a separate
thread (e.g.  if local interrupts are disabled).

A new per-snapshot spinlock pe_lock is introduced to protect queued_bios.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:14 -07:00
Alasdair G Kergon
9d493fa8c9 [PATCH] dm snapshot: tidy pending_complete
This patch rearranges the pending_complete() code so that the functional
changes in subsequent patches are clearer.

By consolidating the error and the non-error paths, we can move
error_snapshot_bios() and __flush_bios() in line.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:14 -07:00
Alasdair G Kergon
ba40a2aa6e [PATCH] dm snapshot: tidy snapshot_map
This patch rearranges the snapshot_map code so that the functional changes in
subsequent patches are clearer.

The only functional change is to replace the existing read lock with a write
lock which the next patch needs.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:14 -07:00
Mark McLoughlin
927ffe7c9a [PATCH] dm snapshot: fix metadata writing when suspending
When suspending a device-mapper device, dm_suspend() sleeps until all
necessary I/O is completed.  This state is triggered by a callback from
persistent_commit().  But some I/O can still be issued *after* the callback
(to prepare the next metadata area for use if the current one is full).  This
patch delays the callback until after that I/O is complete.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:14 -07:00
Mark McLoughlin
e4ff496db7 [PATCH] dm snapshot: make read and write exception functions void
read_exception() and write_exception() only return an error if supplied with
an out-of-range index.  If this ever happens it's the result of a bug in the
calling code so we handle this with an assertion and remove the error handling
in the callers.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:14 -07:00
Mark McLoughlin
f9cea4f707 [PATCH] dm snapshot: fix metadata error handling
Fix the error handling when store.read_metadata is called: the error should be
returned immediately.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:14 -07:00
Mark McLoughlin
4c7e3bf44d [PATCH] dm snapshot: allow zero chunk_size
The chunk size of snapshots cannot be changed so it is redundant to require it
as a parameter when activating an existing snapshot.  Allow a value of zero in
this case and ignore it.  For a new snapshot, use a default value if zero is
specified.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:14 -07:00
Milan Broz
92c060a692 [PATCH] dm snapshot: fix invalidation ENOMEM
Fix ENOMEM error sign.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:14 -07:00
Ishai Rabinovitz
e3f6ac6123 [PATCH] dm: fix alloc_dev error path
While reading the code I found a bug in the error path in alloc_dev in dm.c

When blk_alloc_queue fails there is no call to free_minor.

This patch fixes the problem.

Signed-off-by: Ishai Rabinovitz <ishai@mellanox.co.il>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:14 -07:00
Milan Broz
e90dae1f58 [PATCH] dm: support ioctls on mapped devices: fix with fake file
The new ioctl code passes the wrong file pointer to the underlying device.
No file pointer is available so make a temporary fake one.

ioctl_by_bdev() does set_fs(KERNEL_DS) so it's for ioctls originating
within the kernel and unsuitable here.  We are processing ioctls that
originated in userspace and mapping them to different devices.  Fixing the
existing callers that pass a NULL file struct and consolidating the
fake_file users are separate matters to solve in later patches.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:13 -07:00
Alasdair G Kergon
7006f6eca8 [PATCH] dm: export blkdev_driver_ioctl
Export blkdev_driver_ioctl for device-mapper.

If we get as far as the device-mapper ioctl handler, we know the ioctl is not
a standard block layer BLK* one, so we don't need to check for them a second
time and can call blkdev_driver_ioctl() directly.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:13 -07:00
Milan Broz
9af4aa30b7 [PATCH] dm mpath: support ioctls
When an ioctl is performed on a multipath device simply pass it on to the
underlying block device through current_path.  If current path is not yet
selected, select it.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:13 -07:00
Milan Broz
ab17ffa440 [PATCH] dm linear: support ioctls
When an ioctl is performed on a device with a linear target, simply pass it on
to the underlying block device.

Note that the ioctl will pass through the filtering in blkdev_ioctl() twice.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:13 -07:00
Milan Broz
aa129a2247 [PATCH] dm: support ioctls on mapped devices
Extend the core device-mapper infrastructure to accept arbitrary ioctls on a
mapped device provided that it has exactly one target and it is capable of
supporting ioctls.

[We can't use unlocked_ioctl because we need 'inode': 'file' might be NULL.
Is it worth changing this?]

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

Arnd Bergmann <arnd@arndb.de> wrote:

> Am Wednesday 21 June 2006 21:31 schrieb Alasdair G Kergon:
> > static struct block_device_operations dm_blk_dops = {
> > .open = dm_blk_open,
> > .release = dm_blk_close,
> > +.ioctl = dm_blk_ioctl,
> > .getgeo = dm_blk_getgeo,
> > .owner = THIS_MODULE
>
> I guess this also needs a ->compat_ioctl method, otherwise it won't
> work for ioctl numbers that have a compat_ioctl implementation in the
> low-level device driver.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:13 -07:00
David Howells
9361401eb7 [PATCH] BLOCK: Make it possible to disable the block layer [try #6]
Make it possible to disable the block layer.  Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.

This patch does the following:

 (*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
     support.

 (*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
     an item that uses the block layer.  This includes:

     (*) Block I/O tracing.

     (*) Disk partition code.

     (*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.

     (*) The SCSI layer.  As far as I can tell, even SCSI chardevs use the
     	 block layer to do scheduling.  Some drivers that use SCSI facilities -
     	 such as USB storage - end up disabled indirectly from this.

     (*) Various block-based device drivers, such as IDE and the old CDROM
     	 drivers.

     (*) MTD blockdev handling and FTL.

     (*) JFFS - which uses set_bdev_super(), something it could avoid doing by
     	 taking a leaf out of JFFS2's book.

 (*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
     linux/elevator.h contingent on CONFIG_BLOCK being set.  sector_div() is,
     however, still used in places, and so is still available.

 (*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
     parts of linux/fs.h.

 (*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.

 (*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.

 (*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
     is not enabled.

 (*) fs/no-block.c is created to hold out-of-line stubs and things that are
     required when CONFIG_BLOCK is not set:

     (*) Default blockdev file operations (to give error ENODEV on opening).

 (*) Makes some /proc changes:

     (*) /proc/devices does not list any blockdevs.

     (*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.

 (*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.

 (*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
     given command other than Q_SYNC or if a special device is specified.

 (*) In init/do_mounts.c, no reference is made to the blockdev routines if
     CONFIG_BLOCK is not defined.  This does not prohibit NFS roots or JFFS2.

 (*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
     error ENOSYS by way of cond_syscall if so).

 (*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
     CONFIG_BLOCK is not set, since they can't then happen.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:52:31 +02:00
Jens Axboe
4aff5e2333 [PATCH] Split struct request ->flags into two parts
Right now ->flags is a bit of a mess: some are request types, and
others are just modifiers. Clean this up by splitting it into
->cmd_type and ->cmd_flags. This allows introduction of generic
Linux block message types, useful for sending generic Linux commands
to block devices.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:23:37 +02:00
Rik Snel
3c164bd815 [BLOCK] dm-crypt: trivial comment improvements
Just some minor comment nits.

- little-endian is better than low-endian
- and since it is called essiv everywere it should also be essiv
  in the comments (and not ess_iv)

Signed-off-by: Rik Snel <rsnel@cube.dyndns.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2006-09-21 11:46:27 +10:00
Herbert Xu
3505868791 [CRYPTO] users: Use crypto_hash interface instead of crypto_digest
This patch converts all remaining crypto_digest users to use the new
crypto_hash interface.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2006-09-21 11:46:21 +10:00
Herbert Xu
d1806f6a97 [BLOCK] dm-crypt: Use block ciphers where applicable
This patch converts dm-crypt to use the new block cipher type where
applicable.  It also changes simple cipher operations to use the new
encrypt_one/decrypt_one interface.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2006-09-21 11:46:13 +10:00
NeilBrown
ddac7c7e3a [PATCH] md: Fix issues with referencing rdev in md/raid1
We need to be careful when referencing mirrors[i].rdev.  It can disappear
under us at various times.

So:
  fix a couple of problem places.
  comment a couple of non-problem places
  move an 'atomic_add' which deferences rdev down a little
    way to some where where it is sure to not be NULL.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-01 11:39:08 -07:00
NeilBrown
6394cca548 [PATCH] md: fix recent breakage of md/raid1 array checking
A recent patch broke the ability to do a user-request check of a raid1.
This patch fixes the breakage and also moves a comment that was dislocated
by the same patch.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-27 11:01:31 -07:00
NeilBrown
8469219596 [PATCH] md: avoid backward event updates in md superblock when degraded.
If we
  - shut down a clean array,
  - restart with one (or more) drive(s) missing
  - make some changes
  - pause, so that they array gets marked 'clean',
the event count on the superblock of included drives
will be the same as that of the removed drives.
So adding the removed drive back in will cause it
to be included with no resync.

To avoid this, we only update the eventcount backwards when the array
is not degraded.  In this case there can (should) be no non-connected
drives that we can get confused with, and this is the particular case
where updating-backwards is valuable.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-27 11:01:31 -07:00
Daniel Kobras
c06aad854f [PATCH] dm: Fix deadlock under high i/o load in raid1 setup.
On an nForce4-equipped machine with two SATA disk in raid1 setup using dmraid,
we experienced frequent deadlock of the system under high i/o load.  'cat
/dev/zero > ~/zero' was the most reliable way to reproduce them: Randomly
after a few GB, 'cp' would be left in 'D' state along with kjournald and
kmirrord.  The functions cp and kjournald were blocked in did vary, but
kmirrord's wchan always pointed to 'mempool_alloc()'.  We've seen this pattern
on 2.6.15 and 2.6.17 kernels.  http://lkml.org/lkml/2005/4/20/142 indicates
that this problem has been around even before.

So much for the facts, here's my interpretation: mempool_alloc() first tries
to atomically allocate the requested memory, or falls back to hand out
preallocated chunks from the mempool.  If both fail, it puts the calling
process (kmirrord in this case) on a private waitqueue until somebody refills
the pool.  Where the only 'somebody' is kmirrord itself, so we have a
deadlock.

I worked around this problem by falling back to a (blocking) kmalloc when
before kmirrord would have ended up on the waitqueue.  This defeats part of
the benefits of using the mempool, but at least keeps the system running.  And
it could be done with a two-line change.  Note that mempool_alloc() clears the
GFP_NOIO flag internally, and only uses it to decide whether to wait or return
an error if immediate allocation fails, so the attached patch doesn't change
behaviour in the non-deadlocking case.  Path is against current git
(2.6.18-rc4), but should apply to earlier versions as well.  I've tested on
2.6.15, where this patch makes the difference between random lockup and a
stable system.

Signed-off-by: Daniel Kobras <kobras@linux.de>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-27 11:01:28 -07:00
Michal Miroslaw
485311a23c [PATCH] dm: BUG/OOPS fix
Fix BUG I tripped on while testing failover and multipathing.

BUG shows up on error path in multipath_ctr() when parse_priority_group()
fails after returning at least once without error.  The fix is to
initialize m->ti early - just after alloc()ing it.

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c027c3d2
*pde = 00000000
Oops: 0000 [#3]
Modules linked in: qla2xxx ext3 jbd mbcache sg ide_cd cdrom floppy
CPU:    0
EIP:    0060:[<c027c3d2>]    Not tainted VLI
EFLAGS: 00010202   (2.6.17.3 #1)
EIP is at dm_put_device+0xf/0x3b
eax: 00000001   ebx: ee4fcac0   ecx: 00000000   edx: ee4fcac0
esi: ee4fc4e0   edi: ee4fc4e0   ebp: 00000000   esp: c5db3e78
ds: 007b   es: 007b   ss: 0068
Process multipathd (pid: 15912, threadinfo=c5db2000 task=ef485a90)
Stack: ec4eda40 c02816bd ee4fc4c0 00000000 f7e89498 f883e0bc c02816f6 f7e89480
       f7e8948c c0281801 ffffffea f7e89480 f883e080 c0281ffe 00000001 00000000
       00000004 dfe9cab8 f7a693c0 f883e080 f883e0c0 ca4b99c0 c027c6ee 01400000
Call Trace:
 <c02816bd> free_pgpaths+0x31/0x45  <c02816f6> free_priority_group+0x25/0x2e
 <c0281801> free_multipath+0x35/0x67  <c0281ffe> multipath_ctr+0x123/0x12d
 <c027c6ee> dm_table_add_target+0x11e/0x18b  <c027e5b4> populate_table+0x8a/0xaf
 <c027e62b> table_load+0x52/0xf9  <c027ec23> ctl_ioctl+0xca/0xfc
 <c027e5d9> table_load+0x0/0xf9  <c0152146> do_ioctl+0x3e/0x43
 <c0152360> vfs_ioctl+0x16c/0x178  <c01523b4> sys_ioctl+0x48/0x60
 <c01029b3> syscall_call+0x7/0xb
Code: 97 f0 00 00 00 89 c1 83 c9 01 80 e2 01 0f 44 c1 88 43 14 8b 04 24 59 5b 5e 5f 5d c3 53 89 c1 89 d3 ff 4a 08 0f 94 c0 84 c0 74 2a <8b> 01 8b 10 89 d8 e8 f6 fb ff ff 8b 03 8b 53 04 89 50 04 89 02
EIP: [<c027c3d2>] dm_put_device+0xf/0x3b SS:ESP 0068:c5db3e78

Signed-off-by: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-08-14 12:54:29 -07:00
NeilBrown
f9abd1ace4 [PATCH] md: Fix a bug that recently crept into md/linear
A recent patch that allowed linear arrays to be reconfigured on-line
allowed in a bug which results in divide by zero - not all
mddev->array_size were converted to conf->array_size.

This patch finished the conversion and fixed the bug.

The offending patch was commit 7c7546ccf6.

Thanks to Simon Kirby <sim@netnation.com> for the bug report.

Cc: Simon Kirby <sim@netnation.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06 08:57:46 -07:00