Commit graph

946 commits

Author SHA1 Message Date
FUJITA Tomonori
9f5de6b105 [SCSI] bsg: add large command support
This enables bsg to handle the request length larger than BLK_MAX_CDB
(mainly for the variable length CDB format).

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2008-05-02 10:17:35 -05:00
Harvey Harrison
24c03d47d0 block: remove remaining __FUNCTION__ occurrences
__FUNCTION__ is gcc specific, use __func__

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-01 08:04:02 -07:00
Peter Zijlstra
cf0ca9fe5d mm: bdi: export BDI attributes in sysfs
Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info object.
This allows us to see and set the various BDI specific variables.

In particular this properly exposes the read-ahead window for all relevant
users and /sys/block/<block>/queue/read_ahead_kb should be deprecated.

With patient help from Kay Sievers and Greg KH

[mszeredi@suse.cz]

 - split off NFS and FUSE changes into separate patches
 - document new sysfs attributes under Documentation/ABI
 - do bdi_class_init as a core_initcall, otherwise the "default" BDI
   won't be initialized
 - remove bdi_init_fmt macro, it's not used very much

[akpm@linux-foundation.org: fix ia64 warning]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Acked-by: Greg KH <greg@kroah.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:49 -07:00
Alan D. Brunelle
ac9fafa124 block: Skip I/O merges when disabled
The block I/O + elevator + I/O scheduler code spend a lot of time trying
to merge I/Os -- rightfully so under "normal" circumstances. However,
if one were to know that the incoming I/O stream was /very/ random in
nature, the cycles are wasted.

This patch adds a per-request_queue tunable that (when set) disables
merge attempts (beyond the simple one-hit cache check), thus freeing up
a non-trivial amount of CPU cycles.

Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-29 14:48:55 +02:00
FUJITA Tomonori
d7e3c3249e block: add large command support
This patch changes rq->cmd from the static array to a pointer to
support large commands.

We rarely handle large commands. So for optimization, a struct request
still has a static array for a command. rq_init sets rq->cmd pointer
to the static array.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-29 14:48:55 +02:00
FUJITA Tomonori
d34c87e4ba block: replace sizeof(rq->cmd) with BLK_MAX_CDB
This is a preparation for changing rq->cmd from the static array to a
pointer.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-29 14:48:55 +02:00
FUJITA Tomonori
2a4aa30c5f block: rename and export rq_init()
This rename rq_init() blk_rq_init() and export it. Any path that hands
the request to the block layer needs to call it to initialize the
request.

This is a preparation for large command support, which needs to
initialize the request in a proper way (that is, just doing a memset()
will not work).

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-29 14:48:55 +02:00
FUJITA Tomonori
992b5bceee block: no need to initialize rq->cmd with blk_get_request
blk_get_request initializes rq->cmd (rq_init does) so the users don't
need to do that.

The purpose of this patch is to remove sizeof(rq->cmd) and &rq->cmd,
as a preparation for large command support, which changes rq->cmd from
the static array to a pointer. sizeof(rq->cmd) will not make sense and
&rq->cmd won't work.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-29 14:48:55 +02:00
Adrian Bunk
6f6a036e6e block/blk-barrier.c:blk_ordered_cur_seq() mustn't be inline
This patch fixes the following build error with UML and gcc 4.3:

<--  snip  -->

...
  CC      block/blk-barrier.o
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/blk-barrier.c: In function ‘blk_do_ordered’:
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/blk-barrier.c:57: sorry, unimplemented: inlining failed in call to ‘blk_ordered_cur_seq’: function body not available
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/blk-barrier.c:252: sorry, unimplemented: called from here
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/blk-barrier.c:57: sorry, unimplemented: inlining failed in call to ‘blk_ordered_cur_seq’: function body not available
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/blk-barrier.c:253: sorry, unimplemented: called from here
make[2]: *** [block/blk-barrier.o] Error 1

<--  snip  -->

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-29 14:48:54 +02:00
Adrian Bunk
72ed0bf60a block/elevator.c:elv_rq_merge_ok() mustn't be inline
This patch fixes the following build error with UML and gcc 4.3:

<--  snip  -->

...
  CC      block/elevator.o
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/elevator.c: In function ‘elv_merge’:
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/elevator.c:73: sorry, unimplemented: inlining failed in call to ‘elv_rq_merge_ok’: function body not available
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/elevator.c:103: sorry, unimplemented: called from here
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/elevator.c:73: sorry, unimplemented: inlining failed in call to ‘elv_rq_merge_ok’: function body not available
/home/bunk/linux/kernel-2.6/git/linux-2.6/block/elevator.c:495: sorry, unimplemented: called from here
make[2]: *** [block/elevator.o] Error 1
make[1]: *** [block] Error 2

<--  snip  -->

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-29 14:48:54 +02:00
Nick Piggin
75ad23bc0f block: make queue flags non-atomic
We can save some atomic ops in the IO path, if we clearly define
the rules of how to modify the queue flags.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-29 14:48:33 +02:00
FUJITA Tomonori
68154e90c9 block: add dma alignment and padding support to blk_rq_map_kern
This patch adds bio_copy_kern similar to
bio_copy_user. blk_rq_map_kern uses bio_copy_kern instead of
bio_map_kern if necessary.

bio_copy_kern uses temporary pages and the bi_end_io callback frees
these pages. bio_copy_kern saves the original kernel buffer at
bio->bi_private it doesn't use something like struct bio_map_data to
store the information about the caller.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-29 09:50:34 +02:00
Adrian Bunk
657e93be35 unexport blk_max_pfn
blk_max_pfn can now be unexported.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-29 09:50:34 +02:00
FUJITA Tomonori
1afb20f301 block: make rq_init() do a full memset()
This requires moving rq_init() from get_request() to blk_alloc_request().
The upside is that we can now require an rq_init() from any path that
wishes to hand the request to the block layer.

rq_init() will be exported for the code that uses struct request
without blk_get_request.

This is a preparation for large command support, which needs to
initialize struct request in a proper way (that is, just doing a
memset() will not work).

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-29 09:50:34 +02:00
FUJITA Tomonori
97f46ae45c [SCSI] bsg: add release callback support
This patch adds release callback support, which is called when a bsg
device goes away. bsg_register_queue() takes a pointer to a callback
function. This feature is useful for stuff like sas_host that can't
use the release callback in struct device.

If a caller doesn't need bsg's release callback, it can call
bsg_register_queue() with NULL pointer (e.g. scsi devices can use
release callback in struct device so they don't need bsg's callback).

With this patch, bsg uses kref for refcounts on bsg devices instead of
get/put_device in fops->open/release. bsg calls put_device and the
caller's release callback (if it was registered) in kref_put's
release.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2008-04-22 15:16:32 -05:00
Linus Torvalds
548453fd10 Merge branch 'for-2.6.26' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.26' of git://git.kernel.dk/linux-2.6-block:
  block: fix blk_register_queue() return value
  block: fix memory hotplug and bouncing in block layer
  block: replace remaining __FUNCTION__ occurrences
  Kconfig: clean up block/Kconfig help descriptions
  cciss: fix warning oops on rmmod of driver
  cciss: Fix race between disk-adding code and interrupt handler
  block: move the padding adjustment to blk_rq_map_sg
  block: add bio_copy_user_iov support to blk_rq_map_user_iov
  block: convert bio_copy_user to bio_copy_user_iov
  loop: manage partitions in disk image
  cdrom: use kmalloced buffers instead of buffers on stack
  cdrom: make unregister_cdrom() return void
  cdrom: use list_head for cdrom_device_info list
  cdrom: protect cdrom_device_info list by mutex
  cdrom: cleanup hardcoded error-code
  cdrom: remove ifdef CONFIG_SYSCTL
2008-04-21 16:03:40 -07:00
Akinobu Mita
fb19974630 block: fix blk_register_queue() return value
blk_register_queue() returns -ENXIO when queue->request_fn is NULL.  But there
are some block drivers that call blk_register_queue() via add_disk() with
queue->request_fn == NULL.  (For example, brd, loop)

Although no one checks return value of blk_register_queue(), this patch makes
it return 0 instead of -ENXIO when queue->request_fn is NULL,

Also this patch adds warning when blk_register_queue() and
blk_unregister_queue() are called with queue == NULL rather than ignore
invalid usage silently.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-21 09:51:06 +02:00
Nick Andrew
ee86418d39 Kconfig: clean up block/Kconfig help descriptions
Modify the help descriptions of block/Kconfig for clarity, accuracy and
consistency.

Refactor the BLOCK description a bit.  The wording "This permits ...  to be
removed" isn't quite right; the block layer is removed when the option is
disabled, whereas most descriptions talk about what happens when the option is
enabled.  Reformat the list of what is affected by disabling the block layer.

Add more examples of large block devices to LBD and strive for technical
accuracy; block devices of size _exactly_ 2TB require CONFIG_LBD, not only
"bigger than 2TB".  Also try to say (perhaps not very clearly) that the config
option is only needed when you want to have individual block devices of size
>= 2TB, for example if you had 3 x 1TB disks in your computer you'd have a
total storage size of 3TB but you wouldn't need the option unless you want to
aggregate those disks into a RAID or LVM.

Improve terminology and grammar on BLK_DEV_IO_TRACE.

I also added the boilerplate "If unsure, say N" to most options.

Precisely say "2TB and larger" for LSF.

Indent the help text for BLK_DEV_BSG by 2 spaces in accordance with the
standard.

Signed-off-by: Nick Andrew <nick@nick-andrew.net>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-21 09:51:04 +02:00
FUJITA Tomonori
f18573abcc block: move the padding adjustment to blk_rq_map_sg
blk_rq_map_user adjusts bi_size of the last bio. It breaks the rule
that req->data_len (the true data length) is equal to sum(bio). It
broke the scsi command completion code.

commit e97a294ef6 was introduced to fix
the above issue. However, the partial completion code doesn't work
with it. The commit is also a layer violation (scsi mid-layer should
not know about the block layer's padding).

This patch moves the padding adjustment to blk_rq_map_sg (suggested by
James). The padding works like the drain buffer. This patch breaks the
rule that req->data_len is equal to sum(sg), however, the drain buffer
already broke it. So this patch just restores the rule that
req->data_len is equal to sub(bio) without breaking anything new.

Now when a low level driver needs padding, blk_rq_map_user and
blk_rq_map_user_iov guarantee there's enough room for padding.
blk_rq_map_sg can safely extend the last entry of a scatter list.

blk_rq_map_sg must extend the last entry of a scatter list only for a
request that got through bio_copy_user_iov. This patches introduces
new REQ_COPY_USER flag.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-21 09:50:08 +02:00
FUJITA Tomonori
afdc1a780e block: add bio_copy_user_iov support to blk_rq_map_user_iov
With this patch, blk_rq_map_user_iov uses bio_copy_user_iov when a low
level driver needs padding or a buffer in sg_iovec isn't aligned. That
is, it uses temporary kernel buffers instead of mapping user pages
directly.

When a LLD needs padding, later blk_rq_map_sg needs to extend the last
entry of a scatter list. bio_copy_user_iov guarantees that there is
enough space for padding by using temporary kernel buffers instead of
user pages.

blk_rq_map_user_iov needs buffers in sg_iovec to be aligned. The
comment in blk_rq_map_user_iov indicates that drivers/scsi/sg.c also
needs buffers in sg_iovec to be aligned. Actually, drivers/scsi/sg.c
works with unaligned buffers in sg_iovec (it always uses temporary
kernel buffers).

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-21 09:50:08 +02:00
Tony Jones
ee959b00c3 SCSI: convert struct class_device to struct device
It's big, but there doesn't seem to be a way to split it up smaller...

Signed-off-by: Tony Jones <tonyj@suse.de>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-04-19 19:10:33 -07:00
Linus Torvalds
2cca775bae Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (137 commits)
  [SCSI] iscsi: bidi support for iscsi_tcp
  [SCSI] iscsi: bidi support at the generic libiscsi level
  [SCSI] iscsi: extended cdb support
  [SCSI] zfcp: Fix error handling for blocked unit for send FCP command
  [SCSI] zfcp: Remove zfcp_erp_wait from slave destory handler to fix deadlock
  [SCSI] zfcp: fix 31 bit compile warnings
  [SCSI] bsg: no need to set BSG_F_BLOCK bit in bsg_complete_all_commands
  [SCSI] bsg: remove minor in struct bsg_device
  [SCSI] bsg: use better helper list functions
  [SCSI] bsg: replace kobject_get with blk_get_queue
  [SCSI] bsg: takes a ref to struct device in fops->open
  [SCSI] qla1280: remove version check
  [SCSI] libsas: fix endianness bug in sas_ata
  [SCSI] zfcp: fix compiler warning caused by poking inside new semaphore (linux-next)
  [SCSI] aacraid: Do not describe check_reset parameter with its value
  [SCSI] aacraid: Fix down_interruptible() to check the return value
  [SCSI] sun3_scsi_vme: add MODULE_LICENSE
  [SCSI] st: rename flush_write_buffer()
  [SCSI] tgt: use KMEM_CACHE macro
  [SCSI] initio: fix big endian problems for auto request sense
  ...
2008-04-18 11:25:31 -07:00
FUJITA Tomonori
99773aab03 [SCSI] bsg: no need to set BSG_F_BLOCK bit in bsg_complete_all_commands
Before bsg_complete_all_commands is called, BSG_F_BLOCK bit is always
set.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2008-04-18 11:48:43 -05:00
FUJITA Tomonori
842ea771c3 [SCSI] bsg: remove minor in struct bsg_device
minor in struct bsg_device is used as identifier to find the
corresponding struct bsg_device_class. However, request_queuse can be
used as identifier for that and the minor in struct bsg_device is
unnecessary.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2008-04-18 11:48:26 -05:00
FUJITA Tomonori
43ac9e62c4 [SCSI] bsg: use better helper list functions
This replace hlist_for_each and list_entry with hlist_for_each_entry
and list_first_entry respectively.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2008-04-18 11:48:08 -05:00
FUJITA Tomonori
c3ff1b90d8 [SCSI] bsg: replace kobject_get with blk_get_queue
Both takes a ref to a queue. But blk_get_queue checks QUEUE_FLAG_DEAD
and is more appropriate interface here.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2008-04-18 11:47:49 -05:00
FUJITA Tomonori
d45ac4fa8f [SCSI] bsg: takes a ref to struct device in fops->open
bsg_register_queue() takes a ref to struct device that a caller
passes. For example, bsg takes a ref to the sdev_gendev for scsi
devices. However, bsg doesn't inrease the refcount in fops->open. So
while an application opens a bsg device, the scsi device that the bsg
device holds can go away (bsg also takes a ref to a queue, but it
doesn't prevent the device from going away).

With this patch, bsg increases the refcount of struct device in
fops->open and decreases it in fops->release.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2008-04-18 11:47:19 -05:00
Bartlomiej Zolnierkiewicz
93de00fd1c ide: remove broken/dangerous HDIO_[UNREGISTER,SCAN]_HWIF ioctls (take 3)
hdparm explicitely marks HDIO_[UNREGISTER,SCAN]_HWIF ioctls as DANGEROUS
and given the number of bugs we can assume that there are no real users:

* DMA has no chance of working because DMA resources are released by
  ide_unregister() and they are never allocated again.

* Since ide_init_hwif_ports() is used for ->io_ports[] setup the ioctls
  don't work for almost all hosts with "non-standard" (== non ISA-like)
  layout of IDE taskfile registers (there is a lot of such host drivers).

* ide_port_init_devices() is not called when probing IDE devices so:
  - drive->autotune is never set and IDE host/devices are not programmed
    for the correct PIO/DMA transfer modes (=> possible data corruption)
  - host specific I/O 32-bit and IRQ unmasking settings are not applied
    (=> possible data corruption)
  - host specific ->port_init_devs method is not called (=> no luck with
    ht6560b, qd65xx and opti621 host drivers)

* ->rw_disk method is not preserved (=> no HPT3xxN chipsets support).

* ->serialized flag is not preserved (=> possible data corruption when
   using icside, aec62xx (ATP850UF chipset), cmd640, cs5530, hpt366
   (HPT3xxN chipsets), rz1000, sc1200, dtc2278 and ht6560b host drivers).

* ->ack_intr method is not preserved (=> needed by ide-cris, buddha,
  gayle and macide host drivers).

* ->sata_scr[] and sata_misc[] is cleared by ide_unregister() and it
  isn't initialized again (SiI3112 support needs them).

* To issue an ioctl() there need to be at least one IDE device present
  in the system.

* ->cable_detect method is not preserved + it is not called when probing
  IDE devices so cable detection is broken (however since DMA support is
  also broken it doesn't really matter ;-).

* Some objects which may have already been freed in ide_unregister()
  are restored by ide_hwif_restore() (i.e. ->hwgroup).

* ide_register_hw() may unregister unrelated IDE ports if free ide_hwifs[]
  slot cannot be found.

* When IDE host drivers are modular unregistered port may be re-used by
  different host driver that owned it first causing subtle bugs.

Since we now have a proper warm-plug support remove these ioctls,
then remove no longer needed:
- ide_register_hw() and ide_hwif_restore() functions
- 'init_default' and 'restore' arguments of ide_unregister()
- zeroeing of hwif->{dma,extra}_* fields in ide_unregister()

As an added bonus IDE core code size shrinks by ~3kB (x86-32).

v2:
* fix ide_unregister() arguments in cleanup_module() (Andrew Morton).

v3:
* fix ide_unregister() arguments in palm_bk3710.c.

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-04-18 00:46:24 +02:00
Jens Axboe
75ce6faccd block: update git url for blktrace
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-15 10:23:35 +02:00
Fabio Checconi
4faa3c8150 cfq-iosched: do not leak ioc_data across iosched switches
When switching scheduler from cfq, cfq_exit_queue() does not clear
ioc->ioc_data, leaving a dangling pointer that can deceive the following
lookups when the iosched is switched back to cfq.  The pattern that can
trigger that is the following:

    - elevator switch from cfq to something else;
    - module unloading, with elv_unregister() that calls cfq_free_io_context()
      on ioc freeing the cic (via the .trim op);
    - module gets reloaded and the elevator switches back to cfq;
    - reallocation of a cic at the same address as before (with a valid key).

To fix it just assign NULL to ioc_data in __cfq_exit_single_io_context(),
that is called from the regular exit path and from the elevator switching
code.  The only path that frees a cic and is not covered is the error handling
one, but cic's freed in this way are never cached in ioc_data.

Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-10 08:28:01 +02:00
Fabio Checconi
34e6bbf23c cfq-iosched: fix rcu freeing of cfq io contexts
SLAB_DESTROY_BY_RCU is not a direct substitute for normal call_rcu()
freeing, since it'll page freeing but NOT object freeing. So change
cfq to do the freeing on its own.

Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-02 15:42:20 +02:00
Andrea Arcangeli
00d61e3e8c Fix bounce setting for 64-bit
Looking a bit closer into this regression the reason this can't be
right is that dma_addr common default is BLK_BOUNCE_HIGH and most
machines have less than 4G. So if you do:

    if (b_pfn <= (min_t(u64, 0xffffffff, BLK_BOUNCE_HIGH) >> PAGE_SHIFT))
	dma = 1

that will translate to:

     if (BLK_BOUNCE_HIGH <= BLK_BOUNCE_HIGH)
     	dma = 1

So for 99% of hardware this will trigger unnecessary GFP_DMA
allocations and isa pooling operations.

Also note how the 32bit code still does b_pfn < blk_max_low_pfn.

I guess this is what you were looking after. I didn't verify but as
far as I can tell, this will stop the regression with isa dma
operations at boot for 99% of blkdev/memory combinations out there and
I guess this fixes the setups with >4G of ram and 32bit pci cards as
well (this also retains symmetry with the 32bit code).

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-02 09:06:44 +02:00
Roland McGrath
ee27a558ae genhd must_check warning fix
Fixes:

	block/genhd.c:361: warning: ignoring return value of ‘class_register’, declared with attribute warn_unused_result

Signed-off-by: Roland McGrath <roland@redhat.com>
Acked-by: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-12 12:34:37 -07:00
Jens Axboe
cc66b4512c block: fix blkdev_issue_flush() not detecting and passing EOPNOTSUPP back
This is important to eg dm, that tries to decide whether to stop using
barriers or not.

Tested as working by Anders Henke <anders.henke@1und1.de>

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-03-04 11:47:46 +01:00
Harvey Harrison
56d94a37f6 block: fix shadowed variable warning in blk-map.c
Introduced between 2.6.25-rc2 and -rc3
block/blk-map.c:154:14: warning: symbol 'bio' shadows an earlier one
block/blk-map.c:110:13: originally declared here

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-03-04 11:31:22 +01:00
Harvey Harrison
448da4d262 block: remove extern on function definition
Intoduced between 2.6.25-rc2 and -rc3
block/blk-settings.c:319:12: warning: function 'blk_queue_dma_drain' with external linkage has definition

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-03-04 11:30:18 +01:00
Adrian Bunk
bec419404a unexport blk_rq_map_user_iov
This patch removes the unused export of blk_rq_map_user_iov.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-03-04 11:28:34 +01:00
Adrian Bunk
9d7f1e6b9b unexport blk_{get,put}_queue
This patch removes the unused exports of blk_{get,put}_queue.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-03-04 11:28:32 +01:00
Adrian Bunk
1826eadfc4 block/genhd.c: cleanups
This patch contains the following cleanups:
- make the needlessly global struct disk_type static
- #if 0 the unused genhd_media_change_notify()

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-03-04 11:28:31 +01:00
Adrian Bunk
ff88972c85 proper prototype for blk_dev_init()
This patch adds a proper prototye for blk_dev_init() in block/blk.h

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-03-04 11:28:29 +01:00
Adrian Bunk
278caf0120 block/blk-tag.c should #include "blk.h"
Every file should include the headers containing the externs for its
global functions (in this case for __blk_queue_free_tags()).

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-03-04 11:28:24 +01:00
Yang Shi
419c434c35 Fix DMA access of block device in 64-bit kernel on some non-x86 systems with 4GB or upper 4GB memory
For some non-x86 systems with 4GB or upper 4GB memory,
we need increase the range of addresses that can be
used for direct DMA in 64-bit kernel.

Signed-off-by: Yang Shi <yang.shi@windriver.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-03-04 11:20:51 +01:00
Tejun Heo
e3790c7d42 block: separate out padding from alignment
Block layer alignment was used for two different purposes - memory
alignment and padding.  This causes problems in lower layers because
drivers which only require memory alignment ends up with adjusted
rq->data_len.  Separate out padding such that padding occurs iff
driver explicitly requests it.

Tomo: restorethe code to update bio in blk_rq_map_user
      introduced by the commit 40b01b9bbd
      according to padding alignment.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-03-04 11:18:17 +01:00
FUJITA Tomonori
7a85f8896f block: restore the meaning of rq->data_len to the true data length
The meaning of rq->data_len was changed to the length of an allocated
buffer from the true data length. It breaks SG_IO friends and
bsg. This patch restores the meaning of rq->data_len to the true data
length and adds rq->extra_len to store an extended length (due to
drain buffer and padding).

This patch also removes the code to update bio in blk_rq_map_user
introduced by the commit 40b01b9bbd.
The commit adjusts bio according to memory alignment
(queue_dma_alignment). However, memory alignment is NOT padding
alignment. This adjustment also breaks SG_IO friends and bsg. Padding
alignment needs to be fixed in a proper way (by a separate patch).

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <axboe@carl.home.kernel.dk>
2008-03-04 11:17:11 +01:00
Randy Dunlap
5d87a052c7 block: fix kernel-docbook parameters and files
kernel-doc for block/:
- add missing parameters
- fix one function's parameter list (remove blank line)
- add 2 source files to docbook for non-exported kernel-doc functions

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-03-04 11:14:39 +01:00
Tejun Heo
db0a2e0099 block: clear drain buffer if draining for write command
Clear drain buffer before chaining if the command in question is a
write.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-02-19 11:36:55 +01:00
Tejun Heo
2fb98e8414 block: implement request_queue->dma_drain_needed
Draining shouldn't be done for commands where overflow may indicate
data integrity issues.  Add dma_drain_needed callback to
request_queue.  Drain buffer is appened iff this function returns
non-zero.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-02-19 11:36:53 +01:00
Tejun Heo
6b00769fe1 block: add request->raw_data_len
With padding and draining moved into it, block layer now may extend
requests as directed by queue parameters, so now a request has two
sizes - the original request size and the extended size which matches
the size of area pointed to by bios and later by sgs.  The latter size
is what lower layers are primarily interested in when allocating,
filling up DMA tables and setting up the controller.

Both padding and draining extend the data area to accomodate
controller characteristics.  As any controller which speaks SCSI can
handle underflows, feeding larger data area is safe.

So, this patch makes the primary data length field, request->data_len,
indicate the size of full data area and add a separate length field,
request->raw_data_len, for the unmodified request size.  The latter is
used to report to higher layer (userland) and where the original
request size should be fed to the controller or device.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-02-19 11:36:35 +01:00
Tejun Heo
40b01b9bbd block: update bio according to DMA alignment padding
DMA start address and transfer size alignment for PC requests are
achieved using bio_copy_user() instead of bio_map_user().  This works
because bio_copy_user() always uses full pages and block DMA alignment
isn't allowed to go over PAGE_SIZE.

However, the implementation didn't update the last bio of the request
to make this padding visible to lower layers.  This patch makes
blk_rq_map_user() extend the last bio such that it includes the
padding area and the size of area pointed to by the request is
properly aligned.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-02-19 11:35:38 +01:00
Jens Axboe
e164094964 elevator: make elevator_get() attempt to load the appropriate module
Currently we fail if someone requests a valid io scheduler, but it's
modular and not currently loaded. That can happen from a driver init
asking for a different scheduler, or online switching through sysfs
as requested by a user.

This patch makes elevator_get() request_module() to attempt to load
the appropriate module, instead of requiring that done manually.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-02-19 10:20:37 +01:00
Jens Axboe
ffc4e75957 cfq-iosched: add hlist for browsing parallel to the radix tree
It's cumbersome to browse a radix tree from start to finish, especially
since we modify keys when a process exits. So add a hlist for the single
purpose of browsing over all known cfq_io_contexts, used for exit,
io prio change, etc.

This fixes http://bugzilla.kernel.org/show_bug.cgi?id=9948

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-02-19 10:04:00 +01:00
Jens Axboe
84e9e03c55 block: make blk_rq_map_user() clear ->bio if it unmaps it
That way the interface is symmetric, and calling blk_rq_unmap_user()
on the request wont oops.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-02-19 10:04:00 +01:00
Adrian Bunk
52ff4cae65 make blk_settings_init() static
blk_settings_init() can become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <axboe@carl.home.kernel.dk>
2008-02-19 10:04:00 +01:00
Adrian Bunk
1334159826 make blk_ioc_init() static
blk_ioc_init() can become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <axboe@carl.home.kernel.dk>
2008-02-19 10:04:00 +01:00
Adrian Bunk
5ece6c52ea make blk-core.c:request_cachep static again
request_cachep needlessly became global.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <axboe@carl.home.kernel.dk>
2008-02-19 10:04:00 +01:00
Jerome Marchand
c3c930d933 Enhanced partition statistics: remove old partition statistics
Removes the now unused old partition statistic code.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-02-08 12:42:01 +01:00
Jerome Marchand
28f39d553e Enhanced partition statistics: procfs
Reports enhanced partition statistics in /proc/diskstats.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2008-02-08 12:42:00 +01:00
Jerome Marchand
6f2576af5b Enhanced partition statistics: update partition statitics
Updates the enhanced partition statistics in generic block layer
besides the disk statistics.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-02-08 12:41:56 +01:00
Jens Axboe
63a7138671 block: fixup rq_init() a bit
Rearrange fields in cache order and initialize some fields that
we didn't previously init. Remove init of ->completion_data, it's
part of a union with ->hash. Luckily clearing the rb node is the same
as setting it to null!

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-02-08 12:41:03 +01:00
Jens Axboe
3bc217ffe6 block: kill swap_io_context()
It blindly copies everything in the io_context, including the lock.
That doesn't work so well for either lock ordering or lockdep.

There seems zero point in swapping io contexts on a request to request
merge, so the best point of action is to just remove it.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-02-01 11:34:49 +01:00
Jens Axboe
8bdd3f8a69 as-iosched: fix inconsistent ioc->lock context
Since it's acquired from irq context, all locking must be of the
irq safe variant. Most are already inside the queue lock (which
already disables interrupts), but the io scheduler rmmod path
always has irqs enabled and the put_io_context() path may legally
be called with irqs enabled (even if it isn't usually). So fixup
those two.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-02-01 09:44:28 +01:00
Jens Axboe
4eb166d987 block: make elevator lib checkpatch compliant
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-02-01 09:26:33 +01:00
Jens Axboe
fe094d98e7 cfq-iosched: make checkpatch compliant
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-02-01 09:26:33 +01:00
Jens Axboe
6728cb0e63 block: make core bits checkpatch compliant
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-02-01 09:26:33 +01:00
Jens Axboe
22b132102f block: new end request handling interface should take unsigned byte counts
No point in passing signed integers as the byte count, they can never
be negative.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-02-01 09:26:33 +01:00
James Bottomley
40f620286d [SCSI] bsg: copy the cmd_type field to the subordinate request for bidi
This fixes a problem in SCSI where we use the (previously
uninitialised) cmd_type via blk_pc_request() to set up the transfer in
scsi_init_sgtable().

Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2008-01-30 13:14:26 -06:00
Jens Axboe
149a051f82 as-iosched: fix double locking bug in as_merged_requests()
If the two requests belong to the same io context, we will attempt
to lock the same lock twice. But swapping contexts is pointless in
that case, so just check for rioc == nioc before doing the double
lock and copy.

Tested-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-30 09:11:10 +01:00
Jan Engelhardt
12f32bb317 block: constify function pointer tables
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-29 21:55:19 +01:00
Martin K. Petersen
e68b903c6b Expose hardware sector size
Expose hardware sector size in sysfs queue directory.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-29 21:55:17 +01:00
Jens Axboe
d6d4819696 block: ll_rw_blk.c split, add blk-merge.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-29 21:55:12 +01:00
Jens Axboe
db1d08c646 block: remove dated (and wrong) comment in blk-core.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-29 21:55:11 +01:00
Jens Axboe
26b8256e2b block: get rid of unnecessary forward declarations in blk-core.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-29 21:55:09 +01:00
Jens Axboe
86db1e2977 block: continue ll_rw_blk.c splitup
Adds files for barrier handling, rq execution, io context handling,
mapping data to requests, and queue settings.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-29 21:55:08 +01:00
Jens Axboe
8324aa91d1 block: split tag and sysfs handling from blk-core.c
Seperates the tag and sysfs handling from ll_rw_blk.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-29 21:55:07 +01:00
Jens Axboe
a168ee84c9 block: first step of splitting ll_rw_blk, rename it
Then we retain history in blk-core.c

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-29 21:55:05 +01:00
Linus Torvalds
8d01eddf29 Merge branch 'for-2.6.25' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.25' of git://git.kernel.dk/linux-2.6-block:
  block: implement drain buffers
  __bio_clone: don't calculate hw/phys segment counts
  block: allow queue dma_alignment of zero
  blktrace: Add blktrace ioctls to SCSI generic devices
2008-01-29 08:51:56 +11:00
Linus Torvalds
f0f0052069 Merge branch 'blk-end-request' of git://git.kernel.dk/linux-2.6-block
* 'blk-end-request' of git://git.kernel.dk/linux-2.6-block: (30 commits)
  blk_end_request: changing xsysace (take 4)
  blk_end_request: changing ub (take 4)
  blk_end_request: cleanup of request completion (take 4)
  blk_end_request: cleanup 'uptodate' related code (take 4)
  blk_end_request: remove/unexport end_that_request_* (take 4)
  blk_end_request: changing scsi (take 4)
  blk_end_request: add bidi completion interface (take 4)
  blk_end_request: changing ide-cd (take 4)
  blk_end_request: add callback feature (take 4)
  blk_end_request: changing ide normal caller (take 4)
  blk_end_request: changing cpqarray (take 4)
  blk_end_request: changing cciss (take 4)
  blk_end_request: changing ide-scsi (take 4)
  blk_end_request: changing s390 (take 4)
  blk_end_request: changing mmc (take 4)
  blk_end_request: changing i2o_block (take 4)
  blk_end_request: changing viocd (take 4)
  blk_end_request: changing xen-blkfront (take 4)
  blk_end_request: changing viodasd (take 4)
  blk_end_request: changing sx8 (take 4)
  ...
2008-01-29 08:51:32 +11:00
Jens Axboe
febffd6181 cfq-iosched: kill some big inlines
Use of inlines were a bit over the top, trim them down a bit.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-28 13:19:43 +01:00
Jens Axboe
0871714e08 cfq-iosched: relax IOPRIO_CLASS_IDLE restrictions
Currently you must be root to set idle io prio class on a process. This
is due to the fact that the idle class is implemented as a true idle
class, meaning that it will not make progress if someone else is
requesting disk access. Unfortunately this means that it opens DOS
opportunities by locking down file system resources, hence it is root
only at the moment.

This patch relaxes the idle class a little, by removing the truly idle
part (which entals a grace period with associated timer). The
modifications make the idle class as close to zero impact as can be done
while still guarenteeing progress. This means we can relax the root only
criteria as well.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-28 11:38:15 +01:00
James Bottomley
fa0ccd837e block: implement drain buffers
These DMA drain buffer implementations in drivers are pretty horrible
to do in terms of manipulating the scatterlist.  Plus they're being
done at least in drivers/ide and drivers/ata, so we now have code
duplication.

The one use case for this, as I understand it is AHCI controllers doing
PIO mode to mmc devices but translating this to DMA at the controller
level.

So, what about adding a callback to the block layer that permits the
adding of the drain buffer for the problem devices.  The idea is that
you'd do this in slave_configure after you find one of these devices.

The beauty of doing it in the block layer is that it quietly adds the
drain buffer to the end of the sg list, so it automatically gets mapped
(and unmapped) without anything unusual having to be done to the
scatterlist in driver/scsi or drivers/ata and without any alteration to
the transfer length.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-28 10:54:11 +01:00
Jens Axboe
521f3bbdba io_context sharing - anticipatory changes
changes to anticipatory io scheduler for io_context sharing

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-28 10:50:35 +01:00
Jens Axboe
4ac845a2e9 block: cfq: make the io contect sharing lockless
The io context sharing introduced a per-ioc spinlock, that would protect
the cfq io context lookup. That is a regression from the original, since
we never needed any locking there because the ioc/cic were process private.

The cic lookup is changed from an rbtree construct to a radix tree, which
we can then use RCU to make the reader side lockless. That is the performance
critical path, modifying the radix tree is only done on process creation
(when that process first does IO, actually) and on process exit (if that
process has done IO).

As it so happens, radix trees are also much faster for this type of
lookup where the key is a pointer. It's a very sparse tree.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-28 10:50:33 +01:00
Nikanth Karthikesan
66dac98ed0 io_context sharing - cfq changes
changes in the cfq for io_context sharing

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-28 10:50:32 +01:00
Jens Axboe
d38ecf935f io context sharing: preliminary support
Detach task state from ioc, instead keep track of how many processes
are accessing the ioc.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-28 10:50:31 +01:00
Jens Axboe
fd0928df98 ioprio: move io priority from task_struct to io_context
This is where it belongs and then it doesn't take up space for a
process that doesn't do IO.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-28 10:50:29 +01:00
Kiyoshi Ueda
b8286239dd blk_end_request: cleanup of request completion (take 4)
This patch merges complete_request() into end_that_request_last()
for cleanup.

complete_request() was introduced by earlier part of this patch-set,
not to break the existing users of end_that_request_last().

Since all users are converted to blk_end_request interfaces and
end_that_request_last() is no longer exported, the code can be
merged to end_that_request_last().

Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-28 10:37:15 +01:00
Kiyoshi Ueda
5450d3e1d6 blk_end_request: cleanup 'uptodate' related code (take 4)
This patch converts 'uptodate' arguments of no longer exported
interfaces, end_that_request_first/last, to 'error', and removes
internal conversions for it in blk_end_request interfaces.

Also, this patch removes no longer needed end_io_error().

Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-28 10:37:13 +01:00
Kiyoshi Ueda
3bcddeac1c blk_end_request: remove/unexport end_that_request_* (take 4)
This patch removes the following functions:
  o end_that_request_first()
  o end_that_request_chunk()
and stops exporting the functions below:
  o end_that_request_last()

Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-28 10:37:12 +01:00
Kiyoshi Ueda
e3a04fe34a blk_end_request: add bidi completion interface (take 4)
This patch adds a variant of the interface, blk_end_bidi_request(),
which completes a bidi request.

Bidi request must be completed as a whole, both rq and rq->next_rq
at once.  So the interface has 2 arguments for completion size.

As for ->end_io, only rq->end_io is called (rq->next_rq->end_io is not
called).  So if special completion handling is needed, the handler
must be set to rq->end_io.
And the handler must take care of freeing next_rq too, since
the interface doesn't care of it if rq->end_io is not NULL.

Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-28 10:37:08 +01:00
Kiyoshi Ueda
e19a3ab058 blk_end_request: add callback feature (take 4)
This patch adds a variant of the interface, blk_end_request_callback(),
which has driver callback feature.

Drivers may need to do special works between end_that_request_first()
and end_that_request_last().
For such drivers, blk_end_request_callback() allows it to pass
a callback function which is called between end_that_request_first()
and end_that_request_last().

This interface is only for fallback of other blk_end_request interfaces.
Drivers should avoid their tricky behaviors and use other interfaces
as much as possible.

Currently, only one driver, ide-cd, needs this interface.
So this interface should/will be removed, after the driver removes
such tricky behaviors.

o ide-cd (cdrom_newpc_intr())
  In PIO mode, cdrom_newpc_intr() needs to defer end_that_request_last()
  until the device clears DRQ_STAT and raises an interrupt after
  end_that_request_first().
  So end_that_request_first() and end_that_request_last() are called
  separately in cdrom_newpc_intr().

  This means blk_end_request_callback() has to return without
  completing request even if no leftover in the request.
  To satisfy the requirement, callback function has return value
  so that drivers can tell blk_end_request_callback() to return
  without completing request.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-28 10:37:04 +01:00
Kiyoshi Ueda
9e6e39f2c4 blk_end_request: changing block layer core (take 4)
This patch converts core parts of block layer to use blk_end_request
interfaces.  Related 'uptodate' arguments are converted to 'error'.

'dequeue' argument was originally introduced for end_dequeued_request(),
where no attempt should be made to dequeue the request as it's already
dequeued.
However, it's not necessary as it can be checked with
list_empty(&rq->queuelist).
(Dequeued request has empty list and queued request doesn't.)
And it has been done in blk_end_request interfaces.

As a result of this patch, end_queued_request() and
end_dequeued_request() become identical.  A future patch will merge
and rename them and change users of those functions.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-28 10:35:57 +01:00
Kiyoshi Ueda
3b11313a6c blk_end_request: add/export functions to get request size (take 4)
This patch adds/exports functions to get the size of request in bytes.
They are useful because blk_end_request interfaces take bytes
as a completed I/O size instead of sectors.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-28 10:35:56 +01:00
Kiyoshi Ueda
336cdb4003 blk_end_request: add new request completion interface (take 4)
This patch adds 2 new interfaces for request completion:
  o blk_end_request()   : called without queue lock
  o __blk_end_request() : called with queue lock held

blk_end_request takes 'error' as an argument instead of 'uptodate',
which current end_that_request_* take.
The meanings of values are below and the value is used when bio is
completed.
    0 : success
  < 0 : error

Some device drivers call some generic functions below between
end_that_request_{first/chunk} and end_that_request_last().
  o add_disk_randomness()
  o blk_queue_end_tag()
  o blkdev_dequeue_request()
These are called in the blk_end_request interfaces as a part of
generic request completion.
So all device drivers become to call above functions.
To decide whether to call blkdev_dequeue_request(), blk_end_request
uses list_empty(&rq->queuelist) (blk_queued_rq() macro is added for it).
So drivers must re-initialize it using list_init() or so before calling
blk_end_request if drivers use it for its specific purpose.
(Currently, there is no driver which completes request without
 re-initializing the queuelist after used it.  So rq->queuelist
 can be used for the purpose above.)

"Normal" drivers can be converted to use blk_end_request()
in a standard way shown below.

 a) end_that_request_{chunk/first}
    spin_lock_irqsave()
    (add_disk_randomness(), blk_queue_end_tag(), blkdev_dequeue_request())
    end_that_request_last()
    spin_unlock_irqrestore()
    => blk_end_request()

 b) spin_lock_irqsave()
    end_that_request_{chunk/first}
    (add_disk_randomness(), blk_queue_end_tag(), blkdev_dequeue_request())
    end_that_request_last()
    spin_unlock_irqrestore()
    => spin_lock_irqsave()
       __blk_end_request()
       spin_unlock_irqsave()

 c) spin_lock_irqsave()
    (add_disk_randomness(), blk_queue_end_tag(), blkdev_dequeue_request())
    end_that_request_last()
    spin_unlock_irqrestore()
    => blk_end_request()   or   spin_lock_irqsave()
                                __blk_end_request()
                                spin_unlock_irqrestore()

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-28 10:35:53 +01:00
Christof Schmitt
6da127ad09 blktrace: Add blktrace ioctls to SCSI generic devices
Since the SCSI layer uses the request queues from the block layer, blktrace can
also be used to trace the requests to all SCSI devices (like SCSI tape drives),
not only disks. The only missing part is the ioctl interface to start and stop
tracing.

This patch adds the SETUP, START, STOP and TEARDOWN ioctls from blktrace to the
sg device files. With this change, blktrace can be used for SCSI devices like
for disks, e.g.: blktrace -d /dev/sg1 -o - | blkparse -i -

Signed-off-by: Christof Schmitt <christof.schmitt@de.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-28 10:04:46 +01:00
Linus Torvalds
9b73e76f3c Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (200 commits)
  [SCSI] usbstorage: use last_sector_bug flag universally
  [SCSI] libsas: abstract STP task status into a function
  [SCSI] ultrastor: clean up inline asm warnings
  [SCSI] aic7xxx: fix firmware build
  [SCSI] aacraid: fib context lock for management ioctls
  [SCSI] ch: remove forward declarations
  [SCSI] ch: fix device minor number management bug
  [SCSI] ch: handle class_device_create failure properly
  [SCSI] NCR5380: fix section mismatch
  [SCSI] sg: fix /proc/scsi/sg/devices when no SCSI devices
  [SCSI] IB/iSER: add logical unit reset support
  [SCSI] don't use __GFP_DMA for sense buffers if not required
  [SCSI] use dynamically allocated sense buffer
  [SCSI] scsi.h: add macro for enclosure bit of inquiry data
  [SCSI] sd: add fix for devices with last sector access problems
  [SCSI] fix pcmcia compile problem
  [SCSI] aacraid: add Voodoo Lite class of cards.
  [SCSI] aacraid: add new driver features flags
  [SCSI] qla2xxx: Update version number to 8.02.00-k7.
  [SCSI] qla2xxx: Issue correct MBC_INITIALIZE_FIRMWARE command.
  ...
2008-01-25 17:19:08 -08:00
Greg Kroah-Hartman
f9cb074bff Kobject: rename kobject_init_ng() to kobject_init()
Now that the old kobject_init() function is gone, rename
kobject_init_ng() to kobject_init() to clean up the namespace.

Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:38 -08:00
Greg Kroah-Hartman
b2d6db5878 Kobject: rename kobject_add_ng() to kobject_add()
Now that the old kobject_add() function is gone, rename kobject_add_ng()
to kobject_add() to clean up the namespace.

Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:38 -08:00
Greg Kroah-Hartman
d5a379f77b Kobject: convert block/ll_rw_blk.c to use kobject_init/add_ng()
This converts the code to use the new kobject functions, cleaning up the
logic in doing so.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:36 -08:00
Greg Kroah-Hartman
29e3dd0df1 Kobject: convert block/elevator.c to use kobject_init/add_ng()
This converts the code to use the new kobject functions, cleaning up the
logic in doing so.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:36 -08:00
Kay Sievers
edfaa7c365 Driver core: convert block from raw kobjects to core devices
This moves the block devices to /sys/class/block. It will create a
flat list of all block devices, with the disks and partitions in one
directory. For compatibility /sys/block is created and contains symlinks
to the disks.

  /sys/class/block
  |-- sda -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda
  |-- sda1 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda1
  |-- sda10 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda10
  |-- sda5 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda5
  |-- sda6 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda6
  |-- sda7 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda7
  |-- sda8 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda8
  |-- sda9 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda9
  `-- sr0 -> ../../devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sr0

  /sys/block/
  |-- sda -> ../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda
  `-- sr0 -> ../devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sr0

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:36 -08:00
Greg Kroah-Hartman
830d3cfb16 kset: convert block_subsys to use kset_create
Dynamically create the kset instead of declaring it statically.  We also
rename block_subsys to block_kset to catch all users of this symbol
with a build error instead of an easy-to-ignore build warning.

Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:23 -08:00
Greg Kroah-Hartman
3514faca19 kobject: remove struct kobj_type from struct kset
We don't need a "default" ktype for a kset.  We should set this
explicitly every time for each kset.  This change is needed so that we
can make ksets dynamic, and cleans up one of the odd, undocumented
assumption that the kset/kobject/ktype model has.

This patch is based on a lot of help from Kay Sievers.

Nasty bug in the block code was found by Dave Young
<hidave.darkstar@gmail.com>

Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:10 -08:00
James Bottomley
11c3e689f1 [SCSI] block: Introduce new blk_queue_update_dma_alignment interface
The purpose of this is to allow stacked alignment settings, with the
ultimate queue alignment being set to the largest alignment requirement
in the stack.

The reason for this is so that the SCSI mid-layer can relax the default
alignment requirements (which are basically causing a lot of superfluous
copying to go on in the SG_IO interface) while allowing transports,
devices or HBAs to add stricter limits if they need them.

Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2008-01-11 18:29:20 -06:00
James Bottomley
2d507a01da [SCSI] libsas, bsg: pass errors through correctly
Currently in BSG, errors returned in req->errors aren't passed back to
the calling programme (either via SG_IO or via read/write).  Fix this,
while preserving the SCSI convention of returning status in
req->errors.

Now update libsas to return errors correctly instead of to ignore
them.

Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2008-01-11 18:29:13 -06:00
Jens Axboe
11a57153e3 blktrace: kill the unneeded initcall
It just inits the mutex, we can do that with DEFINE_MUTEX() instead.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-11 13:37:01 +01:00
Ingo Molnar
2997c8c4a0 block: fix blktrace timestamps
David Dillow reported broken blktrace timestamps. The reason
is cpu_clock() which is not a global time source.

Fix bkltrace timestamps by using ktime_get() like the networking
code does for packet timestamps. This also removes a whole lot
of complexity from bkltrace.c and shrinks the code by 500 bytes:

   text    data     bss     dec     hex filename
   2888     124      44    3056     bf0 blktrace.o.before
   2390     116      44    2550     9f6 blktrace.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-11 13:35:54 +01:00
Adrian Bunk
2fdd82bd88 block: let elv_register() return void
elv_register() always returns 0, and there isn't anything it does where
it should return an error (the only error condition is so grave that
it's handled with a BUG_ON).

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-12-18 08:29:28 +01:00
Aaron Carroll
49565124b1 as-iosched: fix write batch start point
New write batches currently start from where the last one completed.
We have no idea where the head is after switching batches, so this
makes little sense.  Instead, start the next batch from the request
with the earliest deadline in the hope that we avoid a deadline
expiry later on.

Signed-off-by: Aaron Carroll <aaronc@gelato.unsw.edu.au>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-12-18 08:29:28 +01:00
Aaron Carroll
8896f3c039 as-iosched: fix incorrect comments
Two comments refer to deadlines applying to reads only.  This is
not the case.

Signed-off-by: Aaron Carroll <aaronc@gelato.unsw.edu.au>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-12-18 08:29:28 +01:00
Tejun Heo
24bb8fb99a block: use jiffies conversion functions in scsi_ioctl.c
Use msecs_to_jiffies() and jiffies_to_msecs() in scsi_ioctl().
Sometimes callers use very large values for e.g. vendor specific media
clear command and calculation can overflow.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-12-18 08:29:28 +01:00
Jens Axboe
7c9f29b128 Revert "ll_rw_blk: temporarily enable max_segments tweaking"
This was a temporary debugging thing for sg chaining testing, revert
it now as it has served its purpose.

This reverts commit 563063a808.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-11-27 09:23:51 +01:00
Jerome Marchand
c7674030e5 block: Fix memory leak in alloc_disk_node()
Fix a memory leak in alloc_disk_node(). Don't forget to free 'dkstats' when the allocation of 'part' failed.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-11-27 09:19:40 +01:00
Aneesh Kumar K.V
35fc51e7a5 blktrace: Make sure BLKTRACETEARDOWN does the full cleanup.
if blktrace program segfault it will not be able
to call BLKTRACETEARDOWN. Now if we run the blktrace
again that would result in a failure to create the
block/<device> debugfs directory.This will result
in blk_remove_root() to be called which will set
blk_tree_root to NULL. But the  debugfs block dir
still exist because it contain subdirectory.

Now if we try to fix it using BLKTRACETEARDOWN
it won't work because blk_tree_root is NULL.

Fix the same.

Tested as below

root@qemu-image:/home/kvaneesh/blktrace# ./blktrace  -d /dev/hdc
Segmentation fault
root@qemu-image:/home/kvaneesh/blktrace# ./blktrace  -d /dev/hdc
BLKTRACESETUP: No such file or directory
Failed to start trace on /dev/hdc
root@qemu-image:/home/kvaneesh/blktrace# ./blktrace  -k /dev/hdc
root@qemu-image:/home/kvaneesh/blktrace# ./blktrace  -d /dev/hdc

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-11-27 09:19:39 +01:00
Alan D. Brunelle
2ad8b1ef11 Add UNPLUG traces to all appropriate places
Added blk_unplug interface, allowing all invocations of unplugs to result
in a generated blktrace UNPLUG.

Signed-off-by: Alan D. Brunelle <Alan.Brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-11-09 13:41:32 +01:00
Jens Axboe
d85532ed28 block: fix requeue handling in blk_queue_invalidate_tags()
Credit goes to juergen.kadidlo@exasol.com for diagnosing this issue
and supplying the initial patch.

blk_queue_invalidate_tags() must use the proper requeueing paths instead
of open coding the re-add of the request, otherwise we bug out in rq
accounting. Just switch to using blk_requeue_request(), that takes care
of end-tag handling as well and also adds the blktrace REQUEUE notify
event that is also appropriate here.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-11-09 12:52:45 +01:00
Oleg Nesterov
0e7be9edb9 cfq_idle_class_timer: add paranoid checks for jiffies overflow
In theory, if the queue was idle long enough, cfq_idle_class_timer may have
a false (and very long) timeout because jiffies can wrap into the past wrt
->last_end_request.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-11-07 13:51:35 +01:00
Oleg Nesterov
b70c864d3c cfq: fix IOPRIO_CLASS_IDLE delays
After the fresh boot:

	ionice -c3 -p $$
	echo cfq >> /sys/block/XXX/queue/scheduler
	dd if=/dev/XXX of=/dev/null bs=512 count=1

Now dd hangs in D state and the queue is completely stalled for approximately
INITIAL_JIFFIES + CFQ_IDLE_GRACE jiffies. This is because cfq_init_queue()
forgets to initialize cfq_data->last_end_request.

(I guess this patch is not complete, overflow is still possible)

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-11-07 09:46:13 +01:00
Oleg Nesterov
2389d1ef17 cfq: fix IOPRIO_CLASS_IDLE accounting
Spotted by Nick <gentuu@gmail.com>, hopefully can explain the second trace in
http://bugzilla.kernel.org/show_bug.cgi?id=9180.

If ->async_idle_cfqq != NULL cfq_put_async_queues() puts it IOPRIO_BE_NR times
in a loop. Fix this.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-11-07 09:45:00 +01:00
Linus Torvalds
b4f555081f Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  [BLOCK] Don't allow empty barriers to be passed down to queues that don't grok them
  dm: bounce_pfn limit added
  Deadline iosched: Fix batching fairness
  Deadline iosched: Reset batch for ordered requests
  Deadline iosched: Factor out finding latter reques
2007-11-03 12:43:36 -07:00
Jens Axboe
51fd77bd9f [BLOCK] Don't allow empty barriers to be passed down to queues that don't grok them
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-11-02 08:49:08 +01:00
Aaron Carroll
6f5d8aa638 Deadline iosched: Fix batching fairness
After switching data directions, deadline always starts the next batch
from the lowest-sector request.  This gives excessive deadline expiries
and large latency and throughput disparity between high- and low-sector
requests; an order of magnitude in some tests.

This patch changes the batching behaviour so new batches start from the
request whose expiry is earliest.

Signed-off-by: Aaron Carroll <aaronc@gelato.unsw.edu.au>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-11-02 08:47:25 +01:00
Aaron Carroll
dfb3d72a9a Deadline iosched: Reset batch for ordered requests
The deadline I/O scheduler does not reset the batch count when starting
a new batch at a higher-sectored request.  This means the second and
subsequent batch in the same data direction will never exceed a single
request in size whenever higher-sectored requests are pending.

This patch gives new batches in the same data direction as old ones
their full quota of requests by resetting the batch count.

Signed-off-by: Aaron Carroll <aaronc@gelato.unsw.edu.au>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-11-02 08:47:25 +01:00
Aaron Carroll
5d1a536621 Deadline iosched: Factor out finding latter reques
Factor finding the next request in sector-sorted order into
a function deadline_latter_request.

Signed-off-by: Aaron Carroll <aaronc@gelato.unsw.edu.au>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-11-02 08:47:25 +01:00
Jens Axboe
c46f2334c8 [SG] Get rid of __sg_mark_end()
sg_mark_end() overwrites the page_link information, but all users want
__sg_mark_end() behaviour where we just set the end bit. That is the most
natural way to use the sg list, since you'll fill it in and then mark the
end point.

So change sg_mark_end() to only set the termination bit. Add a sg_magic
debug check as well, and clear a chain pointer if it is set.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-11-02 08:47:06 +01:00
Philip Langdale
33013a8811 compat_ioctl: fix block device compat ioctl regression
The conversion of handlers to compat_blkdev_ioctl accidentally
disabled handling of most ioctl numbers on block devices because
of a typo. Fix the one line to enable it all again.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@carl.home.kernel.dk>
2007-10-29 11:33:06 +01:00
Jens Axboe
6eca9004df [BLOCK] Fix bad sharing of tag busy list on queues with shared tag maps
For the locking to work, only the tag map and tag bit map may be shared
(incidentally, I was just explaining this to Nick yesterday, but I
apparently didn't review the code well enough myself). But we also share
the busy list!  The busy_list must be queue private, or we need a
block_queue_tag covering lock as well.

So we have to move the busy_list to the queue. This'll work fine, and
it'll actually also fix a problem with blk_queue_invalidate_tags() which
will invalidate tags across all shared queues. This is a bit confusing,
the low level driver should call it for each queue seperately since
otherwise you cannot kill tags on just a single queue for eg a hard
drive that stops responding. Since the function has no callers
currently, it's not an issue.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-29 11:33:06 +01:00
Nick Piggin
adb4ddbbfb block: use lock bitops for the tag map.
The block queue tag map can use lock bitops.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-29 11:33:06 +01:00
Oleg Nesterov
0a0836a09c cfq_get_queue: fix possible NULL pointer access
cfq_get_queue()->cfq_find_alloc_queue() can fail, check the returned value.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>

Note that this isn't a bug at the moment, since the regular IO path
does not call this path without __GFP_WAIT set. However, it could be a
future bug, so I've applied it.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-29 11:33:05 +01:00
Oleg Nesterov
abbeb88d00 blk_sync_queue() should cancel request_queue->unplug_work
blk_sync_queue() cancels the timer, but forgets to cancel the work.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-29 11:33:05 +01:00
Oleg Nesterov
4310864b9d cfq_exit_queue() should cancel cfq_data->unplug_work
Spotted by Nick <gentuu@gmail.com>, perhaps explains the first trace in
http://bugzilla.kernel.org/show_bug.cgi?id=9180.

cfq_exit_queue() should cancel cfqd->unplug_work before freeing cfqd.
blk_sync_queue() seems unneeded, removed.

Q: why cfq_exit_queue() calls cfq_shutdown_timer_wq() twice?

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-29 11:33:05 +01:00
Jerome Marchand
b238b3d4be block layer: remove a unused argument of drive_stat_acct()
The nr_sector argument of drive_stat_acct() is not used anymore since the read and write sectors statistics are now updated in end_that_request_first(). This patch removes the useless argument.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-29 11:33:05 +01:00
Jens Axboe
642f149031 SG: Change sg_set_page() to take length and offset argument
Most drivers need to set length and offset as well, so may as well fold
those three lines into one.

Add sg_assign_page() for those two locations that only needed to set
the page, where the offset/length is set outside of the function context.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-24 11:20:47 +02:00
Jens Axboe
7aeacf9822 [BLOCK] blk_rq_map_sg: force clear termination bit
Since blk_rq_map_sg() sets the termination bit at the end of the sg
table, we could see it prematurely on the next mapping unless we
force drivers to do a full sg_init_table() prior to each mapping. So
force clear the termination bit to avoid having to put that clear in
the driver for every mapping.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-23 09:49:25 +02:00
Jens Axboe
ad0d4083e6 [BLOCK] Don't clear sg_dma_len/addr() in blk_rq_map_sg()
It's not a proper lvalue on all archs.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-23 09:27:05 +02:00
Jens Axboe
9b61764bcb [SG] Update block layer to use sg helpers
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-22 19:39:33 +02:00
Uwe Kleine-König
dbe7f76dd6 fix typo "insted" -> "instead"
Signed-off-by: Uwe Kleine-König <ukleinek@informatik.uni-freiburg.de>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
2007-10-20 01:55:04 +02:00
Pavel Emelyanov
ba25f9dcc4 Use helpers to obtain task pid in printks
The task_struct->pid member is going to be deprecated, so start
using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
the kernel.

The first thing to start with is the pid, printed to dmesg - in
this case we may safely use task_pid_nr(). Besides, printks produce
more (much more) than a half of all the explicit pid usage.

[akpm@linux-foundation.org: git-drm went and changed lots of stuff]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:43 -07:00
Randy Dunlap
8f731f7d83 kernel-api docbook: fix content problems
Fix kernel-api docbook contents problems.

docproc: linux-2.6.23-git13/include/asm-x86/unaligned_32.h: No such file or directory
Warning(linux-2.6.23-git13//include/linux/list.h:482): bad line: 			of list entry
Warning(linux-2.6.23-git13//mm/filemap.c:864): No description found for parameter 'ra'
Warning(linux-2.6.23-git13//block/ll_rw_blk.c:3760): No description found for parameter 'req'
Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'private'
Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'cdev'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:35 -07:00
Jens Axboe
ba951841ce [BLOCK] blk_rq_map_sg() next_sg fixup
Don't ever use sg_next() on the last entry, it may not be valid!

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-17 19:34:11 +02:00
Linus Torvalds
b6257a9036 Merge branch 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block:
  [SCSI] Remove full sg table memset()
  [SCSI] ide-scsi: remove usage of sg_last()
  Fix loop terminating conditions in fill_sg().
  [BLOCK] Clear sg entry before filling in blk_rq_map_sg()
  IA64: iommu uses sg_next with an invalid sg element
  cciss: disable DMA refetch on Smart Array P600
  swiotlb: fix map_sg failure handling
  SPARC64: fix iommu sg chaining
  [SCSI] ide-scsi: use scsi_sg_count() instead of ->use_sg
2007-10-17 09:08:13 -07:00
Peter Zijlstra
e0bf68ddec mm: bdi init hooks
provide BDI constructor/destructor hooks

[akpm@linux-foundation.org: compile fix]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
Jens Axboe
60573b874b [BLOCK] Clear sg entry before filling in blk_rq_map_sg()
The memset() of the sg entry was originally removed, because it could
overwrite a chain pointer. But it's quite OK to memset() it when we know
it's a valid entry, since it can't contain a chain pointer.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-17 13:02:33 +02:00
Linus Torvalds
92d15c2ccb Merge branch 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block: (63 commits)
  Fix memory leak in dm-crypt
  SPARC64: sg chaining support
  SPARC: sg chaining support
  PPC: sg chaining support
  PS3: sg chaining support
  IA64: sg chaining support
  x86-64: enable sg chaining
  x86-64: update pci-gart iommu to sg helpers
  x86-64: update nommu to sg helpers
  x86-64: update calgary iommu to sg helpers
  swiotlb: sg chaining support
  i386: enable sg chaining
  i386 dma_map_sg: convert to using sg helpers
  mmc: need to zero sglist on init
  Panic in blk_rq_map_sg() from CCISS driver
  remove sglist_len
  remove blk_queue_max_phys_segments in libata
  revert sg segment size ifdefs
  Fixup u14-34f ENABLE_SG_CHAINING
  qla1280: enable use_sg_chaining option
  ...
2007-10-16 10:09:16 -07:00
Fengguang Wu
f2e189827a readahead: remove the limit max_sectors_kb imposed on max_readahead_kb
Remove the size limit max_sectors_kb imposed on max_readahead_kb.

The size restriction is unreasonable.  Especially when max_sectors_kb cannot
grow larger than max_hw_sectors_kb, which can be rather small for some disk
drives.

Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Mike Travis
d5a7430ddc Convert cpu_sibling_map to be a per cpu variable
Convert cpu_sibling_map from a static array sized by NR_CPUS to a per_cpu
variable.  This saves sizeof(cpumask_t) * NR unused cpus.  Access is mostly
from startup and CPU HOTPLUG functions.

Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:50 -07:00
Jens Axboe
3eed13fd93 Merge branch 'sglist-arch' into for-linus 2007-10-16 12:29:34 +02:00
Jens Axboe
a39d113936 Merge branch 'barrier' into for-linus 2007-10-16 12:29:29 +02:00
Jens Axboe
563063a808 ll_rw_blk: temporarily enable max_segments tweaking
Expose this setting for now, so that users can play with enabling
large commands without defaulting it to on globally. This is a debug
patch, it will be dropped for the final versions.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:08:53 +02:00
Jens Axboe
f565913ef8 block: convert to using sg helpers
Convert the main rq mapper (blk_rq_map_sg()) to the sg helper setup.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:07:11 +02:00
Jens Axboe
fd5d806266 block: convert blkdev_issue_flush() to use empty barriers
Then we can get rid of ->issue_flush_fn() and all the driver private
implementations of that.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:05:02 +02:00
Jens Axboe
bf2de6f5a4 block: Initial support for data-less (or empty) barrier support
This implements functionality to pass down or insert a barrier
in a queue, without having data attached to it. The ->prepare_flush_fn()
infrastructure from data barriers are reused to provide this
functionality.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:03:56 +02:00
Jens Axboe
c07e2b4129 block: factor our bio_check_eod()
End of device check is done twice in __generic_make_request() and it's
fully inlined each time.  Factor out bio_check_eod().

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:03:55 +02:00
Jens Axboe
a0cd128542 block: add end_queued_request() and end_dequeued_request() helpers
We can use this helper in the elevator core for BLKPREP_KILL, and it'll
also be useful for the empty barrier patch.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:03:53 +02:00
Jens Axboe
4fa253f33c block: ll_rw_blk.c: cosmetics
Fix ?: construct, a typo, whitespace, and similar.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:03:49 +02:00
Arjan van de Ven
7344be053a bsg: mark struct file_operations const
struct file_operations is generally const (to avoid false sharing and get compile time errors on accidental writing to this shared structure); bsg recently added one of these without the const keyword. Patch below marks it const....

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 09:59:54 +02:00
Linus Torvalds
2b9e0aae1d Only enable BLOCK_COMPAT if COMPAT is needed
IOW, it needs to depend on both CONFIG_BLOCK and CONFIG_COMPAT.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-12 17:58:36 -07:00
Linus Torvalds
efefc6eb38 Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6: (75 commits)
  PM: merge device power-management source files
  sysfs: add copyrights
  kobject: update the copyrights
  kset: add some kerneldoc to help describe what these strange things are
  Driver core: rename ktype_edd and ktype_efivar
  Driver core: rename ktype_driver
  Driver core: rename ktype_device
  Driver core: rename ktype_class
  driver core: remove subsystem_init()
  sysfs: move sysfs file poll implementation to sysfs_open_dirent
  sysfs: implement sysfs_open_dirent
  sysfs: move sysfs_dirent->s_children into sysfs_dirent->s_dir
  sysfs: make sysfs_root a regular directory dirent
  sysfs: open code sysfs_attach_dentry()
  sysfs: make s_elem an anonymous union
  sysfs: make bin attr open get active reference of parent too
  sysfs: kill unnecessary NULL pointer check in sysfs_release()
  sysfs: kill unnecessary sysfs_get() in open paths
  sysfs: reposition sysfs_dirent->s_mode.
  sysfs: kill sysfs_update_file()
  ...
2007-10-12 15:49:37 -07:00
Greg Kroah-Hartman
7e7654a92a cdev: remove unneeded setting of cdev names
struct cdev does not need the kobject name to be set, as it is never
used.  This patch fixes up the few places it is set.


Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org>
2007-10-12 14:51:02 -07:00
Greg Kroah-Hartman
19c38de88a kobjects: fix up improper use of the kobject name field
A number of different drivers incorrect access the kobject name field
directly.  This is not correct as the name might not be in the array.
Use the proper accessor function instead.
2007-10-12 14:51:02 -07:00
Kay Sievers
7eff2e7a8b Driver core: change add_uevent_var to use a struct
This changes the uevent buffer functions to use a struct instead of a
long list of parameters. It does no longer require the caller to do the
proper buffer termination and size accounting, which is currently wrong
in some places. It fixes a known bug where parts of the uevent
environment are overwritten because of wrong index calculations.

Many thanks to Mathieu Desnoyers for finding bugs and improving the
error handling.

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:51:01 -07:00
Jens Axboe
99874d5048 [BLOCK] Only include the compat ioctl code if CONFIG_BLOCK is set
Add an extra CONFIG_BLOCK_COMPAT that we can use in the Makefile

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-12 12:50:41 +02:00
Arnd Bergmann
1ca91cd033 compat_ioctl: move floppy handlers to block/compat_ioctl.c
The floppy ioctls are used by multiple drivers, so they should be
handled in a shared location. Also, add minor cleanups.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:26:00 +02:00
Arnd Bergmann
b3087cc4f3 compat_ioctl: move cdrom handlers to block/compat_ioctl.c
These are shared by all cd-rom drivers and should have common
handlers. Do slight cosmetic cleanups in the process.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:26:00 +02:00
Arnd Bergmann
18cf7f8723 compat_ioctl: move BLKPG handling to block/compat_ioctl.c
BLKPG is common to all block devices, so it should be handled
by common code.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:26:00 +02:00
Arnd Bergmann
9617db085c compat_ioctl: move hdio calls to block/compat_ioctl.c
These are common to multiple block drivers, so they should
be handled by the block layer.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:26:00 +02:00
Arnd Bergmann
171044d449 compat_ioctl: handle blk_trace ioctls
blk_trace_setup is broken on x86_64 compat systems,
this makes the code work correctly on all 64 bit architectures
in compat mode.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:26:00 +02:00
Arnd Bergmann
7199d4cdd8 compat_ioctl: add compat_blkdev_driver_ioctl()
Handle those blockdev ioctl calls that are compatible
directly from the compat_blkdev_ioctl() function, instead
of having to go through the compat_ioctl hash lookup.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:26:00 +02:00
Arnd Bergmann
f58c4c0a17 compat_ioctl: move common block ioctls to compat_blkdev_ioctl
Make compat_blkdev_ioctl and blkdev_ioctl reflect the respective
native versions. This is somewhat more efficient and makes it easier
to keep the two in sync.

Also get rid of the bogus handling for broken_blkgetsize and the
duplicate entry for BLKRASET.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:26:00 +02:00
NeilBrown
6712ecf8f6 Drop 'size' argument from bio_endio and bi_end_io
As bi_end_io is only called once when the reqeust is complete,
the 'size' argument is now redundant.  Remove it.

Now there is no need for bio_endio to subtract the size completed
from bi_size.  So don't do that either.

While we are at it, change bi_end_io to return void.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:57 +02:00
NeilBrown
5bb23a688b Don't decrement bi_size in bio_endio
The only caller of bio_endio that does not pass the full bi_size
is end_that_request_first.  Also, no ->bi_end_io method is really
interested in bi_size being decremented.

So move the decrement and related code into ll_rw_blk and merge it
with order_bio_endio to form req_bio_endio which does endio functionality
specific to request completion.

As some ->bi_end_io methods do check bi_size of 0, we set it thus for
now, but that will go in the next patch.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./block/ll_rw_blk.c |   42 +++++++++++++++++++++++++++---------------
 ./fs/bio.c          |   23 +++++++++++------------
 2 files changed, 38 insertions(+), 27 deletions(-)

diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:57 +02:00
NeilBrown
d24517d793 Remove flush_dry_bio_endio
The entire function of flush_dry_bio_endio is to undo the effects
of bio_endio (when called on a barrier request).  So remove the
function and the call to bio_endio.

This allows us to remove "bi_size" from "struct request_queue".

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./block/ll_rw_blk.c      |   39 ++-------------------------------------
 ./include/linux/blkdev.h |    1 -
 2 files changed, 2 insertions(+), 38 deletions(-)

diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:57 +02:00
Satyam Sharma
db47d47537 ll_rw_blk: blk_cpu_notifier should be __cpuinitdata
blk_cpu_notifier is marked as __devinitdata, but __devinitdata need not
be __init even if HOTPLUG_CPU=n, which wastes space. It should be marked
__cpuinitdata, and the callback itself as __cpuinit.

Signed-off-by: Satyam Sharma <satyam@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:56 +02:00
Jens Axboe
6c92e699b5 Fixup rq_for_each_segment() indentation
Remove one level of nesting where appropriate.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:56 +02:00
NeilBrown
bc1c56fde6 Share code between init_request_from_bio and blk_rq_bio_prep
These have very similar functions and should share code where
possible.

Signed-off-by: Neil Brown <neilb@suse.de>

diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:56 +02:00
NeilBrown
66846572bf Stop exporting blk_rq_bio_prep
blk_rq_bio_prep is exported for use in exactly
one place.  That place can benefit from using
the new blk_rq_append_bio instead.
So
  - change dm-emc to call blk_rq_append_bio
  - stop exporting blk_rq_bio_prep, and
  - initialise rq_disk in blk_rq_bio_prep,
       as dm-emc needs it.

Signed-off-by: Neil Brown <neilb@suse.de>

diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:56 +02:00
NeilBrown
3001ca7712 New function blk_req_append_bio
ll_back_merge_fn is currently exported to SCSI where is it used,
together with blk_rq_bio_prep, in exactly the same way these
functions are used in __blk_rq_map_user.

So move the common code into a new function (blk_rq_append_bio), and
don't export ll_back_merge_fn any longer.

Signed-off-by: Neil Brown <neilb@suse.de>

diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:56 +02:00
NeilBrown
5705f70217 Introduce rq_for_each_segment replacing rq_for_each_bio
Every usage of rq_for_each_bio wraps a usage of
bio_for_each_segment, so these can be combined into
rq_for_each_segment.

We define "struct req_iterator" to hold the 'bio' and 'index' that
are needed for the double iteration.

Signed-off-by: Neil Brown <neilb@suse.de>

Various compile fixes by me...

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:56 +02:00
NeilBrown
9dfa52831e Merge blk_recount_segments into blk_recalc_rq_segments
blk_recalc_rq_segments calls blk_recount_segments on each bio,
then does some extra calculations to handle segments that overlap
two bios.

If we merge the code from blk_recount_segments into
blk_recalc_rq_segments, we can process the whole request one bio_vec
at a time, and not need the messy cross-bio calculations.

Then blk_recount_segments can be implemented by calling
blk_recalc_rq_segments, passing it a simple on-stack request which
stores just the bio.

Signed-off-by: Neil Brown <neilb@suse.de>

diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:56 +02:00
Nick Piggin
dd941252a8 shared tag queue barrier comment
Should add some comments for the tag barriers (they won't be so important
if we can switch over to the explicit _lock bitops, but for now we should
make it clear).

Jens' original patch said a barrier after the test_and_clear_bit was also
required. I can't see why (and it would prevent the use of the _lock bitop).

Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
--
2007-09-14 13:56:47 -07:00
Jens Axboe
f3da54ba14 Fix race with shared tag queue maps
There's a race condition in blk_queue_end_tag() for shared tag maps,
users include stex (promise supertrak thingy) and qla2xxx.  The former
at least has reported bugs in this area, not sure why we haven't seen
any for the latter.  It could be because the window is narrow and that
other conditions in the qla2xxx code hide this.  It's a real bug,
though, as the stex smp users can attest.

We need to ensure two things - the tag bit clearing needs to happen
AFTER we cleared the tag pointer, as the tag bit clearing/setting is
what protects this map.  Secondly, we need to ensure that the visibility
of the tag pointer and tag bit clear are ordered properly.

[ I removed the SMP barriers - "test_and_clear_bit()" already implies
  all the required barriers.  -- Linus ]

Also see http://bugzilla.kernel.org/show_bug.cgi?id=7842

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-13 08:20:25 -07:00
Alan D. Brunelle
c7149d6bce Fix remap handling by blktrace
This patch provides more information concerning REMAP operations on block
IOs. The additional information provides clearer details at the user level,
and supports post-processing analysis in btt.

o  Adds in partition remaps on the same device.
o  Fixed up the remap information in DM to be in the right order
o  Sent up mapped-from and mapped-to device information

Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-08-11 22:34:48 +02:00
FUJITA Tomonori
0c6a89ba64 [SCSI] bsg: update sg_io_v4 structure
This updates sg_io_v4 structure (based on Doug's RFC, release 1.3).

The major changes are:

- add dout_resid field
- increase tag size to 64 bits to comply with SAM-4 and SRP
- add dout_iovec_count and din_iovec_count

dout_iovec_count and din_iovec_count aren't supported now. I'm not
sure whether they will be supported or not but they were added for the
possible future changes.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
2007-07-31 10:43:05 -05:00
Linus Torvalds
a6ce22a5f6 Merge master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6: (28 commits)
  [SCSI] mpt fusion: Changes in mptctl.c for logging support
  [SCSI] mpt fusion: Changes in mptfc.c mptlan.c mptsas.c and mptspi.c for logging support
  [SCSI] mpt fusion: Changes in mptscsih.c for logging support
  [SCSI] mpt fusion: Changes in mptbase.c for logging support
  [SCSI] mpt fusion: logging support in Kconfig, Makefile, mptbase.h and addition of mptdebug.h
  [SCSI] libsas: Fix potential NULL dereference in sas_smp_get_phy_events()
  [SCSI] bsg: Fix build for CONFIG_BLOCK=n
  [SCSI] aacraid: fix Sunrise Lake reset handling
  [SCSI] aacraid: add SCSI SYNCHONIZE_CACHE range checking
  [SCSI] add easyRAID to the no report luns blacklist
  [SCSI] advansys: lindent and other large, uninteresting changes
  [SCSI] aic79xx, aic7xxx: Fix incorrect width setting
  [SCSI] qla2xxx: fix to honor ignored parameters in sysfs attributes
  [SCSI] aacraid: draw line in sand, sundry cleanup and version update
  [SCSI] iscsi_tcp: Turn off bounce buffers
  [SCSI] libiscsi: fix cmd seqeunce number checking
  [SCSI] iscsi_tcp, ib_iser Enable module refcounting for iscsi host template
  [SCSI] libiscsi: make sure session is not blocked when removing host
  [SCSI] libsas: Remove PCI dependencies
  [SCSI] simscsi: convert to use the data buffer accessors
  ...
2007-07-29 17:22:03 -07:00
Paul Mundt
99d4d0a9f2 [SCSI] bsg: Fix build for CONFIG_BLOCK=n
BLK_DEV_BSG was added outside of the if BLOCK check, which allows it to
be enabled when CONFIG_BLOCK=n. This leads to many screenlengths of
errors, starting with a parse error on the request_queue_t definition.
Obviously this wasn't intended for CONFIG_BLOCK=n usage, so just move the
option back in to the block.

Caught with a randconfig on sh.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
2007-07-28 10:56:06 -04:00
Ingo Molnar
7c2ff389bb blktrace: use cpu_clock() instead of sched_clock()
use cpu_clock() instead of sched_clock(). (the latter is not a proper
clock-source)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-27 08:08:24 +02:00
Paul Mundt
4d5d8e9d3e bsg: Fix build for CONFIG_BLOCK=n
BLK_DEV_BSG was added outside of the if BLOCK check, which allows it to
be enabled when CONFIG_BLOCK=n. This leads to many screenlengths of
errors, starting with a parse error on the request_queue_t definition.
Obviously this wasn't intended for CONFIG_BLOCK=n usage, so just move the
option back in to the block.

Caught with a randconfig on sh.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>

--

 block/Kconfig |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-27 08:08:24 +02:00
Jens Axboe
165125e1e4 [BLOCK] Get rid of request_queue_t typedef
Some of the code has been gradually transitioned to using the proper
struct request_queue, but there's lots left. So do a full sweet of
the kernel and get rid of this typedef and replace its uses with
the proper type.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-24 09:28:11 +02:00
FUJITA Tomonori
1079ddcb07 [SCSI] bsg: remove unnecessary code and comments
- kill uhdr in bsg_command structure
- it's not necessary to put SG v4 stuff to block/scsi_ioctl.c

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
2007-07-23 16:50:16 -05:00
FUJITA Tomonori
598443a212 [SCSI] bsg: use lib/idr.c to find a unique minor number
This replaces the current linear search for a unique minor number with
lib/idr.c.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
2007-07-23 16:49:44 -05:00
Linus Torvalds
e6f194d8f6 Merge master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (60 commits)
  [SCSI] libsas: make ATA functions selectable by a config option
  [SCSI] bsg: unexport sg v3 helper functions
  [SCSI] bsg: fix bsg_unregister_queue
  [SCSI] bsg: make class backlinks
  [SCSI] 3w-9xxx: add support for 9690SA
  [SCSI] bsg: fix bsg_register_queue error path
  [SCSI] ESP: Increase ESP_BUS_TIMEOUT to 275.
  [SCSI] libsas: fix scr_read/write users and update the libata documentation
  [SCSI] mpt fusion: update Kconfig help
  [SCSI] scsi_transport_sas: add destructor for bsg
  [SCSI] iscsi_tcp: buggered kmalloc()
  [SCSI] qla2xxx: Update version number to 8.02.00-k2.
  [SCSI] qla2xxx: Add ISP25XX support.
  [SCSI] qla2xxx: Use pci_try_set_mwi().
  [SCSI] qla2xxx: Use PCI-X/PCI-Express read control interfaces.
  [SCSI] qla2xxx: Re-factor isp_operations to static structures.
  [SCSI] qla2xxx: Validate mid-layer 'underflow' during check-condition handling.
  [SCSI] qla2xxx: Correct setting of 'current' and 'supported' speeds during FDMI registration.
  [SCSI] qla2xxx: Generalize iIDMA support.
  [SCSI] qla2xxx: Generalize FW-Interface-2 support.
  ...
2007-07-22 11:36:49 -07:00
FUJITA Tomonori
41e1703b9b [SCSI] bsg: unexport sg v3 helper functions
blk_fill_sghdr_rq, blk_unmap_sghdr_rq, and blk_complete_sghdr_rq were
exported for bsg, however bsg was changed to support only sg v4.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
2007-07-22 08:48:41 -05:00
FUJITA Tomonori
df468820b6 [SCSI] bsg: fix bsg_unregister_queue
scsi_sysfs_add_sdev ignores the bsg_register_queue failure, so
bsg_unregister_queue must check whether the queue has a bsg device.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
2007-07-21 08:58:41 -05:00
James Bottomley
39dca558a5 [SCSI] bsg: make class backlinks
Currently, bsg doesn't make class backlinks (a process whereby you'd get
a link to bsg in the device directory in the same way you get one for
sg).  This is because the bsg device is uninitialised, so the class
device has nothing it can attach to.  The fix is to make the bsg device
point to the cdevice of the entity creating the bsg, necessitating
changing the bsg_register_queue() prototype into a form that takes the
generic device.

Acked-by: FUJITA Tomonori <tomof@acm.org>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
2007-07-21 08:58:23 -05:00
James Bottomley
6826ee4fdb [SCSI] bsg: fix bsg_register_queue error path
unfortunately, if IS_ERR(class_dev) is true, that means class_dev isn't
null and the check in the error leg is pointless ... it's also asking
for trouble to request unregistration of a device we haven't actually
created (although it works currently).  Fix by using explicit gotos and
unregisters.

Acked-by: FUJITA Tomonori <tomof@acm.org>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
2007-07-21 08:53:33 -05:00
Alexey Dobriyan
8350163a90 cfq: Write-only stuff in CFQ data structures
There are some leftover bits from the task cooperator patch, that was
yanked out again. While it will get reintroduced, no point in having
this write-only stuff in the tree. So yank it.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-20 10:07:50 +02:00
Vasily Tarasov
c2dea2d1fd cfq: async queue allocation per priority
If we have two processes with different ioprio_class, but the same
ioprio_data, their async requests will fall into the same queue. I guess
such behavior is not expected, because it's not right to put real-time
requests and best-effort requests in the same queue.

The attached patch fixes the problem by introducing additional *cfqq
fields on cfqd, pointing to per-(class,priority) async queues.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-20 10:06:38 +02:00
Paul Mundt
20c2df83d2 mm: Remove slab destructors from kmem_cache_create().
Slab destructors were no longer supported after Christoph's
c59def9f22 change. They've been
BUGs for both slab and slub, and slob never supported them
either.

This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-07-20 10:11:58 +09:00
James Bottomley
80ed71ce1a [SCSI] bsg: separate bsg and SCSI (so SCSI can be modular)
This patch moves the bsg registration into SCSI so that bsg no longer
has a dependency on the scsi_interface_register API.

This can be viewed as a temporary expedient until we can get universal
bsg binding sorted out properly.  Also use the sdev bus_id as the
generic bsg name (to avoid clashes with the queue name).

Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
2007-07-19 12:37:34 -05:00
Yoann Padioleau
dd00cc486a some kmalloc/memset ->kzalloc (tree wide)
Transform some calls to kmalloc/memset to a single kzalloc (or kcalloc).

Here is a short excerpt of the semantic patch performing
this transformation:

@@
type T2;
expression x;
identifier f,fld;
expression E;
expression E1,E2;
expression e1,e2,e3,y;
statement S;
@@

 x =
- kmalloc
+ kzalloc
  (E1,E2)
  ...  when != \(x->fld=E;\|y=f(...,x,...);\|f(...,x,...);\|x=E;\|while(...) S\|for(e1;e2;e3) S\)
- memset((T2)x,0,E1);

@@
expression E1,E2,E3;
@@

- kzalloc(E1 * E2,E3)
+ kcalloc(E1,E2,E3)

[akpm@linux-foundation.org: get kcalloc args the right way around]
Signed-off-by: Yoann Padioleau <padator@wanadoo.fr>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Cc: Bryan Wu <bryan.wu@analog.com>
Acked-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Dave Airlie <airlied@linux.ie>
Acked-by: Roland Dreier <rolandd@cisco.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Acked-by: Dmitry Torokhov <dtor@mail.ru>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org>
Acked-by: Pierre Ossman <drzeus-list@drzeus.cx>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Greg KH <greg@kroah.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:50 -07:00
Linus Torvalds
f3d9071667 Merge branch 'bsg' of git://git.kernel.dk/data/git/linux-2.6-block
* 'bsg' of git://git.kernel.dk/data/git/linux-2.6-block:
  bsg: fix missing space in version print
  Don't define empty struct bsg_class_device if !CONFIG_BLK_DEV_BSG
  bsg: Kconfig updates
  bsg: minor cleanup
  bsg: device hash table cleanup
  bsg: fix initialization error handling bugs
  bsg: mark FUJITA Tomonori as bsg maintainer
  bsg: convert to dynamic major
  bsg: address various review comments
2007-07-17 15:26:31 -07:00
Akinobu Mita
f4480240f7 unregister_blkdev(): return void
Put WARN_ON and fixed all callers of unregister_blkdev().  Now we can make
unregister_blkdev return void.

Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:03 -07:00
Akinobu Mita
294462a5c6 unregister_blkdev(): do WARN_ON on failure
When unregister_blkdev() has failed, something wrong happened.  This patch
adds WARN_ON to notify of such badness.

Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:03 -07:00
Christoph Lameter
94f6030ca7 Slab allocators: Replace explicit zeroing with __GFP_ZERO
kmalloc_node() and kmem_cache_alloc_node() were not available in a zeroing
variant in the past.  But with __GFP_ZERO it is possible now to do zeroing
while allocating.

Use __GFP_ZERO to remove the explicit clearing of memory via memset whereever
we can.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:02 -07:00
Jens Axboe
5d3a8cd34b bsg: fix missing space in version print
Tomo introduced a bug in his commit, removing the space between
"driver" and "version" in the init printk.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-17 15:10:09 +02:00
FUJITA Tomonori
319a7b7fb9 bsg: Kconfig updates
- add the detailed explanation.
- remove 'default y'.
- make 'EXPERIMENTAL' keyword visible to the user in menu.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-17 12:22:09 +02:00
FUJITA Tomonori
0ed081ce20 bsg: minor cleanup
- fix MODULE_DESCRIPTION typo.
- unify MODULE_DESCRIPTION and bsg_version.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-17 12:21:35 +02:00
FUJITA Tomonori
1c1133e1ff bsg: device hash table cleanup
- kill unused bsg_list_idx macro.
- add bsg_dev_idx_hash() that returns an appropriate hlist_head.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-17 12:21:15 +02:00
FUJITA Tomonori
9b9f770cef bsg: fix initialization error handling bugs
This fixes the following bugs and cleans up the initialization code:

- cdev_del is missing.
- unregister_chrdev_region should be used instead of unregister_chrdev.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-17 12:20:46 +02:00
Jens Axboe
46f6ef4afc bsg: convert to dynamic major
240 was hardcoded, that was clearly a dumb mistake. Convert bsg
to use alloc_chrdev_region() to retrieve a dynamic major.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-17 08:56:10 +02:00
Jens Axboe
25fd164303 bsg: address various review comments
This address most of the comments made by Andrew. The two remaining
are conversion to idr, and dynamic major.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-17 08:52:29 +02:00
Linus Torvalds
29417b899a Make BLK_DEV_BSG depend strictly on SCSI=y
The SCSI code can be compiled modular, but BLK_DEV_BSG currently cannot,
and depends on the SCSI layer.  So make sure that it depends on the SCSI
layer being compiled in, not just available as a module.

Noticed by Jeff Garzik and S.Çağlar Onur.

Cc: Jeff Garzik <jeff@garzik.org>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: S.Çağlar Onur <caglar@pardus.org.tr>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 16:53:22 -07:00
Linus Torvalds
abce891a10 Fix new generic block device SG compile
We had a merge issue with the "dentry" field going away from the
kobject, and being replaced by a sysfs_dirent field (named "sd")
instead.  That broke the BSG compile.

Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: James Bottomley <James.Bottomley@SteelEye.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 11:18:23 -07:00
FUJITA Tomonori
58ff411e0d bsg: Kconfig updates
This updates bsg entry in Kconfig:

- bsg supports sg v4
- bsg depends on SCSI
- it might be better to mark it experimental for a while

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:47 +02:00
FUJITA Tomonori
15d10b611f bsg: add SCSI transport-level request support
This enables bsg to handle SCSI transport-level request like SAS
management protocol (SMP).

- add BSG_SUB_PROTOCOL_{SCSI_CMD, SCSI_TMF, SCSI_TRANSPORT} definitions.
- SCSI transport-level requests skip blk_verify_command().

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:47 +02:00
FUJITA Tomonori
2c9ecdf40a bsg: add bidi support
bsg uses the rq->next_rq pointer for a bidi request.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:47 +02:00
FUJITA Tomonori
abae1fde63 add a struct request pointer to the request structure
This adds a struct request pointer to the request structure for the
second data phase (bidi for now). A request queue supporting bidi
requests sets QUEUE_FLAG_BIDI. This prevents sending bidi requests to
a non-bidi queue.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:46 +02:00
FUJITA Tomonori
efba1a31f3 bsg: fix the deadlock on discarding done commands
The previous commit introduced a deadlock in discarding commands,
because we forget to unlock the bd spinlock.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:46 +02:00
FUJITA Tomonori
e7d7217324 bsg: fix a blocking read bug
This patch fixes a bug that read() returns ENODATA even with a
blocking file descriptor when there are no commands pending.

This also includes some cleanups.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:46 +02:00
FUJITA Tomonori
4cf0723ac8 bsg: minor bug fixes
This fixes the following minor issues:

- add EXPORT_SYMBOL_GPL for bsg_register_queue and
bsg_unregister_queue.

- shut up gcc warnings

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <axboe@nelson.home.kernel.dk>
2007-07-16 08:52:46 +02:00
FUJITA Tomonori
292b7f2712 improve bsg device allocation
This patch addresses on two issues on bsg device allocation.

- the current maxium number of bsg devices is 256. It's too small if
we allocate bsg devices to all SCSI devices, transport entities, etc.
This increses the maxium number to 32768 (taken from the sg driver).

- SCSI devices are dynamically added and removed. Currently, bsg can't
handle it well since bsd_device->minor is simply increased.

This is dependent on the patchset that I posted yesterday:

http://marc.info/?l=linux-scsi&m=117440208726755&w=2

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:46 +02:00
FUJITA Tomonori
4e2872d6b0 bind bsg to all SCSI devices
This patch binds bsg to all SCSI devices (their request queues) like
the current sg driver does. We can send SCSI commands to non disk and
cdrom scsi devices like OSD via bsg.

This patch removes bsg_register_queue from blk_register_queue so bsg
devices aren't bound to non SCSI block devices. If they want bsg, I'll
send a patch to do that.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:46 +02:00
FUJITA Tomonori
d351af01b9 bsg: bind bsg to request_queue instead of gendisk
This patch binds bsg devices to request_queue instead of gendisk. Any
objects (like transport entities) can define own request_handler and
create own bsg device.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:46 +02:00
FUJITA Tomonori
45e79a3acd bsg: add a request_queue argument to scsi_cmd_ioctl()
bsg uses scsi_cmd_ioctl() for some SCSI/sg ioctl
commands. scsi_cmd_ioctl() gets a request queue from a gendisk
arguement. This prevents bsg being bound to SCSI devices that don't
have a gendisk (like OSD). This adds a request_queue argument to
scsi_cmd_ioctl(). The SCSI/sg ioctl commands doesn't use a gendisk so
it's safe for any SCSI devices to use scsi_cmd_ioctl().

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:45 +02:00
FUJITA Tomonori
7e75d73080 bsg: simplify __bsg_alloc_command failpath
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:45 +02:00
Jens Axboe
264a047218 bsg: add cheasy error checks for sysfs stuff
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:45 +02:00
Jens Axboe
5309cb38de Add queue resizing support
Just get rid of the preallocated command map, use the slab cache
to get/free commands instead.

Original patch from FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>,
changed by me to not use a mempool.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:45 +02:00
Jens Axboe
2ef7086a20 bsg: silence a bogus gcc warning
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:45 +02:00
Jens Axboe
b711afa695 bsg: style cleanup
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:45 +02:00
FUJITA Tomonori
10e8855b94 bsg: add SG_IO to SG v4
This adds SG_IO support to SG v4.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:44 +02:00
FUJITA Tomonori
70e36eceaf bsg: replace SG v3 with SG v4
This patch replaces SG v3 in bsg with SG v4 (except for SG_IO).

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:44 +02:00
FUJITA Tomonori
337ad41dea block: export blk_verify_command for SG v4
blk_fill_sghdr_rq doesn't work for SG v4 so verify_command needed to
be exported.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:44 +02:00
FUJITA Tomonori
9e69fbb537 bsg: minor cleanups
This just kills linux/config.h and dprintk warnings.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:44 +02:00
FUJITA Tomonori
ac6b91b803 block: changes for blk_rq_unmap_user new API
This converts block/scsi_ioctl.c use blk_rq_unmap_user new
API. blk_unmap_sghdr_rq is too simple and it might be better to remove
it.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:44 +02:00
Jens Axboe
3d6392cfbd bsg: support for full generic block layer SG v3
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:44 +02:00
Matthias Kaehlcke
70cee26e02 Use list_for_each_entry() instead of list_for_each() in the block device
elevator

Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 13:43:32 +02:00
Jan Engelhardt
16ed002f22 Use menuconfigs instead of menus, so the whole menu can be disabled at once
instead of going through all options.

Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 13:43:27 +02:00
Jens Axboe
15c31be4d5 cfq-iosched: fix async queue behaviour
With the cfq_queue hash removal, we inadvertently got rid of the
async queue sharing. This was not intentional, in fact CFQ purposely
shares the async queue per priority level to get good merging for
async writes.

So put some logic in cfq_get_queue() to track the shared queues.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 13:43:25 +02:00
Tejun Heo
f4b09303d0 [BLOCK] drop unnecessary bvec rewinding from flush_dry_bio_endio
Barrier bios are completed twice - once after the barrier write itself
is done and again after the whole sequence is complete.
flush_dry_bio_endio() is for the first completion.  It doesn't really
complete the bio.  It rewinds bvec and resets bio so that it can be
completed again when the whole barrier sequence is complete.

The bvec rewinding code has the following problems.

1. The rewinding code is wrong because filesystems may pass bvec with
   non zero bv_offset.

2. The block layer doesn't guarantee anything about the state of
   bvec array on request completion.  bv_offset and len are updated
   iff __end_that_request_first() completes the bvec partially.

Because of #2, #1 doesn't really matter (nobody cares whether bvec is
re-wound correctly or not) but then again by not doing unwinding at
all, we'll always give back the same bvec to the caller as full bvec
completion doesn't alter bvecs and the final completion is always full
completion.

Drop unnecessary rewinding code.

This is spotted by Neil Brown.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 08:03:33 +02:00
Jens Axboe
32eef96411 blk_hw_contig_segment(): bad segment size checks
Two bugs in there:

- The virt oversize check should use the current bio hardware back
  size and the next bio front size, not the same bio. Spotted by
  Neil Brown.

- The segment size check should add hw front sizes, not total bio
  sizes. Spotted by James Bottomley

Acked-by: James Bottomley <James.Bottomley@SteelEye.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 08:03:32 +02:00
Tejun Heo
bc90ba093a block: always requeue !fs requests at the front
SCSI marks internal commands with REQ_PREEMPT and push it at the front
of the request queue using blk_execute_rq().  When entering suspended
or frozen state, SCSI devices are quiesced using
scsi_device_quiesce().  In quiesced state, only REQ_PREEMPT requests
are processed.  This is how SCSI blocks other requests out while
suspending and resuming.  As all internal commands are pushed at the
front of the queue, this usually works.

Unfortunately, this interacts badly with ordered requeueing.  To
preserve request order on requeueing (due to busy device, active EH or
other failures), requests are sorted according to ordered sequence on
requeue if IO barrier is in progress.

The following sequence deadlocks.

1. IO barrier sequence issues.

2. Suspend requested.  Queue is quiesced with part or all of IO
   barrier sequence at the front.

3. During suspending or resuming, SCSI issues internal command which
   gets deferred and requeued for some reason.  As the command is
   issued after the IO barrier in #1, ordered requeueing code puts the
   request after IO barrier sequence.

4. The device is ready to process requests again but still is in
   quiesced state and the first request of the queue isn't
   REQ_PREEMPT, so command processing is deadlocked -
   suspending/resuming waits for the issued request to complete while
   the request can't be processed till device is put back into
   running state by resuming.

This can be fixed by always putting !fs requests at the front when
requeueing.

The following thread reports this deadlock.

  http://thread.gmane.org/gmane.linux.kernel/537473

Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: David Greaves <david@dgreaves.com>
Acked-by: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-15 16:12:20 -07:00
Kristen Carlson Accardi
8ce7ad7b2d genhd: send async notification on media change
Send an uevent to user space to indicate that a media change event has
occurred.

Signed-off-by: Kristen Carlson Accardi <kristen.c.accardi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:12 -07:00
Kristen Carlson Accardi
86ce18d7b7 genhd: expose AN to user space
Allow user space to determine if a disk supports Asynchronous Notification of
media changes.  This is done by adding a new sysfs file "capability_flags",
which is documented in (insert file name).  This sysfs file will export all
disk capabilities flags to user space.  We also define a new flag to define
the media change notification capability.

Signed-off-by: Kristen Carlson Accardi <kristen.c.accardi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Jens Axboe
f653c34dd3 ll_rw_blk: fix gcc 4.2 warning on current_io_context()
current_io_context() is both static and exported with EXPORT_SYMBOL().
As there are no users outside of ll_rw_blk.c itself, just kill the
export.

Problem reported by Martin Michlmayr <tbm@cyrius.com>

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-15 10:44:15 -07:00
Neil Brown
d89d87965d When stacked block devices are in-use (e.g. md or dm), the recursive calls
to generic_make_request can use up a lot of space, and we would rather they
didn't.

As generic_make_request is a void function, and as it is generally not
expected that it will have any effect immediately, it is safe to delay any
call to generic_make_request until there is sufficient stack space
available.

As ->bi_next is reserved for the driver to use, it can have no valid value
when generic_make_request is called, and as __make_request implicitly
assumes it will be NULL (ELEVATOR_BACK_MERGE fork of switch) we can be
certain that all callers set it to NULL.  We can therefore safely use
bi_next to link pending requests together, providing we clear it before
making the real call.

So, we choose to allow each thread to only be active in one
generic_make_request at a time.  If a subsequent (recursive) call is made,
the bio is linked into a per-thread list, and is handled when the active
call completes.

As the list of pending bios is per-thread, there are no locking issues to
worry about.

I say above that it is "safe to delay any call...".  There are, however,
some behaviours of a make_request_fn which would make it unsafe.  These
include any behaviour that assumes anything will have changed after a
recursive call to generic_make_request.

These could include:
 - waiting for that call to finish and call it's bi_end_io function.
   md use to sometimes do this (marking the superblock dirty before
   completing a write) but doesn't any more
 - inspecting the bio for fields that generic_make_request might
   change, such as bi_sector or bi_bdev.  It is hard to see a good
   reason for this, and I don't think anyone actually does it.
 - inspecing the queue to see if, e.g. it is 'full' yet.  Again, I
   think this is very unlikely to be useful, or to be done.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <dm-devel@redhat.com>

Alasdair G Kergon <agk@redhat.com> said:

 I can see nothing wrong with this in principle.

 For device-mapper at the moment though it's essential that, while the bio
 mappings may now get delayed, they still get processed in exactly
 the same order as they were passed to generic_make_request().

 My main concern is whether the timing changes implicit in this patch
 will make the rare data-corrupting races in the existing snapshot code
 more likely. (I'm working on a fix for these races, but the unfinished
 patch is already several hundred lines long.)

 It would be helpful if some people on this mailing list would test
 this patch in various scenarios and report back.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-05-11 13:28:37 +02:00
Linus Torvalds
9a9136e270 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
  sound: convert "sound" subdirectory to UTF-8
  MAINTAINERS: Add cxacru website/mailing list
  include files: convert "include" subdirectory to UTF-8
  general: convert "kernel" subdirectory to UTF-8
  documentation: convert the Documentation directory to UTF-8
  Convert the toplevel files CREDITS and MAINTAINERS to UTF-8.
  remove broken URLs from net drivers' output
  Magic number prefix consistency change to Documentation/magic-number.txt
  trivial: s/i_sem /i_mutex/
  fix file specification in comments
  drivers/base/platform.c: fix small typo in doc
  misc doc and kconfig typos
  Remove obsolete fat_cvf help text
  Fix occurrences of "the the "
  Fix minor typoes in kernel/module.c
  Kconfig: Remove reference to external mqueue library
  Kconfig: A couple of grammatical fixes in arch/i386/Kconfig
  Correct comments in genrtc.c to refer to correct /proc file.
  Fix more "deprecated" spellos.
  Fix "deprecated" typoes.
  ...

Fix trivial comment conflict in kernel/relay.c.
2007-05-09 12:54:17 -07:00
Rafael J. Wysocki
8bb7844286 Add suspend-related notifications for CPU hotplug
Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress.  This
patch introduces such notifications and causes them to be used during
suspend and resume transitions.  It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).

[oleg@tv-sign.ru: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Oleg Nesterov
28e53bddf8 unify flush_work/flush_work_keventd and rename it to cancel_work_sync
flush_work(wq, work) doesn't need the first parameter, we can use cwq->wq
(this was possible from the very beginnig, I missed this).  So we can unify
flush_work_keventd and flush_work.

Also, rename flush_work() to cancel_work_sync() and fix all callers.
Perhaps this is not the best name, but "flush_work" is really bad.

(akpm: this is why the earlier patches bypassed maintainers)

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Auke Kok <auke-jan.h.kok@intel.com>,
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Andrew Morton
19a75d83ff kblockd: use flush_work
Switch the kblockd flushing from a global flush to a more specific
flush_work().

(akpm: bypassed maintainers, sorry.  There are other patches which depend on
this)

Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Jens Axboe <axboe@suse.de>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:51 -07:00
Dave Gilbert
dd2a345f8f Display all possible partitions when the root filesystem failed to mount
Display all possible partitions when the root filesystem is not mounted.
This helps to track spell'o's and missing drivers.

Updated to work with newer kernels.

Example output:

VFS: Cannot open root device "foobar" or unknown-block(0,0)
Please append a correct "root=" boot option; here are the available partitions:
0800    8388608 sda driver: sd
  0801     192748 sda1
  0802    8193150 sda2
0810    4194304 sdb driver: sd
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

[akpm@linux-foundation.org: cleanups, fix printk warnings]
Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Cc: Dave Gilbert <linux@treblig.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:48 -07:00
Michael Opdenacker
59c51591a0 Fix occurrences of "the the "
Signed-off-by: Michael Opdenacker <michael@free-electrons.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2007-05-09 08:57:56 +02:00
Linus Torvalds
02a93208ed Merge branch 'for-2.6.22' of git://git.kernel.dk/data/git/linux-2.6-block
* 'for-2.6.22' of git://git.kernel.dk/data/git/linux-2.6-block:
  [PATCH] ll_rw_blk: fix missing bounce in blk_rq_map_kern()
  [PATCH] splice: always call into page_cache_readahead()
  [PATCH] splice(): fix interaction with readahead
2007-05-08 11:34:52 -07:00
Nick Piggin
c6a632a2b6 as: fix antic_expire check
Fix units mismatch (jiffies vs msecs) in as-iosched.c, spotted by Xiaoning
Ding <dingxn@cse.ohio-state.edu>.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:03 -07:00
Mike Christie
821de3a27b [PATCH] ll_rw_blk: fix missing bounce in blk_rq_map_kern()
I think we might just need the blk_map_kern users now. For the async
execute I added the bounce code already and the block SG_IO has it
atleady. I think the blk_map_kern bounce code got dropped because we
thought the correct gfp_t would be passed in. But I think all we need is
the patch below and all the paths are take care of. The patch is not
tested. Patch was made against scsi-misc.

The last place that is sending non sg commands may just be md/dm-emc.c
but that is is just waiting on alasdair to take some patches that fix
that and a bunch of junk in there including adding bounce support. If
the patch below is ok though and dm-emc finally gets converted then it
will have sg and bonce buffer support.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-05-08 19:12:23 +02:00
Christoph Lameter
0a31bd5f2b KMEM_CACHE(): simplify slab cache creation
This patch provides a new macro

KMEM_CACHE(<struct>, <flags>)

to simplify slab creation. KMEM_CACHE creates a slab with the name of the
struct, with the size of the struct and with the alignment of the struct.
Additional slab flags may be specified if necessary.

Example

struct test_slab {
	int a,b,c;
	struct list_head;
} __cacheline_aligned_in_smp;

test_slab_cache = KMEM_CACHE(test_slab, SLAB_PANIC)

will create a new slab named "test_slab" of the size sizeof(struct
test_slab) and aligned to the alignment of test slab.  If it fails then we
panic.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:55 -07:00
Peter Zijlstra
f98393a64c mm: remove destroy_dirty_buffers from invalidate_bdev()
Remove the destroy_dirty_buffers argument from invalidate_bdev(), it hasn't
been used in 6 years (so akpm says).

find * -name \*.[ch] | xargs grep -l invalidate_bdev |
while read file; do
	quilt add $file;
	sed -ie 's/invalidate_bdev(\([^,]*\),[^)]*)/invalidate_bdev(\1)/g' $file;
done

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:55 -07:00
Linus Torvalds
4f7a307dc6 Merge master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (87 commits)
  [SCSI] fusion: fix domain validation loops
  [SCSI] qla2xxx: fix regression on sparc64
  [SCSI] modalias for scsi devices
  [SCSI] sg: cap reserved_size values at max_sectors
  [SCSI] BusLogic: stop using check_region
  [SCSI] tgt: fix rdma transfer bugs
  [SCSI] aacraid: fix aacraid not finding device
  [SCSI] aacraid: Correct SMC products in aacraid.txt
  [SCSI] scsi_error.c: Add EH Start Unit retry
  [SCSI] aacraid: [Fastboot] Panics for AACRAID driver during 'insmod' for kexec test.
  [SCSI] ipr: Driver version to 2.3.2
  [SCSI] ipr: Faster sg list fetch
  [SCSI] ipr: Return better qc_issue errors
  [SCSI] ipr: Disrupt device error
  [SCSI] ipr: Improve async error logging level control
  [SCSI] ipr: PCI unblock config access fix
  [SCSI] ipr: Fix for oops following SATA request sense
  [SCSI] ipr: Log error for SAS dual path switch
  [SCSI] ipr: Enable logging of debug error data for all devices
  [SCSI] ipr: Add new PCI-E IDs to device table
  ...
2007-05-05 13:30:44 -07:00
Greg Kroah-Hartman
823bccfc40 remove "struct subsystem" as it is no longer needed
We need to work on cleaning up the relationship between kobjects, ksets and
ktypes.  The removal of 'struct subsystem' is the first step of this,
especially as it is not really needed at all.

Thanks to Kay for fixing the bugs in this patch.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-02 18:57:59 -07:00
Jens Axboe
07e4470805 Merge branch 'cfq' into for-linus 2007-04-30 09:09:27 +02:00
Jens Axboe
2a12dcd71a [PATCH] elevator: elv_list_lock does not need irq disabling
It's never grabbed from irq context, so just make it plain spin_lock().

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30 09:08:17 +02:00
Jens Axboe
597bc485d6 cfq-iosched: speedup cic rb lookup
We often lookup the same queue many times in succession, so cache
the last looked up queue to avoid browsing the rbtree.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30 09:01:23 +02:00
Jens Axboe
4e521c27ee ll_rw_blk: add io_context private pointer
To be used by as/cfq as they see fit.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30 09:01:23 +02:00
Vasily Tarasov
91fac317a3 cfq-iosched: get rid of cfqq hash
cfq hash is no more necessary.  We always can get cfqq from io context.
cfq_get_io_context_noalloc() function is introduced, because we don't
want to allocate cic on merging and checking may_queue.  In order to
identify sync queue we've used hash key = CFQ_KEY_ASYNC. Since hash is
eliminated we need to use other criterion: sync flag for queue is added.
In all places where we dig in rb_tree we're in current context, so no
additional locking is required.

Advantages of this patch: no additional memory for hash, no seeking in
hash, code is cleaner. But it is necessary now to seek cic in per-ioc
rbtree, but it is faster:
- most processes work only with few devices
- most systems have only few block devices
- it is a rb-tree

Signed-off-by: Vasily Tarasov <vtaras@openvz.org>

Changes by me:

- Merge into CFQ devel branch
- Get rid of cfq_get_io_context_noalloc()
- Fix various bugs with dereferencing cic->cfqq[] with offset other
  than 0 or 1.
- Fix bug in cfqq setup, is_sync condition was reversed.
- Fix bug where only bio_sync() is used, we need to check for a READ too

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30 09:01:23 +02:00
Jens Axboe
cc19747977 cfq-iosched: tighten queue request overlap condition
For tagged devices, allow overlap of requests if the idle window
isn't enabled on the current active queue.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30 09:01:23 +02:00
Jens Axboe
3ed9a2965c cfq-iosched: improve sync vs async workloads
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30 09:01:23 +02:00
Jens Axboe
1be92f2fc7 cfq-iosched: never allow an async queue idling
We don't enable it by default, don't let it get enabled during
runtime.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30 09:01:22 +02:00
Jens Axboe
20e493a8d0 cfq-iosched: get rid of ->dispatch_slice
We can track it fairly accurately locally, let the slice handling
take care of the rest.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30 09:01:22 +02:00
Jens Axboe
6084cdda0e cfq-iosched: don't pass unused preemption variable around
We don't use it anymore in the slice expiry handling.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30 09:01:22 +02:00
Jens Axboe
edd75ffd92 cfq-iosched: get rid of ->cur_rr and ->cfq_list
It's only used for preemption now that the IDLE and RT queues also
use the rbtree. If we pass an 'add_front' variable to
cfq_service_tree_add(), we can set ->rb_key to 0 to force insertion
at the front of the tree.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30 09:01:22 +02:00
Jens Axboe
67e6b49e39 cfq-iosched: slice offset should take ioprio into account
Use the max_slice-cur_slice as the multipler for the insertion offset.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30 09:01:22 +02:00
Jens Axboe
498d3aa2b4 [PATCH] cfq-iosched: style cleanups and comments
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30 09:01:22 +02:00
Jens Axboe
67060e3799 cfq-iosched: sort IDLE queues into the rbtree
Same treatment as the RT conversion, just put the sorted idle
branch at the end of the tree.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30 09:01:22 +02:00
Jens Axboe
0c534e0a46 cfq-iosched: sort RT queues into the rbtree
Currently CFQ does a linked insert into the current list for RT
queues. We can just factor the class into the rb insertion,
and then we don't have to treat RT queues in a special way. It's
faster, too.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30 09:01:22 +02:00
Jens Axboe
cc09e2990f [PATCH] cfq-iosched: speed up rbtree handling
For cases where the rbtree is mainly used for sorting and min retrieval,
a nice speedup of the rbtree code is to maintain a cache of the leftmost
node in the tree.

Also spotted in the CFS CPU scheduler code.

Improved by Alan D. Brunelle <Alan.Brunelle@hp.com> by updating the
leftmost hint in cfq_rb_first() if it isn't set, instead of only
updating it on insert.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30 09:01:21 +02:00
Jens Axboe
d9e7620e60 cfq-iosched: rework the whole round-robin list concept
Drawing on some inspiration from the CFS CPU scheduler design, overhaul
the pending cfq_queue concept list management. Currently CFQ uses a
doubly linked list per priority level for sorting and service uses.
Kill those lists and maintain an rbtree of cfq_queue's, sorted by when
to service them.

This unfortunately means that the ionice levels aren't as strong
anymore, will work on improving those later. We only scale the slice
time now, not the number of times we service. This means that latency
is better (for all priority levels), but that the distinction between
the highest and lower levels aren't as big.

The diffstat speaks for itself.

 cfq-iosched.c |  363 +++++++++++++++++---------------------------------
 1 file changed, 125 insertions(+), 238 deletions(-)

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30 09:01:21 +02:00
Jens Axboe
1afba0451c cfq-iosched: minor updates
- Move the queue_new flag clear to when the queue is selected
- Only select the non-first queue in cfq_get_best_queue(), if there's
  a substantial difference between the best and first.
- Get rid of ->busy_rr
- Only select a close cooperator, if the current queue is known to take
  a while to "think".

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30 09:01:21 +02:00
Jens Axboe
6d048f5310 cfq-iosched: development update
- Implement logic for detecting cooperating processes, so we
  choose the best available queue whenever possible.

- Improve residual slice time accounting.

- Remove dead code: we no longer see async requests coming in on
  sync queues. That part was removed a long time ago. That means
  that we can also remove the difference between cfq_cfqq_sync()
  and cfq_cfqq_class_sync(), they are now indentical. And we can
  kill the on_dispatch array, just make it a counter.

- Allow a process to go into the current list, if it hasn't been
  serviced in this scheduler tick yet.

Possible future improvements including caching the cfqq lookup
in cfq_close_cooperator(), so we don't have to look it up twice.
cfq_get_best_queue() should just use that last decision instead
of doing it again.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30 09:01:21 +02:00
Jens Axboe
1e3335de05 cfq-iosched: improve preemption for cooperating tasks
When testing the syslet async io approach, I discovered that CFQ
sometimes didn't perform as well as expected. cfq_should_preempt()
needs to better check for cooperating tasks, so fix that by allowing
preemption of an equal priority queue if the recently queued request
is as good a candidate for IO as the one we are currently waiting for.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-04-30 09:01:21 +02:00
Jens Axboe
5044eed488 cfq-iosched: fix alias + front merge bug
There's a really rare and obscure bug in CFQ, that causes a crash in
cfq_dispatch_insert() due to rq == NULL.  One example of the resulting
oops is seen here:

	http://lkml.org/lkml/2007/4/15/41

Neil correctly diagnosed the situation for how this can happen: if two
concurrent requests with the exact same sector number (due to direct IO
or aliasing between MD and the raw device access), the alias handling
will add the request to the sortlist, but next_rq remains NULL.

Read the more complete analysis at:

	http://lkml.org/lkml/2007/4/25/57

This looks like it requires md to trigger, even though it should
potentially be possible to due with O_DIRECT (at least if you edit the
kernel and doctor some of the unplug calls).

The fix is to move the ->next_rq update to when we add a request to the
rbtree. Then we remove the possibility for a request to exist in the
rbtree code, but not have ->next_rq correctly updated.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-25 08:41:48 -07:00
Jens Axboe
a993800655 cfq-iosched: fix sequential write regression
We have a 10-15% performance regression for sequential writes on TCQ/NCQ
enabled drives in 2.6.21-rcX after the CFQ update went in.  It has been
reported by Valerie Clement <valerie.clement@bull.net> and the Intel
testing folks.  The regression is because of CFQ's now more aggressive
queue control, limiting the depth available to the device.

This patches fixes that regression by allowing a greater depth when only
one queue is busy.  It has been tested to not impact sync-vs-async
workloads too much - we still do a lot better than 2.6.20.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-20 22:56:29 -07:00
Alan Stern
44ec95425c [SCSI] sg: cap reserved_size values at max_sectors
This patch (as857) modifies the SG_GET_RESERVED_SIZE and
SG_SET_RESERVED_SIZE ioctls in the sg driver, capping the values at
the device's request_queue's max_sectors value.  This will permit
cdrecord to obtain a legal value for the maximum transfer length,
fixing Bugzilla #7026.

The patch also caps the initial reserved_size value.  There's no
reason to have a reserved buffer larger than max_sectors, since it
would be impossible to use the extra space.

The corresponding ioctls in the block layer are modified similarly,
and the initial value for the reserved_size is set as large as
possible.  This will effectively make it default to max_sectors.
Note that the actual value is meaningless anyway, since block devices
don't have a reserved buffer.

Finally, the BLKSECTGET ioctl is added to sg, so that there will be a
uniform way for users to determine the actual max_sectors value for
any raw SCSI transport.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Douglas Gilbert <dougg@torque.net>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
2007-04-17 18:09:56 -04:00
Andrew Morton
2363cc0264 [PATCH] remove protection of LANANA-reserved majors
Revert all this.  It can cause device-mapper to receive a different major from
earlier kernels and it turns out that the Amanda backup program (via GNU tar,
apparently) checks major numbers on files when performing incremental backups.

Which is a bit broken of Amanda (or tar), but this feature isn't important
enough to justify the churn.

Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-04 21:12:47 -07:00
Thibaut VARENE
1ffb96c587 make elv_register() output atomic
Booting 2.6.21-rc3-g45592145 I noticed the following on one of my
machines in the bootlog:

io scheduler noop registered<6>Time: jiffies clocksource has been installed.

io scheduler deadline registered (default)

Looking at block/elevator.c, it appears that elv_register() uses two
consecutive printks in a non-atomic way, leading to the above glitch. The
attached trivial patch fixes this issue, by using a single printk.

Signed-off-by: Thibaut VARENE <varenet@parisc-linux.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-03-27 08:53:04 +02:00
Vasily Tarasov
f772b3d9ca block: blk_max_pfn is somtimes wrong
There is a small problem in handling page bounce.

At the moment blk_max_pfn equals max_pfn, which is in fact not maximum
possible _number_ of a page frame, but the _amount_ of page frames.  For
example for the 32bit x86 node with 4Gb RAM, max_pfn = 0x100000, but not
0xFFFF.

request_queue structure has a member q->bounce_pfn and queue needs bounce
pages for the pages _above_ this limit.  This routine is handled by
blk_queue_bounce(), where the following check is produced:

	if (q->bounce_pfn >= blk_max_pfn)
		return;

Assume, that a driver has set q->bounce_pfn to 0xFFFF, but blk_max_pfn
equals 0x10000.  In such situation the check above fails and for each bio
we always fall down for iterating over pages tied to the bio.

I want to notice, that for quite a big range of device drivers (ide, md,
...) such problem doesn't happen because they use BLK_BOUNCE_ANY for
bounce_pfn.  BLK_BOUNCE_ANY is defined as blk_max_pfn << PAGE_SHIFT, and
then the check above doesn't fail.  But for other drivers, which obtain
reuired value from drivers, it fails.  For example sata_nv uses
ATA_DMA_MASK or dev->dma_mask.

I propose to use (max_pfn - 1) for blk_max_pfn.  And the same for
blk_max_low_pfn.  The patch also cleanses some checks related with
bounce_pfn.

Signed-off-by: Vasily Tarasov <vtaras@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-03-27 08:52:47 +02:00
Peter Zijlstra
6d740cd5b1 [PATCH] lockdep: annotate BLKPG_DEL_PARTITION
>=============================================
>[ INFO: possible recursive locking detected ]
>2.6.19-1.2909.fc7 #1
>---------------------------------------------
>anaconda/587 is trying to acquire lock:
> (&bdev->bd_mutex){--..}, at: [<c05fb380>] mutex_lock+0x21/0x24
>
>but task is already holding lock:
> (&bdev->bd_mutex){--..}, at: [<c05fb380>] mutex_lock+0x21/0x24
>
>other info that might help us debug this:
>1 lock held by anaconda/587:
> #0:  (&bdev->bd_mutex){--..}, at: [<c05fb380>] mutex_lock+0x21/0x24
>
>stack backtrace:
> [<c0405812>] show_trace_log_lvl+0x1a/0x2f
> [<c0405db2>] show_trace+0x12/0x14
> [<c0405e36>] dump_stack+0x16/0x18
> [<c043bd84>] __lock_acquire+0x116/0xa09
> [<c043c960>] lock_acquire+0x56/0x6f
> [<c05fb1fa>] __mutex_lock_slowpath+0xe5/0x24a
> [<c05fb380>] mutex_lock+0x21/0x24
> [<c04d82fb>] blkdev_ioctl+0x600/0x76d
> [<c04946b1>] block_ioctl+0x1b/0x1f
> [<c047ed5a>] do_ioctl+0x22/0x68
> [<c047eff2>] vfs_ioctl+0x252/0x265
> [<c047f04e>] sys_ioctl+0x49/0x63
> [<c0404070>] syscall_call+0x7/0xb

Annotate BLKPG_DEL_PARTITION's bd_mutex locking and add a little comment
clarifying the bd_mutex locking, because I confused myself and initially
thought the lock order was wrong too.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20 17:10:16 -08:00
Andrew Morton
b446b60e4e [PATCH] rework reserved major handling
Several people have reported failures in dynamic major device number handling
due to the recent changes in there to avoid handing out the local/experimental
majors.

Rolf reports that this is due to a gcc-4.1.0 bug.

The patch refactors that code a lot in an attempt to provoke the compiler into
behaving.

Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20 17:10:13 -08:00
Jesper Juhl
a8e14b950c update I/O sched Kconfig help texts - CFQ is now default, not AS.
Change I/O scheduler description to correctly show CFQ as being the default
scheduler and not the anticipatory scheduler that previously was default.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2007-02-17 20:08:22 +01:00
Arjan van de Ven
2b8693c061 [PATCH] mark struct file_operations const 3
Many struct file_operations in the kernel can be "const".  Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data.  In addition it'll catch accidental writes at compile time to
these shared resources.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:45 -08:00
Andrew Morton
fdf892be32 [PATCH] register_blkdev(): don't hand out the LOCAL/EXPERIMENTAL majors
As pointed out in http://bugzilla.kernel.org/show_bug.cgi?id=7922, dynamic
blockdev major allocation can hand out majors which LANANA has defined as
being for local/experimental use.

Cc: Torben Mathiasen <device@lanana.org>
Cc: Greg KH <greg@kroah.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tomas Klas <tomas.klas@mepatek.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:27 -08:00
Jens Axboe
9ede209e83 cfq-iosched: improve continue or break logic in cfq_dispatch
This improves performance considerably for sync requests when you
have command queuing enabled.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:45 +01:00
Jens Axboe
28f95cbc3e cfq-iosched: remove the implicit queue kicking in slice expire
We only really need it for a process going away, so move it to
those locations.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:45 +01:00
Jens Axboe
3c6bd2f879 cfq-iosched: check whether a queue timed out in accounting
Makes it more fair for the residual slice count.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:45 +01:00
Jens Axboe
cb8874119e cfq-iosched: tweak the FIFO checking
We currently check the FIFO once per slice. Optimize that a bit and
only do it as the first thing for a new slice, so we don't end up
doing a single request and then seek to the FIFO requests.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:45 +01:00
Jens Axboe
1792669cc1 cfq-iosched: don't pass in queue for cfq_arm_slice_timer()
It must always be the active queue, otherwise it's a bug. So just
use the active_queue, don't pass it in explicitly.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:45 +01:00
Jens Axboe
c5b680f3b7 cfq-iosched: account for slice over/under time
If a slice uses less than it is entitled to (or perhaps more), include
that in the decision on how much time to give it the next time it
gets serviced.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:45 +01:00
Jens Axboe
44f7c16065 cfq-iosched: defer slice activation to first request being active
This better matches what time the queue is actually spending doing
IO.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:45 +01:00
Jens Axboe
99f9628aba [PATCH] cfq-iosched: use last service point as the fairness criteria
Right now we use slice_start, which gives async queues an unfair
advantage. Chance that to service_last, and base the resorter
on that.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:45 +01:00
Jens Axboe
b0b8d74941 cfq-iosched: document the cfqq flags
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:44 +01:00
Jens Axboe
98e41c7dfc [PATCH] cfq-iosched: move on_rr check into cfq_resort_rr_list()
Move the on_rr check into cfq_resort_rr_list(), every call site
needs to check it anyway.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:44 +01:00
Jens Axboe
aaf1228ddf cfq-iosched: remove cfq_io_context last_queue
It hasn't been used for a while, kill it off and remove the old
if 0 code chunk.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:44 +01:00
Jens Axboe
783660b2f6 elevator: don't sort reads between writes
Don't allow elv_dispatch_sort() to mix reads and writes together,
it's rarely a good idea.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:44 +01:00
Jens Axboe
cad9751642 elevator: abstract out the activate and deactivate functions
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-02-11 23:14:44 +01:00
Linus Torvalds
c827ba4cb4 Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6:
  [SPARC64]: Update defconfig.
  [SPARC64]: Add PCI MSI support on Niagara.
  [SPARC64] IRQ: Use irq_desc->chip_data instead of irq_desc->handler_data
  [SPARC64]: Add obppath sysfs attribute for SBUS and PCI devices.
  [PARTITION]: Add whole_disk attribute.
2007-02-11 11:37:45 -08:00
Mathieu Desnoyers
23c887522e [PATCH] Relay: add CPU hotplug support
Mathieu originally needed to add this for tracing Xen, but it's something
that's needed for any application that can be tracing while cpus are added.

unplug isn't supported by this patch.  The thought was that at minumum a new
buffer needs to be added when a cpu comes up, but it wasn't worth the effort
to remove buffers on cpu down since they'd be freed soon anyway when the
channel was closed.

[zanussi@us.ibm.com: avoid lock_cpu_hotplug deadlock]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:28 -08:00
Fabio Massimo Di Nitto
d18d7682c1 [PARTITION]: Add whole_disk attribute.
Some partitioning systems create special partitions that
span the entire disk.  One example are Sun partitions, and
this whole-disk partition exists to tell the firmware the
extent of the entire device so it can load the boot block
and do other things.

Such partitions should not be treated as normal partitions,
because all the other partitions overlap this whole-disk one.
So we'd see multiple instances of the same UUID etc. which
we do not want.  udev and friends can thus search for this
'whole_disk' attribute and use it to decide to ignore the
partition.

Signed-off-by: Fabio Massimo Di Nitto <fabbione@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-02-10 23:50:00 -08:00
Neil Brown
387bb17374 [PATCH] md: fix various bugs with aligned reads in RAID5
It is possible for raid5 to be sent a bio that is too big for an underlying
device.  So if it is a READ that we pass stright down to a device, it will
fail and confuse RAID5.

So in 'chunk_aligned_read' we check that the bio fits within the parameters
for the target device and if it doesn't fit, fall back on reading through
the stripe cache and making lots of one-page requests.

Note that this is the earliest time we can check against the device because
earlier we don't have a lock on the device, so it could change underneath
us.

Also, the code for handling a retry through the cache when a read fails has
not been tested and was badly broken.  This patch fixes that code.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: "Kai" <epimetreus@fastmail.fm>
Cc: <stable@suse.de>
Cc: <org@suse.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-09 09:25:46 -08:00
Mike Christie
c0d4d573fe [PATCH] Fix SG_IO timeout jiffy conversion
Commit 85e04e371b cleaned up the timeout
conversion, but did it exactly the wrong way.  We get msecs from user
space, and should convert them into jiffies. Not the other way around.

Here is a fix with the overflow check sg.c has added in.  This fixes DVD
burnign with Nero.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
[ "you'll be wanting a comma there" - Andrew ]
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-29 20:32:03 -08:00
Linas Vepstas
95543179f1 [PATCH] elevator: move clearing of unplug flag earlier
A flag was recently added to the elevator code to avoid
performing an unplug when reuests are being re-queued.
The goal of this flag was to avoid a deep recursion that
can occur when re-queueing requests after a SCSI device/host
reset.  See http://lkml.org/lkml/2006/5/17/254

However, that fix added the flag near the bottom of a case
statement, where an earlier break (in an if statement) could
transport one out of the case, without setting the flag.
This patch sets the flag earlier in the case statement.

I re-discovered the deep recursion recently during testing;
I was told that it was a known problem, and the fix to it was
in the kernel I was testing. Indeed it was ... but it didn't
fix the bug. With the patch below, I no longer see the bug.

Signed-off by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Jens Axboe <axboe@suse.de>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-23 11:01:17 -08:00
Jens Axboe
ec8acb6904 [PATCH] cfq-iosched: merging problem
Two issues:

- The final return 1 should be a return 0, otherwise comparing cfqq is
  a noop.

- bio_sync() only checks the sync flag, while rq_is_sync() checks both
  for READ and sync. The latter is what we want. Expand the bio check
  to include reads, and relax the restriction to allow merging of async
  io into sync requests.

In the future we want to clean up the SYNC logic, right now it means
both sync request (such as READ and O_DIRECT WRITE) and unplug-on-issue.
Leave that for later.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-02 09:46:16 -08:00
Jens Axboe
719d34027e [PATCH] cfq-iosched: tighten allow merge criteria
The logic in cfq_allow_merge() wasn't clear enough - basically allow
merging for the same queues only.  Do a fast check for 'rq and bio both
sync/async' before doing the cfqq hash lookup.

This is verified to work with the fixed elv_try_merge() from commit
bb4067e341.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 14:13:08 -08:00
Randy Dunlap
af9997e426 [PATCH] fix kernel-doc warnings in 2.6.20-rc1
Fix kernel-doc warnings in 2.6.20-rc1.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:47 -08:00
Jens Axboe
bb4067e341 [PATCH] elevator: fixup typo in merge logic
The recent io scheduler allow_merge commit left the block layer with
no merging, oops. This patch fixes that up.

That means the CFQ change needs to be verified again, it might not fix
the original bug now.  But that's a seperate thing, I'll double check
that tomorrow.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-21 22:01:04 -08:00
Jens Axboe
da77526502 [PATCH] cfq-iosched: don't allow sync merges across queues
Currently we allow any merge, even if the io originates from different
processes. This can cause really bad starvation and unfairness, if those
ios happen to be synchronous (reads or direct writes).

So add a allow_merge hook to the io scheduler ops, so an io scheduler can
help decide whether a bio/process combination may be merged with an
existing request.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-20 11:04:12 +01:00
Jens Axboe
8e5cfc45e7 [PATCH] Fixup blk_rq_unmap_user() API
The blk_rq_unmap_user() API is not very nice. It expects the caller to
know that rq->bio has to be reset to the original bio, and it will
silently do nothing if that is not done. Instead make it explicit that
we need to pass in the first bio, by expecting a bio argument.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-19 11:12:46 +01:00
Jens Axboe
48785bb9fa [PATCH] __blk_rq_unmap_user() fails to return error
If the bio is user copied, the copy back could return -EFAULT. Make
sure we return any error seen during unmapping.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-19 11:07:59 +01:00
Jens Axboe
9c9381f942 [PATCH] __blk_rq_map_user() doesn't need to grab the queue_lock
It was for driver private back_merge_fn hooks, but they don't exist
anymore.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-19 08:34:17 +01:00
Jens Axboe
1aa4f24fe9 [PATCH] Remove queue merging hooks
We have full flexibility of merging parameters now, so we can remove the
hooks that define back/front/request merge strategies. Nobody is using
them anymore.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-19 08:33:11 +01:00
Jens Axboe
2985259b0e [PATCH] ->nr_sectors and ->hard_nr_sectors are not used for BLOCK_PC requests
It's a file system thing, for block requests the only size used in the
io paths is ->data_len as it is in bytes, not sectors.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-19 08:27:31 +01:00
Jens Axboe
c65fb61b3c [PATCH] Allow as-iosched to be unloaded
We implemented the missing bits to allow this some time ago, and
they are integrated in AS. So remove the __module_get() to allow
the module to be unloaded.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-13 13:25:18 +01:00
Jens Axboe
7749a8d423 [PATCH] Propagate down request sync flag
We need to do this, otherwise the io schedulers don't get access to the
sync flag. Then they cannot tell the difference between a regular write
and an O_DIRECT write, which can cause a performance loss.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-13 13:02:26 +01:00
FUJITA Tomonori
335302618f [PATCH] remove unnecessary blk_queue_bounce in SG_IO
When I converted the original patch, I left unnecessary blk_queue_bounce in
SG_IO.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-12 10:26:55 +01:00
FUJITA Tomonori
77d172ce27 [PATCH] fix SG_IO bio leak
This patch fixes bio leaks in SG_IO. rq->bio can be changed after io
completion, so we need to reset rq->bio before calling blk_rq_unmap_user()

http://marc.theaimsgroup.com/?l=linux-kernel&m=116570666807983&w=2

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-12 10:22:23 +01:00
Boaz Harrosh
2b02a17920 [PATCH] remove blk_queue_activity_fn
While working on bidi support at struct request level
I have found that blk_queue_activity_fn is actually never used.
The only user is in ide-probe.c with this code:

	/* enable led activity for disk drives only */
	if (drive->media == ide_disk && hwif->led_act)
		blk_queue_activity_fn(q, hwif->led_act, drive);

And led_act is never initialized anywhere.
(Looking back at older kernels it was used in the PPC arch, but was removed around 2.6.18)
Unless it is all for future use off course.
(this patch is against linux-2.6-block.git as off 2006/12/4)

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-12 10:22:23 +01:00
Andrew Morton
faccbd4b26 [PATCH] io-accounting: read accounting
Wire up read accounting for block devices, within submit_bio().

Cc: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Cc: David Wright <daw@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:41 -08:00
Akinobu Mita
c17bb49517 [PATCH] fault-injection capability for disk IO
This patch provides fault-injection capability for disk IO.

Boot option:

fail_make_request=<probability>,<interval>,<space>,<times>

	<interval> -- specifies the interval of failures.

	<probability> -- specifies how often it should fail in percent.

	<space> -- specifies the size of free space where disk IO can be issued
		   safely in bytes.

	<times> -- specifies how many times failures may happen at most.

Debugfs:

/debug/fail_make_request/interval
/debug/fail_make_request/probability
/debug/fail_make_request/specifies
/debug/fail_make_request/times

Example:

	fail_make_request=10,100,0,-1
	echo 1 > /sys/blocks/hda/hda1/make-it-fail

generic_make_request() on /dev/hda1 fails once per 10 times.

Cc: Jens Axboe <axboe@suse.de>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:02 -08:00
Josef Sipek
c5a20b6c26 [PATCH] struct path: convert block
Signed-off-by: Josef Sipek <jsipek@fsl.cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:44 -08:00
Peter Zijlstra
2e7b651df1 [PATCH] remove the old bd_mutex lockdep annotation
Remove the old complex and crufty bd_mutex annotation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jason Baron <jbaron@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:38 -08:00
Ingo Molnar
0231606785 [PATCH] hotplug CPU: clean up hotcpu_notifier() use
There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
generating compiler warnings of unused symbols, hence forcing people to add
#ifdefs.

the compiler can skip truly unused functions just fine:

    text    data     bss     dec     hex filename
 1624412  728710 3674856 6027978  5bfaca vmlinux.before
 1624412  728710 3674856 6027978  5bfaca vmlinux.after

[akpm@osdl.org: topology.c fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:39 -08:00
Christoph Lameter
e18b890bb0 [PATCH] slab: remove kmem_cache_t
Replace all uses of kmem_cache_t with struct kmem_cache.

The patch was generated using the following script:

	#!/bin/sh
	#
	# Replace one string by another in all the kernel sources.
	#

	set -e

	for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
		quilt add $file
		sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
		mv /tmp/$$ $file
		quilt refresh
	done

The script was run like this

	sh replace kmem_cache_t "struct kmem_cache"

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:25 -08:00
Alan Stern
a120586873 [PATCH] Allow NULL pointers in percpu_free
The patch (as824b) makes percpu_free() ignore NULL arguments, as one would
expect for a deallocation routine.  (Note that free_percpu is #defined as
percpu_free in include/linux/percpu.h.) A few callers are updated to remove
now-unneeded tests for NULL.  A few other callers already seem to assume
that passing a NULL pointer to percpu_free() is okay!

The patch also removes an unnecessary NULL check in percpu_depopulate().

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:22 -08:00
David Howells
4796b71fbb Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:

	drivers/pcmcia/ds.c

Fix up merge failures with Linus's head and fix new compile failures.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-12-06 15:01:18 +00:00
Linus Torvalds
ec0bf39a47 Merge master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (73 commits)
  [SCSI] aic79xx: Add ASC-29320LPE ids to driver
  [SCSI] stex: version update
  [SCSI] stex: change wait loop code
  [SCSI] stex: add new device type support
  [SCSI] stex: update device id info
  [SCSI] stex: adjust default queue length
  [SCSI] stex: add value check in hard reset routine
  [SCSI] stex: fix controller_info command handling
  [SCSI] stex: fix biosparam calculation
  [SCSI] megaraid: fix MMIO casts
  [SCSI] tgt: fix undefined flush_dcache_page() problem
  [SCSI] libsas: better error handling in sas_expander.c
  [SCSI] lpfc 8.1.11 : Change version number to 8.1.11
  [SCSI] lpfc 8.1.11 : Misc Fixes
  [SCSI] lpfc 8.1.11 : Add soft_wwnn sysfs attribute, rename soft_wwn_enable
  [SCSI] lpfc 8.1.11 : Removed decoding of PCI Subsystem Id
  [SCSI] lpfc 8.1.11 : Add MSI (Message Signalled Interrupts) support
  [SCSI] lpfc 8.1.11 : Adjust LOG_FCP logging
  [SCSI] lpfc 8.1.11 : Fix Memory leaks
  [SCSI] lpfc 8.1.11 : Fix lpfc_multi_ring_support
  ...
2006-12-05 16:09:46 -08:00
David Howells
9db7372445 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:

	drivers/ata/libata-scsi.c
	include/linux/libata.h

Futher merge of Linus's head and compilation fixups.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-12-05 17:01:28 +00:00
David Howells
4c1ac1b491 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:

	drivers/infiniband/core/iwcm.c
	drivers/net/chelsio/cxgb2.c
	drivers/net/wireless/bcm43xx/bcm43xx_main.c
	drivers/net/wireless/prism54/islpci_eth.c
	drivers/usb/core/hub.h
	drivers/usb/input/hid-core.c
	net/core/netpoll.c

Fix up merge failures with Linus's head and fix new compilation failures.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-12-05 14:37:56 +00:00
Matthew Wilcox
e62438630c [PATCH] Centralise definitions of sector_t and blkcnt_t
CONFIG_LBD and CONFIG_LSF are spread into asm/types.h for no particularly
good reason.

Centralising the definition in linux/types.h means that arch maintainers
don't need to bother adding it, as well as fixing the problem with
x86-64 users being asked to make a decision that has absolutely no
effect.

The H8/300 porters seem particularly confused since I'm not aware of any
microcontrollers that need to support 2TB filesystems.

Signed-off-by: Matthew Wilcox <matthew@wil.cx>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-04 19:41:15 -08:00
Jens Axboe
a863055b10 [PATCH] blktrace: don't return blktrace_seq from trace_note()
Only the process notifier needs it, and it can set it manually.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-04 09:30:58 +01:00
Jens Axboe
d3d9d2a5ea [PATCH] blktrace: uninline trace_note()
It's too large to inline. Additionally clean it up, by fast pathing
the likely path.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-04 09:27:41 +01:00
Jens Axboe
bb37b94c68 [BLOCK] Cleanup unused variable passing
- ->init_queue() does not need the elevator passed in
- ->put_request() is a hot path and need not have the queue passed in
- cfq_update_io_seektime() does not need cfqd passed in

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-01 10:42:33 +01:00
Mike Christie
0e75f9063f [PATCH] block: support larger block pc requests
This patch modifies blk_rq_map/unmap_user() and the cdrom and scsi_ioctl.c
users so that it supports requests larger than bio by chaining them together.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-01 10:40:55 +01:00
Olaf Kirch
be1c63411a [PATCH] blktrace: add timestamp message
This adds a new timestamp message to blktrace, giving the timeofday when
we starting tracing. This helps user space correlate block trace events
with eg an application strace.

This requires a (compatible) update to blkparse. The changed blkparse
is still able to process traces generated by older kernels, and older
versions of blkparse should silently ignore the new records (because
they have a pid of 0).

Signed-off-by: Olaf Kirch <okir@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-12-01 10:39:12 +01:00
James Bottomley
0bd2af4683 Merge ../scsi-rc-fixes-2.6 2006-11-22 12:06:44 -06:00
David Howells
65f27f3844 WorkStruct: Pass the work_struct pointer instead of context data
Pass the work_struct pointer to the work function rather than context data.
The work function can use container_of() to work out the data.

For the cases where the container of the work_struct may go away the moment the
pending bit is cleared, it is made possible to defer the release of the
structure by deferring the clearing of the pending bit.

To make this work, an extra flag is introduced into the management side of the
work_struct.  This governs auto-release of the structure upon execution.

Ordinarily, the work queue executor would release the work_struct for further
scheduling or deallocation by clearing the pending bit prior to jumping to the
work function.  This means that, unless the driver makes some guarantee itself
that the work_struct won't go away, the work function may not access anything
else in the work_struct or its container lest they be deallocated..  This is a
problem if the auxiliary data is taken away (as done by the last patch).

However, if the pending bit is *not* cleared before jumping to the work
function, then the work function *may* access the work_struct and its container
with no problems.  But then the work function must itself release the
work_struct by calling work_release().

In most cases, automatic release is fine, so this is the default.  Special
initiators exist for the non-auto-release case (ending in _NAR).


Signed-Off-By: David Howells <dhowells@redhat.com>
2006-11-22 14:55:48 +00:00
Tejun Heo
097b8457da [PATCH] scsi: clear garbage after CDBs on SG_IO
ATAPI devices transfer fixed number of bytes for CDBs (12 or 16).  Some
ATAPI devices choke when shorter CDB is used and the left bytes contain
garbage.  Block SG_IO cleared left bytes but SCSI SG_IO didn't.  This patch
makes SCSI SG_IO clear it and simplify CDB clearing in block SG_IO.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Mathieu Fluhr <mfluhr@nero.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Douglas Gilbert <dougg@torque.net>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: <stable@kernel.org>
Acked-by: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-16 11:43:38 -08:00
Hannes Reinecke
85e04e371b [SCSI] block: convert jiffies to msecs in scsi_ioctl()
Use the proper conversion function for convert jiffies to msecs in
sg_io().

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
2006-11-15 17:57:33 -06:00
Jens Axboe
616e8a091a [PATCH] Fix bad data direction in SG_IO
Contrary to what the name misleads you to believe, SG_DXFER_TO_FROM_DEV
is really just a normal read seen from the device side.

This patch fixes http://lkml.org/lkml/2006/10/13/100

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-13 09:47:00 -08:00
Andrew Morton
df66b8552b [PATCH] tidy "md: check bio address after mapping through partitions"
Neil's xterms are too wide.

Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-03 12:27:55 -08:00
Jens Axboe
5fccbf61be [PATCH] CFQ: request <-> request merging rr_list fixup
In very rare circumstances would we be pruning a merged request and at
the same time delete the implicated cfqq from the rr_list, and not readd
it when the merged request got added. This could cause io stalls until
that process issued io again.

Fix it up by putting the rr_list add handling into cfq_add_rq_rb(),
identical to how pruning is handled in cfq_del_rq_rb(). This fixes a
hang reproducible with fsx-linux.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-31 08:12:45 -08:00
NeilBrown
5ddfe9691c [PATCH] md: check bio address after mapping through partitions.
Partitions are not limited to live within a device.  So we should range
check after partition mapping.

Note that 'maxsector' was being used for two different things.  I have
split off the second usage into 'old_sector' so that maxsector can be still
be used for it's primary usage later in the function.

Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-31 08:07:01 -08:00
Jens Axboe
c1b707d253 [PATCH] CFQ: bad locking in changed_ioprio()
When the ioprio code recently got juggled a bit, a bug was introduced.
changed_ioprio() is no longer called with interrupts disabled, so using
plain spin_lock() on the queue_lock is a bug.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-30 11:01:50 -08:00
Jens Axboe
0261d6886e [PATCH] CFQ: use irq safe locking in cfq_cic_link()
If cfq_set_request() is called for a new process AND a non-fs io
request (so that __GFP_WAIT may not be set), cfq_cic_link() may
use spin_lock_irq() and spin_unlock_irq() with interrupts already
disabled.

Fix is to always use irq safe locking in cfq_cic_link()

Acked-By: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-30 10:21:58 -08:00
Andrew Morton
3fcfab16c5 [PATCH] separate bdi congestion functions from queue congestion functions
Separate out the concept of "queue congestion" from "backing-dev congestion".
Congestion is a backing-dev concept, not a queue concept.

The blk_* congestion functions are retained, as wrappers around the core
backing-dev congestion functions.

This proper layering is needed so that NFS can cleanly use the congestion
functions, and so that CONFIG_BLOCK=n actually links.

Cc: "Thomas Maier" <balagi@justmail.de>
Cc: "Jens Axboe" <jens.axboe@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: David Howells <dhowells@redhat.com>
Cc: Peter Osterlund <petero2@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20 10:26:35 -07:00
Thomas Maier
79e2de4bc5 [PATCH] export clear_queue_congested and set_queue_congested
Export the clear_queue_congested() and set_queue_congested() functions
located in ll_rw_blk.c

The functions are renamed to blk_clear_queue_congested() and
blk_set_queue_congested().

(needed in the pktcdvd driver's bio write congestion control)

Signed-off-by: Thomas Maier <balagi@justmail.de>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20 10:26:35 -07:00
Vasily Tarasov
c584164224 [PATCH] block layer: elv_iosched_show should get elv_list_lock
elv_iosched_show function iterates other elv_list, hence
elv_list_lock should be got.

Signed-off-by: Vasily Tarasov <vtaras@openvz.org>
Signed-off-by: Vasily Tarasov <jens.axboe@oracle.com>
2006-10-12 15:08:51 +02:00
Vasily Tarasov
a22b169df1 [PATCH] block layer: elevator_find function cleanup
We can easily produce search through the elevator list
without introducing additional elevator_type variable.

Signed-off-by: Vasily Tarasov <vtaras@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-10-12 15:08:51 +02:00
David C Somayajulu
f583f4924d [PATCH] helper function for retrieving scsi_cmd given host based block layer tag
This was necessitated by the need for a function to get back
to a scsi_cmnd, when an hba the posts its (corresponding) completion
interrupt with a block layer tag as its reference.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Somayajulu <david.somayajulu@qlogic.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2006-10-04 19:32:09 +02:00
Alasdair G Kergon
7006f6eca8 [PATCH] dm: export blkdev_driver_ioctl
Export blkdev_driver_ioctl for device-mapper.

If we get as far as the device-mapper ioctl handler, we know the ioctl is not
a standard block layer BLK* one, so we don't need to check for them a second
time and can call blkdev_driver_ioctl() directly.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:13 -07:00
Peter Zijlstra
6e9a4738c9 [PATCH] completions: lockdep annotate on stack completions
All on stack DECLARE_COMPLETIONs should be replaced by:
DECLARE_COMPLETION_ONSTACK

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:24 -07:00
Jens Axboe
51d7513a8a [PATCH] Only enable CONFIG_BLOCK option for embedded
It's too easy for people to shoot themselves in the foot, and it
only makes sense for embedded folks anyway.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 21:14:05 +02:00
Jens Axboe
059af497c2 [PATCH] blk_queue_start_tag() shared map race fix
If we share the tag map between two or more queues, then we cannot
use __set_bit() to set the bit. In fact we need to make sure we
atomically acquire this tag, so loop using test_and_set_bit() to
protect from that.

Noticed by Mike Christie <michaelc@cs.wisc.edu>

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:52:34 +02:00
Jens Axboe
0fe2347957 [PATCH] Update axboe@suse.de email address
As people often look for the copyright in files to see who to mail,
update the link to a neutral one.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:52:34 +02:00
David Howells
9361401eb7 [PATCH] BLOCK: Make it possible to disable the block layer [try #6]
Make it possible to disable the block layer.  Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.

This patch does the following:

 (*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
     support.

 (*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
     an item that uses the block layer.  This includes:

     (*) Block I/O tracing.

     (*) Disk partition code.

     (*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.

     (*) The SCSI layer.  As far as I can tell, even SCSI chardevs use the
     	 block layer to do scheduling.  Some drivers that use SCSI facilities -
     	 such as USB storage - end up disabled indirectly from this.

     (*) Various block-based device drivers, such as IDE and the old CDROM
     	 drivers.

     (*) MTD blockdev handling and FTL.

     (*) JFFS - which uses set_bdev_super(), something it could avoid doing by
     	 taking a leaf out of JFFS2's book.

 (*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
     linux/elevator.h contingent on CONFIG_BLOCK being set.  sector_div() is,
     however, still used in places, and so is still available.

 (*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
     parts of linux/fs.h.

 (*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.

 (*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.

 (*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
     is not enabled.

 (*) fs/no-block.c is created to hold out-of-line stubs and things that are
     required when CONFIG_BLOCK is not set:

     (*) Default blockdev file operations (to give error ENODEV on opening).

 (*) Makes some /proc changes:

     (*) /proc/devices does not list any blockdevs.

     (*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.

 (*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.

 (*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
     given command other than Q_SYNC or if a special device is specified.

 (*) In init/do_mounts.c, no reference is made to the blockdev routines if
     CONFIG_BLOCK is not defined.  This does not prohibit NFS roots or JFFS2.

 (*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
     error ENOSYS by way of cond_syscall if so).

 (*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
     CONFIG_BLOCK is not set, since they can't then happen.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:52:31 +02:00
Martin Peschke
4090959aee [PATCH] blktrace: cleanup using on_each_cpu
This patch kills a few lines of code in blktrace by making use of
on_each_cpu().

Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:31:19 +02:00
Oleg Nesterov
25034d7a83 [PATCH] exit_io_context: don't disable irqs
We don't need to disable irqs to clear current->io_context, it is protected
by ->alloc_lock. Even IF it was possible to submit I/O from IRQ on behalf of
current this irq_disable() can't help: current_io_context() will re-instantiate
->io_context after irq_enable().

We don't need task_lock() or local_irq_disable() to clear ioc->task. This can't
prevent other CPUs from playing with our io_context anyway.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:31:18 +02:00
Jens Axboe
7457e6e2d7 [PATCH] blktrace: support for logging metadata reads
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:43 +02:00
Jens Axboe
374f84ac39 [PATCH] cfq-iosched: use metadata read flag
Give meta data reads preference over regular reads, as the process
often needs to get that out of the way to do the io it was actually
interested in.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:43 +02:00
Jens Axboe
5404bc7a87 [PATCH] Allow file systems to differentiate between data and meta reads
We can use this information for making more intelligent priority
decisions, and it will also be useful for blktrace.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:42 +02:00
Jens Axboe
da20a20f3b [PATCH] ll_rw_blk: allow more flexibility for read_ahead_kb store
It can make sense to set read-ahead larger than a single request.
We should not be enforcing such policy on the user. Additionally,
using the BLKRASET ioctl doesn't impose such a restriction. So
additionally we now expose identical behaviour through the two.

Issue also reported by Anton <cbou@mail.ru>

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:41 +02:00
Jens Axboe
bf57225670 [PATCH] cfq-iosched: improve queue preemption
Don't touch the current queues, just make sure that the wanted queue
is selected next. Simplifies the logic.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:41 +02:00
Jens Axboe
dc72ef4ae3 [PATCH] Add blk_start_queueing() helper
CFQ implements this on its own now, but it's really block layer
knowledge. Tells a device queue to start dispatching requests to
the driver, taking care to unplug if needed. Also fixes the issue
where as/cfq will invoke a stopped queue, which we really don't
want.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:40 +02:00
Jens Axboe
981a79730d [PATCH] cfq-iosched: kill the empty_list
No point in having a place holder list just for empty queues, so remove
it. It's not used for anything other than to keep ->cfq_list busy.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:40 +02:00
Jens Axboe
53b03744e5 [PATCH] cfq-iosched: Kill O(N) runtime of cfq_resort_rr_list()
Currently it scales with number of processes in that priority group,
which is potentially not very nice as it's called quite often.
Basically we always need to do tail inserts, except for the case of a
new process. So just mark/detect a queue as such.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:39 +02:00
Jens Axboe
b5deef9012 [PATCH] Make sure all block/io scheduler setups are node aware
Some were kmalloc_node(), some were still kmalloc(). Change them all to
kmalloc_node().

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:39 +02:00
Jens Axboe
1ea25ecb72 [PATCH] Audit block layer inlines
Kill a few inlines that bring in too much code to more than one location
Shrinks kernel text by about 300 bytes on 32-bit x86.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:38 +02:00
Jens Axboe
4050cf1674 [PATCH] cfq-iosched: use new io context counting mechanism
It's ok if the read path is a lot more costly, as long as inc/dec is
really cheap. The inc/dec will happen for each created/freed io context,
while the reading only happens when a disk queue exits.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:37 +02:00
Jens Axboe
e4313dd423 [PATCH] as-iosched: use new io context counting mechanism
It's ok if the read path is a lot more costly, as long as inc/dec is
really cheap. The inc/dec will happen for each created/freed io context,
while the reading only happens when a disk queue exits.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:37 +02:00
Jens Axboe
fc46379daf [PATCH] cfq-iosched: kill cfq_exit_lock
cfq_exit_lock is protecting two things now:

- The per-ioc rbtree of cfq_io_contexts

- The per-cfqd linked list of cfq_io_contexts

The per-cfqd linked list can be protected by the queue lock, as it is (by
definition) per cfqd as the queue lock is.

The per-ioc rbtree is mainly used and updated by the process itself only.
The only outside use is the io priority changing. If we move the
priority changing to not browsing the rbtree, we can remove any locking
from the rbtree updates and lookup completely. Let the sys_ioprio syscall
just mark processes as having the iopriority changed and lazily update
the private cfq io contexts the next time io is queued, and we can
remove this locking as well.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:36 +02:00
Jens Axboe
89850f7ee9 [PATCH] cfq-iosched: cleanups, fixes, dead code removal
A collection of little fixes and cleanups:

- We don't use the 'queued' sysfs exported attribute, since the
  may_queue() logic was rewritten. So kill it.

- Remove dead defines.

- cfq_set_active_queue() can be rewritten cleaner with else if conditions.

- Several places had cfq_exit_cfqq() like logic, abstract that out and
  use that.

- Annotate the cfqq kmem_cache_alloc() so the allocator knows that this
  is a repeat allocation if it fails with __GFP_WAIT set. Allows the
  allocator to start freeing some memory, if needed. CFQ already loops for
  this condition, so might as well pass the hint down.

- Remove cfqd->rq_starved logic. It's not needed anymore after we dropped
  the crq allocation in cfq_set_request().

- Remove uneeded parameter passing.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:35 +02:00
Jens Axboe
51da90fcb6 [PATCH] ll_rw_blk: cleanup __make_request()
- Don't assign variables that are only used once.

- Kill spin_lock() prefetching, it's opportunistic at best.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:34 +02:00
Jens Axboe
cb78b285c8 [PATCH] Drop useless bio passing in may_queue/set_request API
It's not needed for anything, so kill the bio passing.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:23 +02:00
Jens Axboe
cdd6026217 [PATCH] Remove ->rq_status from struct request
After Christophs SCSI change, the only usage left is RQ_ACTIVE
and RQ_INACTIVE. The block layer sets RQ_INACTIVE right before freeing
the request, so any check for RQ_INACTIVE in a driver is a bug and
indicates use-after-free.

So kill/clean the remaining users, straight forward.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:23 +02:00
Jens Axboe
49171e5c6f [PATCH] Remove struct request_list from struct request
It is always identical to &q->rq, and we only use it for detecting
whether this request came out of our mempool or not. So replace it
with an additional ->flags bit flag.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:22 +02:00
Jens Axboe
c00895ab2f [PATCH] Remove ->waiting member from struct request
As the comments indicates in blkdev.h, we can fold it into ->end_io_data
usage as that is really what ->waiting is. Fixup the users of
blk_end_sync_rq().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:29:12 +02:00
Jens Axboe
8a8e674cb1 [PATCH] as-iosched: kill arq
Get rid of the as_rq request type. With the added elevator_private2, we
have enough room in struct request to get rid of any arq allocation/free
for each request.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
2006-09-30 20:27:02 +02:00
Jens Axboe
5e70537479 [PATCH] cfq-iosched: kill crq
Get rid of the cfq_rq request type. With the added elevator_private2, we
have enough room in struct request to get rid of any crq allocation/free
for each request.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:27:02 +02:00
Jens Axboe
5380a101d3 [PATCH] cfq-iosched: remove the crq flag functions/variable
There's just one flag currently (SYNC), and that one can be grabbed from
the request.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:27:01 +02:00
Jens Axboe
8840faa1ee [PATCH] deadline-iosched: remove elevator private drq request type
A big win, we now save an allocation/free on each request! With the
previous rb/hash abstractions, we can just reuse queuelist/donelist
for the FIFO data and be done with it.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:27:00 +02:00
Jens Axboe
9e2585a8a2 [PATCH] as-iosched: remove arq->is_sync member
We can track this in struct request.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
2006-09-30 20:27:00 +02:00
Jens Axboe
d4f2f4629e [PATCH] as-iosched: reuse rq for fifo
Saves some space in arq.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
2006-09-30 20:27:00 +02:00
Jens Axboe
95e8810b28 [PATCH] cfq-iosched: convert to using the FIFO elevator defines
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:26:59 +02:00
Jens Axboe
b8aca35af5 [PATCH] deadline-iosched: migrate to using the elevator rb functions
This removes the rbtree handling from deadline.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:26:58 +02:00
Jens Axboe
21183b07ee [PATCH] cfq-iosched: migrate to using the elevator rb functions
This removes the rbtree handling from CFQ.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:26:58 +02:00
Jens Axboe
e37f346e34 [PATCH] as-iosched: migrate to using the elevator rb functions
This removes the rbtree handling from AS.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
2006-09-30 20:26:57 +02:00
Jens Axboe
2e662b65f0 [PATCH] elevator: abstract out the rbtree sort handling
The rbtree sort/lookup/reposition logic is mostly duplicated in
cfq/deadline/as, so move it to the elevator core. The io schedulers
still provide the actual rb root, as we don't want to impose any sort
of specific handling on the schedulers.

Introduce the helpers and rb_node in struct request to help migrate the
IO schedulers.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:26:57 +02:00
Jens Axboe
10fd48f237 [PATCH] rbtree: fixed reversed RB_EMPTY_NODE and rb_next/prev
The conditions got reserved. Also make rb_next() and rb_prev() check
for the empty condition.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:26:56 +02:00
Jens Axboe
9817064b68 [PATCH] elevator: move the backmerging logic into the elevator core
Right now, every IO scheduler implements its own backmerging (except for
noop, which does no merging). That results in duplicated code for
essentially the same operation, which is never a good thing. This patch
moves the backmerging out of the io schedulers and into the elevator
core. We save 1.6kb of text and as a bonus get backmerging for noop as
well. Win-win!

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:26:56 +02:00
Jens Axboe
4aff5e2333 [PATCH] Split struct request ->flags into two parts
Right now ->flags is a bit of a mess: some are request types, and
others are just modifiers. Clean this up by splitting it into
->cmd_type and ->cmd_flags. This allows introduction of generic
Linux block message types, useful for sending generic Linux commands
to block devices.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:23:37 +02:00
Alexey Dobriyan
6c5c934153 [PATCH] ifdef blktrace debugging fields
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:09 -07:00
Randy Dunlap
87a5726110 [PATCH] block: handle subsystem_register() init errors
Check and handle init errors.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:05 -07:00
Theodore Ts'o
8e18e2941c [PATCH] inode_diet: Replace inode.u.generic_ip with inode.i_private
The following patches reduce the size of the VFS inode structure by 28 bytes
on a UP x86.  (It would be more on an x86_64 system).  This is a 10% reduction
in the inode size on a UP kernel that is configured in a production mode
(i.e., with no spinlock or other debugging functions enabled; if you want to
save memory taken up by in-core inodes, the first thing you should do is
disable the debugging options; they are responsible for a huge amount of bloat
in the VFS inode structure).

This patch:

The filesystem or device-specific pointer in the inode is inside a union,
which is pretty pointless given that all 30+ users of this field have been
using the void pointer.  Get rid of the union and rename it to i_private, with
a comment to explain who is allowed to use the void pointer.  This is just a
cleanup, but it allows us to reuse the union 'u' for something something where
the union will actually be used.

[judith@osdl.org: powerpc build fix]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Judith Lebzelter <judith@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:17 -07:00
James Bottomley
1aedf2ccc6 Merge mulgrave-w:git/linux-2.6
Conflicts:

	include/linux/blkdev.h

Trivial merge to incorporate tag prototypes.
2006-09-23 21:03:52 -05:00
Trond Myklebust
275a082fe9 Add a real API for dealing with blk_congestion_wait()
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2006-09-22 23:24:54 -04:00
James Bottomley
492dfb4896 [SCSI] block: add support for shared tag maps
The current block queue implementation already contains most of the
machinery for shared tag maps.  The only remaining pieces are a way to
allocate and destroy a tag map independently of the queues (so that
the maps can be managed on the life cycle of the overseeing entity)

Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
2006-08-31 11:17:18 -04:00
Oleg Nesterov
2d8f613160 elv_unregister: fix possible crash on module unload
An exiting task or process which didn't do I/O yet have no io context,
elv_unregister() should check it is not NULL.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-08-22 12:52:23 -07:00
Oleg Nesterov
be33c3a67b [PATCH] cfq_cic_link: fix usage of wrong cfq_io_context
Obviously, cfq_cic_link() shouldn't free a just allocated cfq_io_context?
The dead key is from __cic, so drop that.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-08-21 10:02:54 +02:00
Oleg Nesterov
9f83e45eb5 [PATCH] Fix current_io_context() vs set_task_ioprio() race
I know nothing about io scheduler, but I suspect set_task_ioprio() is not safe.

current_io_context() initializes "struct io_context", then sets ->io_context.
set_task_ioprio() running on another cpu may see the changes out of order, so
->set_ioprio(ioc) may use io_context which was not initialized properly.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-08-21 08:34:15 +02:00
Jens Axboe
44eb123126 [PATCH] cfq-iosched: don't use a hard jiffies value, translate from msecs
The CIC_SEEKY() test really wants to use the minimum of either:

- 2 msecs (not jiffies)

- or, the pending slice time

So code it like that.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-07-25 15:05:21 +02:00
Milton Miller
ad01b1ca79 [PATCH] blktrace: fix read-ahead bit
It should be toggling the same bit on and off, fix it up.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-07-25 15:04:13 +02:00
Arjan van de Ven
ddca60c590 [PATCH] lockdep: annotate the BLKPG_DEL_PARTITION ioctl
The delete partition IOCTL takes the bd_mutex for both the disk and the
partition; these have an obvious hierarchical relationship and this patch
annotates this relationship for lockdep.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:53 -07:00
Jens Axboe
1959d21232 [PATCH] Only the first two bits in bio->bi_rw and rq->flags match
Not three, as assumed. This causes the barrier bit to be needlessly set
for some IO.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-07-06 10:18:05 +02:00
Nathan Scott
40359ccb83 [PATCH] blktrace: readahead support
Provide the needed kernel support for distinguishing readahead
from regular read requests when tracing block devices.

Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-07-06 10:03:28 +02:00
Ingo Molnar
60be6b9a41 [PATCH] lockdep: annotate on-stack completions
lockdep needs to have the waitqueue lock initialized for on-stack waitqueues
implicitly initialized by DECLARE_COMPLETION().  Annotate on-stack completions
accordingly.

Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:09 -07:00
Linus Torvalds
22a3e233ca Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
  Remove obsolete #include <linux/config.h>
  remove obsolete swsusp_encrypt
  arch/arm26/Kconfig typos
  Documentation/IPMI typos
  Kconfig: Typos in net/sched/Kconfig
  v9fs: do not include linux/version.h
  Documentation/DocBook/mtdnand.tmpl: typo fixes
  typo fixes: specfic -> specific
  typo fixes in Documentation/networking/pktgen.txt
  typo fixes: occuring -> occurring
  typo fixes: infomation -> information
  typo fixes: disadvantadge -> disadvantage
  typo fixes: aquire -> acquire
  typo fixes: mecanism -> mechanism
  typo fixes: bandwith -> bandwidth
  fix a typo in the RTC_CLASS help text
  smb is no longer maintained

Manually merged trivial conflict in arch/um/kernel/vmlinux.lds.S
2006-06-30 15:39:30 -07:00
Christoph Lameter
f8891e5e1f [PATCH] Light weight event counters
The remaining counters in page_state after the zoned VM counter patches
have been applied are all just for show in /proc/vmstat.  They have no
essential function for the VM.

We use a simple increment of per cpu variables.  In order to avoid the most
severe races we disable preempt.  Preempt does not prevent the race between
an increment and an interrupt handler incrementing the same statistics
counter.  However, that race is exceedingly rare, we may only loose one
increment or so and there is no requirement (at least not in kernel) that
the vm event counters have to be accurate.

In the non preempt case this results in a simple increment for each
counter.  For many architectures this will be reduced by the compiler to a
single instruction.  This single instruction is atomic for i386 and x86_64.
 And therefore even the rare race condition in an interrupt is avoided for
both architectures in most cases.

The patchset also adds an off switch for embedded systems that allows a
building of linux kernels without these counters.

The implementation of these counters is through inline code that hopefully
results in only a single instruction increment instruction being emitted
(i386, x86_64) or in the increment being hidden though instruction
concurrency (EPIC architectures such as ia64 can get that done).

Benefits:
- VM event counter operations usually reduce to a single inline instruction
  on i386 and x86_64.
- No interrupt disable, only preempt disable for the preempt case.
  Preempt disable can also be avoided by moving the counter into a spinlock.
- Handling is similar to zoned VM counters.
- Simple and easily extendable.
- Can be omitted to reduce memory use for embedded use.

References:

RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:36 -07:00
Jörn Engel
6ab3d5624e Remove obsolete #include <linux/config.h>
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 19:25:36 +02:00
Chandra Seetharaman
5a67e4c5b6 [PATCH] cpu hotplug: use hotplug version of cpu notifier in appropriate places
Make use the of newly defined hotplug version of cpu_notifier functionality
wherever appropriate.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:41 -07:00
Chandra Seetharaman
054cc8a2d8 [PATCH] cpu hotplug: revert initdata patch submitted for 2.6.17
This patch reverts notifier_block changes made in 2.6.17

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:41 -07:00
Andreas Mohr
d6e05edc59 spelling fixes
acquired (aquired)
contiguous (contigious)
successful (succesful, succesfull)
surprise (suprise)
whether (weather)
some other misspellings

Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-26 18:35:02 +02:00
Andi Kleen
8269730b38 [BLOCK] Fix bounce limit address check
Do a safer check for when to enable DMA. Currently we enable ISA DMA
for cases that do not need it, resulting in OOM conditions when ZONE_DMA
runs out of space.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-23 17:10:39 +02:00
Jens Axboe
dd67d05152 [PATCH] rbtree: support functions used by the io schedulers
They all duplicate macros to check for empty root and/or node, and
clearing a node. So put those in rbtree.h.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-23 17:10:39 +02:00
Jens Axboe
fd61af0384 [PATCH] cfq-iosched: rq update fixes
- Remember to set ->last_sector so that the cfq_choose_req() logic
  works correctly.

- Remove redundant call to cfq_choose_req()

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-23 17:10:39 +02:00
Jens Axboe
caaa5f9f0a [PATCH] cfq-iosched: many performance fixes
This is a collection of patches that greatly improve CFQ performance
in some circumstances.

- Change the idling logic to only kick in after a request is done and we
  are deciding what to do. Before the idling included the request service
  time, so it was hard to adjust. Now it's true think/idle time.

- Take advantage of TCQ/NCQ/queueing for seeky sync workloads, but keep
  it in control for sync and sequential (or close to) workloads.

- Expire queues immediately and move on to other busy queues, if we are
  not going to idle after the current one finishes.

- Don't rearm idle timer if there are no busy queues. Just leave the
  system idle.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-23 17:10:39 +02:00
Jens Axboe
35e6077cb1 [PATCH] cfq-iosched: correctly set ioprio on both targets
Patch originally from Vasily Tarasov <vtaras@sw.ru>

If you set io-priority of process 1 using sys_ioprio_set system call by
another process 2 (like ionice do), then cfq_init_prio_data() function
sets priority of process 2 (current) on queue of process 1 and clears
the flag, that designates change of ioprio.  So the process  1 will work
like with priority of process 2.

I propose not to call cfq_init_prio_data() on io-priority change, but
only mark queue as queue with changed prority.  Every time when new
request comes cfq-scheduler checks for this flag and atomaticaly changes
priority of queue to new value.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-23 17:10:39 +02:00
Jens Axboe
b17fd9bceb [PATCH] Make CFQ the default IO scheduler
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-23 17:10:39 +02:00
Jens Axboe
b31dc66a54 [PATCH] Kill PF_SYNCWRITE flag
A process flag to indicate whether we are doing sync io is incredibly
ugly. It also causes performance problems when one does a lot of async
io and then proceeds to sync it. Part of the io will go out as async,
and the other part as sync. This causes a disconnect between the
previously submitted io and the synced io. For io schedulers such as CFQ,
this will cause us lost merges and suboptimal behaviour in scheduling.

Remove PF_SYNCWRITE completely from the fsync/msync paths, and let
the O_DIRECT path just directly indicate that the writes are sync
by using WRITE_SYNC instead.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-23 17:10:39 +02:00
Jens Axboe
271f18f102 [PATCH] cfq-iosched: Don't set the queue batching limits
We cannot update them if the user changes nr_requests, so don't
set it in the first place. The gains are pretty questionable as
well. The batching loss has been shown to decrease throughput.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-23 17:10:38 +02:00
Dave Jones
acf4217555 [PATCH] remove dead code from elevator switching
We already drop the refcount in elevator_exit(), and as
we're setting 'e' to NULL, we'll never take that branch anyway.
Finally, as 'e' is a local var that isn't referenced afterwards,
setting it to NULL is pointless.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-23 17:10:38 +02:00
Paolo 'Blaisorblade' Giarrusso
a038e25364 [PATCH] blk_start_queue() must be called with irq disabled - add warning
The queue lock can be taken from interrupts so it must always be taken with
irq disabling primitives.  Some primitives already verify this.
blk_start_queue() is called under this lock, so interrupts must be
disabled.

Also document this requirement clearly in blk_init_queue(), where the queue
spinlock is set.

Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-23 17:10:38 +02:00
Akinobu Mita
bae386f788 [PATCH] iosched: use hlist for request hashtable
Use hlist instead of list_head for request hashtable in deadline-iosched
and as-iosched. It also can remove the flag to know hashed or unhashed.

Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Jens Axboe <axboe@suse.de>

 block/as-iosched.c       |   45 +++++++++++++++++++--------------------------
 block/deadline-iosched.c |   39 ++++++++++++++++-----------------------
 2 files changed, 35 insertions(+), 49 deletions(-)
2006-06-23 17:10:38 +02:00
Oleg Nesterov
626ab0e69d [PATCH] list: use list_replace_init() instead of list_splice_init()
list_splice_init(list, head) does unneeded job if it is known that
list_empty(head) == 1.  We can use list_replace_init() instead.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:07 -07:00
Kay Sievers
b9d9c82b4d [PATCH] Driver core: add generic "subsystem" link to all devices
Like the SUBSYTEM= key we find in the environment of the uevent, this
creates a generic "subsystem" link in sysfs for every device. Userspace
usually doesn't care at all if its a "class" or a "bus" device. This
provides an unified way to determine the subsytem of a device, regardless
of the way the driver core has created it.

Signed-off-by: Kay Sievers <kay.sievers@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-06-21 12:40:49 -07:00
Linus Torvalds
6b41fd1785 Fix up CFQ scheduler for recent rbtree node shrinkage
The color is now in the low bits of the parent pointer, and initializing
it to 0 happens as part of the whole memset above, so just remove the
unnecessary RB_CLEAR_COLOR.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-20 19:44:03 -07:00
Linus Torvalds
2edc322d42 Merge git://git.infradead.org/~dwmw2/rbtree-2.6
* git://git.infradead.org/~dwmw2/rbtree-2.6:
  [RBTREE] Switch rb_colour() et al to en_US spelling of 'color' for consistency
  Update UML kernel/physmem.c to use rb_parent() accessor macro
  [RBTREE] Update hrtimers to use rb_parent() accessor macro.
  [RBTREE] Add explicit alignment to sizeof(long) for struct rb_node.
  [RBTREE] Merge colour and parent fields of struct rb_node.
  [RBTREE] Remove dead code in rb_erase()
  [RBTREE] Update JFFS2 to use rb_parent() accessor macro.
  [RBTREE] Update eventpoll.c to use rb_parent() accessor macro.
  [RBTREE] Update key.c to use rb_parent() accessor macro.
  [RBTREE] Update ext3 to use rb_parent() accessor macro.
  [RBTREE] Change rbtree off-tree marking in I/O schedulers.
  [RBTREE] Add accessor macros for colour and parent fields of rb_node
2006-06-20 14:51:22 -07:00
Jens Axboe
553698f944 [PATCH] cfq-iosched: fix crash in do_div()
We don't clear the seek stat values in cfq_alloc_io_context(), and if
->seek_mean is unlucky enough to be set to -36 by chance, the first
invocation of cfq_update_io_seektime() will oops with a divide by zero
in do_div().

Just memset the entire cic instead of filling invididual values
independently.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-14 10:22:16 -07:00
Jens Axboe
bc1c116974 [PATCH] elevator switching race
There's a race between shutting down one io scheduler and firing up the
next, in which a new io could enter and cause the io scheduler to be
invoked with bad or NULL data.

To fix this, we need to maintain the queue lock for a bit longer.
Unfortunately we cannot do that, since the elevator init requires to be
run without the lock held.  This isn't easily fixable, without also
changing the mempool API.  So split the initialization into two parts,
and alloc-init operation and an attach operation.  Then we can
preallocate the io scheduler and related structures, and run the attach
inside the lock after we detach the old one.

This patch has survived 30 minutes of 1 second io scheduler switching
with a very busy io load.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-08 15:14:23 -07:00
Jens Axboe
b52a834892 [PATCH] cfq-iosched: busy_rr fairness fix
Now that we select busy_rr for possible service, insert entries at the
back of that list instead of at the front.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-01 18:53:43 +02:00
Jens Axboe
ae818a38d4 [PATCH] cfq-iosched: fix bug in timer handling for the idle class
There's a small window from when the timer is entered and we grab
the queue lock, where cfq_set_active_queue() could be rearming the
timer for us. Seen in the wild on a 12-way ppc box. Fix this by
just using mod_timer(), which will do the right thing for us.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-01 10:13:43 +02:00
Jens Axboe
25776e3594 [PATCH] cfq-iosched: Detect hardware queueing
If the hardware is doing real queueing, decide that it's worthless to
idle the hardware. It does reasonable simultaneous io in that case
anyways, and the idling hurts some work loads.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-01 10:12:26 +02:00
Jens Axboe
12e9fddd6e [PATCH] cfq-iosched: Detect idle process issuing async request
If we are anticipating a sync request from this process and we are
waiting for that and see an async request come in, expire that slice
and move on.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-01 10:09:56 +02:00
Jens Axboe
e0de0206a2 [PATCH] cfq-iosched: check busy queues before deciding we are idle
For just one busy queue (like async write out), we often overlooked
that we could queue more io and decided we were idle instead. This causes
us quite a bit of performance loss.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-06-01 10:07:26 +02:00
Jens Axboe
3793c65c13 [PATCH] cfq-iosched: fixup locking and ->queue_list list management
- Drop cic from the list when seen as dead.
- Fixup the locking, just use a simple spinlock.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-30 20:31:05 -07:00
Jens Axboe
fd0ff8aa1d [PATCH] blk: fix gendisk->in_flight accounting during barrier sequence
While executing barrrier sequence, the bar_rq which carries actual
write was accounted as normal IO on completion, while it wasn't on
queueing.  This caused gendisk->in_flight to be decremented by 1 after
each barrier thus messed up statistics.

This patch makes bar_rq not accounted as normal IO.  As the containing
barrier request as a whole is accounted, part of it shouldn't be.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-23 10:39:43 -07:00
Linus Torvalds
1a2acc9e92 Revert "[BLOCK] Fix oops on removal of SD/MMC card"
This reverts commit 56cf6504fc.

Both Erik Mouw and Andrew Vasquez independently pinpointed this commit
as causing problems, where the slab cache for a driver is never released
(most obviously causing problems when immediately re-loading that
driver, resulting in a "kmem_cache_create: duplicate cache <xyz>"
message, but it can also cause other trouble).

James Bottomley dug into it, and reports:

  "OK, here's the scoop.  The problem patch adds a get of driverfs_dev in
   add_disk(), but doesn't put it again until disk_release() (which occurs
   on final put_disk() of the gendisk).

   However, in SCSI, the driverfs_dev is the sdev_gendev.  That means
   there's a reference held on sdev_gendev  until final disk put.
   Unfortunately, we use the driver model driver_remove to trigger
   del_gendisk (which removes the gendisk from visibility and decrements
   the refcount), so we've introduced an unbreakable deadlock in the
   reference counting with this.

   I suggest simply reversing this patch at the moment.  If Russell and
   Jens can tell me what they're trying to do I'll see if there's another
   way to do it."

so hereby the patch gets reverted, waiting for a better fix.

Cc: Jens Axboe <axboe@suse.de>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: James Bottomley <James.Bottomley@SteelEye.com>
Cc: Erik Mouw <erik@harddisk-recovery.com>
Cc: Andrew Vasquez <andrew.vasquez@qlogic.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-12 12:08:46 -07:00
Jens Axboe
dac07ec121 [BLOCK] limit request_fn recursion
Don't recurse back into the driver even if the unplug threshold is met,
when the driver asks for a requeue. This is both silly from a logical
point of view (requeues typically happen due to driver/hardware
shortage), and also dangerous since we could hit an endless request_fn
-> requeue -> unplug -> request_fn loop and crash on stack overrun.

Also limit blk_run_queue() to one level of recursion, similar to how
blk_start_queue() works.

This patch fixed a real problem with SLES10 and lpfc, and it could hit
any SCSI lld that returns non-zero from it's ->queuecommand() handler.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-11 12:38:59 -07:00
Russell King
56cf6504fc [BLOCK] Fix oops on removal of SD/MMC card
The block layer keeps a reference (driverfs_dev) to the struct
device associated with the block device, and uses it internally
for generating uevents in block_uevent.

Block device uevents include umounting the partition, which can
occur after the backing device has been removed.

Unfortunately, this reference is not counted.  This means that
if the struct device is removed from the device tree, the block
layers reference will become stale.

Guard against this by holding a reference to the struct device
in add_disk(), and only drop the reference when we're releasing
the gendisk kobject - in other words when we can be sure that no
further uevents will be generated for this block device.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Jens Axboe <axboe@suse.de>
2006-05-05 17:57:52 +01:00
Chandra Seetharaman
649bbaa484 [PATCH] Remove __devinitdata from notifier block definitions
Few of the notifier_chain_register() callers use __devinitdata in the
definition of notifier_block data structure.  It is incorrect as the
data structure should be available after the initializations (they do
not unregister them during initializations).

This was leading to an oops when notifier_chain_register() call is
invoked for those callback chains after initialization.

This patch fixes all such usages to _not_ have the notifier_block data
structure in the init data section.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-26 08:27:50 -07:00
David Woodhouse
3db3a44530 [RBTREE] Change rbtree off-tree marking in I/O schedulers.
They were abusing the rb_color field to mark nodes which weren't currently
on the tree. Fix that to use the same method as eventpoll did -- setting
the parent pointer to point back to itself. And use the appropriate
accessor macros for setting and reading the parent.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-04-21 13:15:17 +01:00
Adrian Bunk
4f73247f0e [PATCH] block/elevator.c: remove unused exports
This patch removes the following unused EXPORT_SYMBOL's:
- elv_requeue_request
- elv_completed_request

They are only used by the block core, hence they need not be exported.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-04-20 15:45:22 +02:00
Coywolf Qi Hunt
7daac49020 [patch] cleanup: use blk_queue_stopped
This cleanup the source to use blk_queue_stopped.

Signed-off-by: Coywolf Qi Hunt <qiyong@freeforge.net>
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-04-20 13:04:36 +02:00
OGAWA Hirofumi
be3b075354 [PATCH] cfq: Further rbtree traversal and cfq_exit_queue() race fix
In current code, we are re-reading cic->key after dead cic->key check.
So, in theory, it may really re-read *after* cfq_exit_queue() seted NULL.

To avoid race, we copy it to stack, then use it. With this change, I
guess gcc will assign cic->key to a register or stack, and it wouldn't
be re-readed.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-04-18 19:18:31 +02:00
OGAWA Hirofumi
dbecf3ab40 [PATCH 2/2] cfq: fix cic's rbtree traversal
When queue dies, we set cic->key=NULL as dead mark. So, when we
traverse a rbtree, we must check whether it's still valid key. if it
was invalidated, drop it, then restart the traversal from top.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-04-18 09:45:18 +02:00
OGAWA Hirofumi
fba822722e [PATCH 1/2] iosched: fix typo and barrier()
On rmmod path, cfq/as waits to make sure all io-contexts was
freed. However, it's using complete(), not wait_for_completion().

I think barrier() is not enough in here. To avoid the following case,
this patch replaces barrier() with smb_wmb().

	cpu0			visibility			cpu1
	                [ioc_gnone=NULL,ioc_count=1]

ioc_gnone = &all_gone		NULL,ioc_count=1
atomic_read(&ioc_count)		NULL,ioc_count=1
wait_for_completion()		NULL,ioc_count=0	atomic_sub_and_test()
				NULL,ioc_count=0	if ( && ioc_gone)
						    [ioc_gone==NULL,
						    so doesn't call complete()]
			   &all_gone,ioc_count=0

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-04-18 09:44:06 +02:00