This patch adds bio_copy_kern, similar to bio_copy_user. blk_rq_map_kern
uses bio_copy_kern instead of bio_map_kern when necessary.
bio_copy_kern uses temporary pages, and the bi_end_io callback frees
these pages. bio_copy_kern saves the original kernel buffer in
bio->bi_private, so it doesn't need something like struct bio_map_data
to store information about the caller.
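A minimal sketch (not the actual patch) of how blk_rq_map_kern could pick
between the two paths, assuming the copy path is taken whenever the kernel
buffer or length violates the queue's DMA alignment:

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Sketch only: choose how to map a kernel buffer into a request. */
static struct bio *map_or_copy_kern(struct request_queue *q, struct request *rq,
                                    void *kbuf, unsigned int len, gfp_t gfp_mask)
{
        unsigned long kaddr = (unsigned long) kbuf;
        unsigned int align = queue_dma_alignment(q);
        int reading = rq_data_dir(rq) == READ;

        /* bounce through temporary pages if the buffer can't be mapped as-is */
        if ((kaddr & align) || (len & align))
                return bio_copy_kern(q, kbuf, len, gfp_mask, reading);

        return bio_map_kern(q, kbuf, len, gfp_mask);
}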
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This requires moving rq_init() from get_request() to blk_alloc_request().
The upside is that we can now require an rq_init() call from any path that
wishes to hand a request to the block layer.
rq_init() will be exported for code that uses struct request
without blk_get_request().
This is preparation for large command support, which needs to
initialize struct request properly (that is, just doing a
memset() will not work).
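A hedged sketch of the intended use, assuming rq_init() keeps its current
(queue, request) form once exported; prepare_private_request() is a
hypothetical caller:

#include <linux/blkdev.h>

/* Sketch: a driver that embeds its own struct request (no blk_get_request())
 * must initialize it properly before handing it to the block layer;
 * a plain memset() is not enough. */
static void prepare_private_request(struct request_queue *q, struct request *rq)
{
        rq_init(q, rq);
        rq->cmd_type = REQ_TYPE_BLOCK_PC;
}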
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds release callback support, which is called when a bsg
device goes away. bsg_register_queue() takes a pointer to a callback
function. This feature is useful for stuff like sas_host that can't
use the release callback in struct device.
If a caller doesn't need bsg's release callback, it can pass a NULL
pointer to bsg_register_queue() (e.g. scsi devices can use the release
callback in struct device, so they don't need bsg's).
With this patch, bsg uses a kref for refcounting bsg devices instead of
get/put_device in fops->open/release. bsg calls put_device and the
caller's release callback (if one was registered) from kref_put's
release function.
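A sketch of the registration side, assuming the release callback is passed
as the new fourth argument of bsg_register_queue(); my_bsg_release() and
my_attach_bsg() are made-up names:

#include <linux/blkdev.h>
#include <linux/bsg.h>
#include <linux/device.h>

/* Hypothetical release callback for a transport object (e.g. sas_host)
 * that cannot rely on struct device's own release. */
static void my_bsg_release(struct device *dev)
{
        /* free transport-private state tied to the bsg device */
}

static int my_attach_bsg(struct request_queue *q, struct device *dev,
                         const char *name)
{
        /* callers happy with struct device's release simply pass NULL here */
        return bsg_register_queue(q, dev, name, my_bsg_release);
}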
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
* 'for-2.6.26' of git://git.kernel.dk/linux-2.6-block:
block: fix blk_register_queue() return value
block: fix memory hotplug and bouncing in block layer
block: replace remaining __FUNCTION__ occurrences
Kconfig: clean up block/Kconfig help descriptions
cciss: fix warning oops on rmmod of driver
cciss: Fix race between disk-adding code and interrupt handler
block: move the padding adjustment to blk_rq_map_sg
block: add bio_copy_user_iov support to blk_rq_map_user_iov
block: convert bio_copy_user to bio_copy_user_iov
loop: manage partitions in disk image
cdrom: use kmalloced buffers instead of buffers on stack
cdrom: make unregister_cdrom() return void
cdrom: use list_head for cdrom_device_info list
cdrom: protect cdrom_device_info list by mutex
cdrom: cleanup hardcoded error-code
cdrom: remove ifdef CONFIG_SYSCTL
blk_register_queue() returns -ENXIO when queue->request_fn is NULL, but some
block drivers (for example, brd and loop) call blk_register_queue() via
add_disk() with queue->request_fn == NULL.
Although no one checks the return value of blk_register_queue(), this patch
makes it return 0 instead of -ENXIO when queue->request_fn is NULL.
This patch also adds a warning when blk_register_queue() or
blk_unregister_queue() is called with queue == NULL, rather than ignoring
the invalid usage silently.
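A sketch of the intended behaviour (not the literal diff):

#include <linux/kernel.h>
#include <linux/blkdev.h>
#include <linux/genhd.h>

int blk_register_queue(struct gendisk *disk)
{
        struct request_queue *q = disk->queue;

        if (WARN_ON(!q))        /* warn loudly instead of ignoring misuse */
                return -ENXIO;

        if (!q->request_fn)     /* e.g. brd, loop: nothing to register */
                return 0;

        /* ... register the queue/elevator kobjects as before ... */
        return 0;
}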
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Modify the help descriptions of block/Kconfig for clarity, accuracy and
consistency.
Refactor the BLOCK description a bit. The wording "This permits ... to be
removed" isn't quite right; the block layer is removed when the option is
disabled, whereas most descriptions talk about what happens when the option is
enabled. Reformat the list of what is affected by disabling the block layer.
Add more examples of large block devices to LBD and strive for technical
accuracy; block devices of size _exactly_ 2TB require CONFIG_LBD, not only
those "bigger than 2TB". Also try to say (perhaps not very clearly) that the
config option is only needed when you want individual block devices of size
>= 2TB. For example, if you had 3 x 1TB disks in your computer you'd have 3TB
of total storage, but you wouldn't need the option unless you aggregate those
disks into a single larger device with RAID or LVM.
Improve terminology and grammar on BLK_DEV_IO_TRACE.
I also added the boilerplate "If unsure, say N" to most options.
Precisely say "2TB and larger" for LSF.
Indent the help text for BLK_DEV_BSG by 2 spaces in accordance with the
standard.
Signed-off-by: Nick Andrew <nick@nick-andrew.net>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
blk_rq_map_user adjusts bi_size of the last bio. It breaks the rule
that req->data_len (the true data length) is equal to sum(bio). It
broke the scsi command completion code.
commit e97a294ef6 was introduced to fix
the above issue. However, the partial completion code doesn't work
with it. The commit is also a layer violation (scsi mid-layer should
not know about the block layer's padding).
This patch moves the padding adjustment to blk_rq_map_sg (suggested by
James). The padding works like the drain buffer. This patch breaks the
rule that req->data_len is equal to sum(sg); however, the drain buffer
already broke it. So this patch just restores the rule that
req->data_len is equal to sum(bio) without breaking anything new.
Now when a low level driver needs padding, blk_rq_map_user and
blk_rq_map_user_iov guarantee there's enough room for padding.
blk_rq_map_sg can safely extend the last entry of a scatter list.
blk_rq_map_sg must extend the last entry of a scatter list only for a
request that went through bio_copy_user_iov. This patch introduces a
new REQ_COPY_USER flag for that.
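A sketch of that final step in blk_rq_map_sg() (simplified, not the literal
diff); the padding is safe precisely because the REQ_COPY_USER buffer has
room for it:

#include <linux/blkdev.h>
#include <linux/scatterlist.h>

/* Sketch: grow the last sg entry of a copied request up to the queue's
 * dma_pad_mask; extra_len records the growth, like the drain buffer does. */
static void pad_last_sg(struct request_queue *q, struct request *rq,
                        struct scatterlist *last_sg)
{
        if ((rq->cmd_flags & REQ_COPY_USER) && (rq->data_len & q->dma_pad_mask)) {
                unsigned int pad_len = (q->dma_pad_mask & ~rq->data_len) + 1;

                last_sg->length += pad_len;
                rq->extra_len += pad_len;
        }
}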
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
With this patch, blk_rq_map_user_iov uses bio_copy_user_iov when a low
level driver needs padding or a buffer in sg_iovec isn't aligned. That
is, it uses temporary kernel buffers instead of mapping user pages
directly.
When an LLD needs padding, blk_rq_map_sg later needs to extend the last
entry of a scatter list. bio_copy_user_iov guarantees that there is
enough space for the padding by using temporary kernel buffers instead
of user pages.
blk_rq_map_user_iov needs the buffers in sg_iovec to be aligned. The
comment in blk_rq_map_user_iov indicates that drivers/scsi/sg.c also
needs the buffers in sg_iovec to be aligned. Actually, drivers/scsi/sg.c
works with unaligned buffers in sg_iovec (it always uses temporary
kernel buffers).
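A sketch of the decision described above, i.e. whether blk_rq_map_user_iov
takes the copy path (the exact prototypes of the bio helpers are left out;
need_copy() is a made-up helper name):

#include <linux/blkdev.h>
#include <scsi/sg.h>

/* Sketch: force the copy path if any user buffer is unaligned or if the
 * driver asked for padding via the queue's dma_pad_mask. */
static int need_copy(struct request_queue *q, struct sg_iovec *iov,
                     int iov_count, unsigned int len)
{
        int i;

        for (i = 0; i < iov_count; i++)
                if ((unsigned long) iov[i].iov_base & queue_dma_alignment(q))
                        return 1;

        return (len & q->dma_pad_mask) != 0;
}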
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It's big, but there doesn't seem to be a way to split it up smaller...
Signed-off-by: Tony Jones <tonyj@suse.de>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (137 commits)
[SCSI] iscsi: bidi support for iscsi_tcp
[SCSI] iscsi: bidi support at the generic libiscsi level
[SCSI] iscsi: extended cdb support
[SCSI] zfcp: Fix error handling for blocked unit for send FCP command
[SCSI] zfcp: Remove zfcp_erp_wait from slave destroy handler to fix deadlock
[SCSI] zfcp: fix 31 bit compile warnings
[SCSI] bsg: no need to set BSG_F_BLOCK bit in bsg_complete_all_commands
[SCSI] bsg: remove minor in struct bsg_device
[SCSI] bsg: use better helper list functions
[SCSI] bsg: replace kobject_get with blk_get_queue
[SCSI] bsg: takes a ref to struct device in fops->open
[SCSI] qla1280: remove version check
[SCSI] libsas: fix endianness bug in sas_ata
[SCSI] zfcp: fix compiler warning caused by poking inside new semaphore (linux-next)
[SCSI] aacraid: Do not describe check_reset parameter with its value
[SCSI] aacraid: Fix down_interruptible() to check the return value
[SCSI] sun3_scsi_vme: add MODULE_LICENSE
[SCSI] st: rename flush_write_buffer()
[SCSI] tgt: use KMEM_CACHE macro
[SCSI] initio: fix big endian problems for auto request sense
...
Before bsg_complete_all_commands is called, the BSG_F_BLOCK bit is always
set, so there is no need to set it again there.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
The minor in struct bsg_device is used as an identifier to find the
corresponding struct bsg_class_device. However, the request_queue can be
used as that identifier instead, so the minor in struct bsg_device is
unnecessary.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
This replaces hlist_for_each and list_entry with hlist_for_each_entry
and list_first_entry, respectively.
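Purely illustrative example of the two replacements (the struct and field
names are made up; hlist_for_each_entry is shown with the extra cursor
argument it takes in this kernel version):

#include <linux/list.h>

struct item {
        struct hlist_node hnode;
        struct list_head lnode;
        int busy;
};

static struct item *find_item(struct hlist_head *hhead, struct list_head *lhead)
{
        struct item *it;
        struct hlist_node *pos;

        /* hlist_for_each() + hlist_entry() collapses into one helper */
        hlist_for_each_entry(it, pos, hhead, hnode)
                if (it->busy)
                        return it;

        /* list_entry(head->next, ...) becomes list_first_entry(head, ...),
         * assuming the list is known to be non-empty at this point */
        return list_first_entry(lhead, struct item, lnode);
}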
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Both take a ref to the queue, but blk_get_queue checks QUEUE_FLAG_DEAD
and is the more appropriate interface here.
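A sketch of the open-time usage, assuming blk_get_queue() returns non-zero
for a dead queue:

#include <linux/blkdev.h>
#include <linux/errno.h>

/* Sketch: take the queue reference in fops->open; fails if the queue is
 * already marked QUEUE_FLAG_DEAD, which a bare kobject_get() would not catch. */
static int pin_queue(struct request_queue *q)
{
        if (blk_get_queue(q))
                return -ENXIO;
        return 0;
}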
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
bsg_register_queue() takes a ref to the struct device that a caller
passes. For example, bsg takes a ref to the sdev_gendev for scsi
devices. However, bsg doesn't increase the refcount in fops->open. So
while an application has a bsg device open, the scsi device behind it
can go away (bsg also takes a ref to the queue, but that doesn't
prevent the device from going away).
With this patch, bsg increases the refcount of struct device in
fops->open and decreases it in fops->release.
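A sketch of the idea (the struct and function names below are simplified,
not the actual bsg code):

#include <linux/device.h>

struct bsg_device_sketch {
        struct device *parent;          /* e.g. the scsi sdev_gendev */
};

/* in fops->open: pin the parent device while the file is open */
static void bsg_pin_parent(struct bsg_device_sketch *bd)
{
        get_device(bd->parent);
}

/* in fops->release: drop the reference taken at open time */
static void bsg_unpin_parent(struct bsg_device_sketch *bd)
{
        put_device(bd->parent);
}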
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
hdparm explicitly marks the HDIO_[UNREGISTER,SCAN]_HWIF ioctls as DANGEROUS,
and given the number of bugs we can assume that there are no real users:
* DMA has no chance of working because DMA resources are released by
ide_unregister() and they are never allocated again.
* Since ide_init_hwif_ports() is used for ->io_ports[] setup, the ioctls
  don't work for almost all hosts with "non-standard" (== non ISA-like)
  layouts of IDE taskfile registers (there are a lot of such host drivers).
* ide_port_init_devices() is not called when probing IDE devices so:
- drive->autotune is never set and IDE host/devices are not programmed
for the correct PIO/DMA transfer modes (=> possible data corruption)
- host specific I/O 32-bit and IRQ unmasking settings are not applied
(=> possible data corruption)
- host specific ->port_init_devs method is not called (=> no luck with
ht6560b, qd65xx and opti621 host drivers)
* ->rw_disk method is not preserved (=> no HPT3xxN chipsets support).
* ->serialized flag is not preserved (=> possible data corruption when
using icside, aec62xx (ATP850UF chipset), cmd640, cs5530, hpt366
(HPT3xxN chipsets), rz1000, sc1200, dtc2278 and ht6560b host drivers).
* ->ack_intr method is not preserved (=> needed by ide-cris, buddha,
gayle and macide host drivers).
* ->sata_scr[] and sata_misc[] are cleared by ide_unregister() and aren't
  initialized again (SiI3112 support needs them).
* To issue an ioctl() there needs to be at least one IDE device present
  in the system.
* ->cable_detect method is not preserved + it is not called when probing
IDE devices so cable detection is broken (however since DMA support is
also broken it doesn't really matter ;-).
* Some objects which may have already been freed in ide_unregister()
are restored by ide_hwif_restore() (i.e. ->hwgroup).
* ide_register_hw() may unregister unrelated IDE ports if a free ide_hwifs[]
  slot cannot be found.
* When IDE host drivers are modular, an unregistered port may be re-used by
  a different host driver than the one that owned it first, causing subtle bugs.
Since we now have proper warm-plug support, remove these ioctls,
then remove the no longer needed:
- ide_register_hw() and ide_hwif_restore() functions
- 'init_default' and 'restore' arguments of ide_unregister()
- zeroing of hwif->{dma,extra}_* fields in ide_unregister()
As an added bonus IDE core code size shrinks by ~3kB (x86-32).
v2:
* fix ide_unregister() arguments in cleanup_module() (Andrew Morton).
v3:
* fix ide_unregister() arguments in palm_bk3710.c.
Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
When switching scheduler from cfq, cfq_exit_queue() does not clear
ioc->ioc_data, leaving a dangling pointer that can deceive the following
lookups when the iosched is switched back to cfq. The pattern that can
trigger that is the following:
- elevator switch from cfq to something else;
- module unloading, with elv_unregister() calling cfq_free_io_context()
  on the ioc, freeing the cic (via the .trim op);
- module gets reloaded and the elevator switches back to cfq;
- reallocation of a cic at the same address as before (with a valid key).
To fix it, just assign NULL to ioc_data in __cfq_exit_single_io_context(),
which is called from the regular exit path and from the elevator switching
code. The only path that frees a cic and is not covered is the error handling
one, but cic's freed in that way are never cached in ioc_data.
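A simplified sketch of the fix inside __cfq_exit_single_io_context() (the
surrounding locking is omitted):

#include <linux/iocontext.h>
#include <linux/rcupdate.h>

/* Sketch: drop the cached lookup hint when its cic is torn down, so a later
 * switch back to cfq can never dereference a stale pointer. */
static void clear_cached_cic(struct io_context *ioc, void *cic)
{
        if (ioc->ioc_data == cic)
                rcu_assign_pointer(ioc->ioc_data, NULL);
}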
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
SLAB_DESTROY_BY_RCU is not a direct substitute for normal call_rcu()
freeing, since it RCU-delays the freeing of slab pages but NOT of the
objects themselves (an object may be reused immediately). So change cfq
to do the deferred freeing on its own.
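A generic sketch of explicit RCU-deferred freeing (the struct name is made
up and kfree() stands in for the real kmem_cache free):

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct cic_sketch {
        struct rcu_head rcu_head;
        /* ... cfq_io_context payload ... */
};

static void cic_free_rcu(struct rcu_head *head)
{
        kfree(container_of(head, struct cic_sketch, rcu_head));
}

/* the object itself is only reused after a full grace period */
static void cic_free(struct cic_sketch *cic)
{
        call_rcu(&cic->rcu_head, cic_free_rcu);
}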
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Looking a bit closer into this regression, the reason this can't be
right is that the common default for dma_addr is BLK_BOUNCE_HIGH and
most machines have less than 4G. So if you do:
if (b_pfn <= (min_t(u64, 0xffffffff, BLK_BOUNCE_HIGH) >> PAGE_SHIFT))
dma = 1
that will translate to:
if (BLK_BOUNCE_HIGH <= BLK_BOUNCE_HIGH)
dma = 1
So for 99% of hardware this will trigger unnecessary GFP_DMA
allocations and ISA bounce-pool operations.
Also note how the 32bit code still does b_pfn < blk_max_low_pfn.
I guess this is what you were looking for. I didn't verify, but as
far as I can tell this will stop the regression with ISA DMA
operations at boot for 99% of blkdev/memory combinations out there, and
I guess it fixes the setups with >4G of RAM and 32-bit PCI cards as
well (this also retains symmetry with the 32-bit code).
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Fixes:
block/genhd.c:361: warning: ignoring return value of ‘class_register’, declared with attribute warn_unused_result
Signed-off-by: Roland McGrath <roland@redhat.com>
Acked-by: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is important to e.g. dm, which tries to decide whether to stop using
barriers or not.
Tested as working by Anders Henke <anders.henke@1und1.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Introduced between 2.6.25-rc2 and -rc3
block/blk-map.c:154:14: warning: symbol 'bio' shadows an earlier one
block/blk-map.c:110:13: originally declared here
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Introduced between 2.6.25-rc2 and -rc3
block/blk-settings.c:319:12: warning: function 'blk_queue_dma_drain' with external linkage has definition
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch removes the unused export of blk_rq_map_user_iov.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch removes the unused exports of blk_{get,put}_queue.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch contains the following cleanups:
- make the needlessly global struct disk_type static
- #if 0 the unused genhd_media_change_notify()
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds a proper prototype for blk_dev_init() in block/blk.h
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Every file should include the headers containing the externs for its
global functions (in this case for __blk_queue_free_tags()).
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
For some non-x86 systems with 4GB or more than 4GB of memory,
we need to increase the range of addresses that can be
used for direct DMA in the 64-bit kernel.
Signed-off-by: Yang Shi <yang.shi@windriver.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Block layer alignment was used for two different purposes - memory
alignment and padding. This causes problems in lower layers because
drivers which only require memory alignment end up with an adjusted
rq->data_len. Separate out padding such that padding occurs only if the
driver explicitly requests it.
Tomo: restore the code to update the bio in blk_rq_map_user
introduced by commit 40b01b9bbd
according to the padding alignment.
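After the split, a driver states the two requirements separately; a hedged
example (the masks are made-up values):

#include <linux/blkdev.h>

static void my_driver_set_limits(struct request_queue *q)
{
        blk_queue_dma_alignment(q, 0x03);       /* buffers 4-byte aligned in memory */
        blk_queue_dma_pad(q, 0x03);             /* transfer lengths padded to 4 bytes */
}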
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The meaning of rq->data_len was changed from the true data length to the
length of the allocated buffer. That breaks SG_IO and friends, and
bsg. This patch restores the meaning of rq->data_len to the true data
length and adds rq->extra_len to store the extra length (due to the
drain buffer and padding).
This patch also removes the code to update the bio in blk_rq_map_user
introduced by commit 40b01b9bbd.
That commit adjusts the bio according to memory alignment
(queue_dma_alignment). However, memory alignment is NOT padding
alignment. This adjustment also breaks SG_IO and friends, and bsg. Padding
alignment needs to be fixed in a proper way (by a separate patch).
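In other words (sketch, field names per the description above):

#include <linux/blkdev.h>

/* what SG_IO, bsg and friends should see: the true data length */
static unsigned int true_len(struct request *rq)
{
        return rq->data_len;
}

/* what the LLD programs into the controller: true length plus drain/padding */
static unsigned int mapped_len(struct request *rq)
{
        return rq->data_len + rq->extra_len;
}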
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <axboe@carl.home.kernel.dk>
Clear drain buffer before chaining if the command in question is a
write.
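A simplified sketch of the check (field names as in request_queue after the
drain patches):

#include <linux/blkdev.h>
#include <linux/string.h>

/* zero the drain area for writes so an overflow into it cannot hand
 * stale kernel memory to the device */
static void clear_drain_if_write(struct request_queue *q, struct request *rq)
{
        if (q->dma_drain_size && q->dma_drain_needed(rq) &&
            rq_data_dir(rq) == WRITE)
                memset(q->dma_drain_buffer, 0, q->dma_drain_size);
}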
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Draining shouldn't be done for commands where overflow may indicate
data integrity issues. Add a dma_drain_needed callback to the
request_queue. The drain buffer is appended only if this function
returns non-zero.
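A hedged usage sketch, assuming the post-patch prototype
blk_queue_dma_drain(q, drain_needed_fn, buf, size); the drain policy shown
is a made-up example:

#include <linux/blkdev.h>

/* decide per command whether appending the drain buffer is safe;
 * hypothetical policy: only drain packet (BLOCK_PC) commands */
static int my_drain_needed(struct request *rq)
{
        return blk_pc_request(rq);
}

static int my_setup_drain(struct request_queue *q, void *buf, unsigned int size)
{
        return blk_queue_dma_drain(q, my_drain_needed, buf, size);
}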
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
With padding and draining moved into it, block layer now may extend
requests as directed by queue parameters, so now a request has two
sizes - the original request size and the extended size which matches
the size of area pointed to by bios and later by sgs. The latter size
is what lower layers are primarily interested in when allocating,
filling up DMA tables and setting up the controller.
Both padding and draining extend the data area to accommodate
controller characteristics. As any controller which speaks SCSI can
handle underflows, feeding it a larger data area is safe.
So, this patch makes the primary data length field, request->data_len,
indicate the size of the full data area, and adds a separate length field,
request->raw_data_len, for the unmodified request size. The latter is
used for reporting to higher layers (userland) and wherever the original
request size should be fed to the controller or device.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
DMA start address and transfer size alignment for PC requests are
achieved using bio_copy_user() instead of bio_map_user(). This works
because bio_copy_user() always uses full pages and block DMA alignment
isn't allowed to go over PAGE_SIZE.
However, the implementation didn't update the last bio of the request
to make this padding visible to lower layers. This patch makes
blk_rq_map_user() extend the last bio such that it includes the
padding area, so the size of the area pointed to by the request is
properly aligned.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently we fail if someone requests a valid io scheduler but it's
modular and not currently loaded. That can happen from a driver init
asking for a different scheduler, or from online switching through sysfs
as requested by a user.
This patch makes elevator_get() call request_module() to attempt to load
the appropriate module, instead of requiring that to be done manually.
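A simplified sketch of the retry logic (elevator_find() is the internal
lookup helper; the "<name>-iosched" module naming is assumed, and locking
is omitted):

#include <linux/elevator.h>
#include <linux/kmod.h>

static struct elevator_type *elevator_get_sketch(const char *name)
{
        struct elevator_type *e = elevator_find(name);

        if (!e) {
                request_module("%s-iosched", name);     /* may sleep */
                e = elevator_find(name);                /* retry after load */
        }
        return e;
}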
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It's cumbersome to browse a radix tree from start to finish, especially
since we modify keys when a process exits. So add a hlist for the single
purpose of browsing over all known cfq_io_contexts, used for exit,
io prio change, etc.
This fixes http://bugzilla.kernel.org/show_bug.cgi?id=9948
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Rearrange fields in cache order and initialize some fields that
we didn't previously init. Remove the init of ->completion_data; it's
part of a union with ->hash. Luckily, clearing the rb node is the same
as setting it to NULL!
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It blindly copies everything in the io_context, including the lock.
That doesn't work so well for either lock ordering or lockdep.
There seems to be zero point in swapping io contexts on a request-to-request
merge, so the best course of action is to just remove it.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Since it's acquired from irq context, all locking must be of the
irq-safe variant. Most uses are already inside the queue lock (which
already disables interrupts), but the io scheduler rmmod path
always has irqs enabled and the put_io_context() path may legally
be called with irqs enabled (even if it usually isn't). So fix up
those two.
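A minimal illustration of the irq-safe variant, assuming ioc->lock is the
lock in question:

#include <linux/iocontext.h>
#include <linux/spinlock.h>

static void touch_ioc(struct io_context *ioc)
{
        unsigned long flags;

        spin_lock_irqsave(&ioc->lock, flags);   /* safe even with irqs enabled */
        /* ... modify ioc state ... */
        spin_unlock_irqrestore(&ioc->lock, flags);
}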
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>