Commit graph

16014 commits

Author SHA1 Message Date
Jiro SEKIBA
565de406e7 nilfs2: expand inode_inc_link_count and inode_dec_link_count
This is an intermidiate patch to reduce redandunt mark_inode_dirty() calls
by calling inode_inc_link_count() and inode_dec_link_count() functions.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-27 20:05:15 +09:00
Jiro SEKIBA
43f8bc262f nilfs2: delete mark_inode_dirty from nilfs_set_link
Delete mark_inode_dirty() from nilfs_set_link() to reduce redundant
mark_inode_dirty() calls in caller of nilfs_set_link().

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-27 20:05:15 +09:00
Jiro SEKIBA
9ca941d4b6 nilfs2: delete mark_inode_dirty in nilfs_new_inode
It is redundant to call mark_inode_dirty() in nilfs_new_inode() because
all caller of nilfs_new_inode() will call mark_inode_dirty()
after calling nilfs_new_inode() directly or indirectly in transaction.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-27 20:05:15 +09:00
Hidetoshi Seto
d5b7c78e97 sched: Remove task_{u,s,g}time()
Now all task_{u,s}time() pairs are replaced by task_times().
And task_gtime() is too simple to be an inline function.

Cleanup them all.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Spencer Candland <spencer@bluehost.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
LKML-Reference: <4B0E16D1.70902@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-26 12:59:20 +01:00
Hidetoshi Seto
d180c5bcce sched: Introduce task_times() to replace task_{u,s}time() pair
Functions task_{u,s}time() are called in pair in almost all
cases.  However task_stime() is implemented to call task_utime()
from its inside, so such paired calls run task_utime() twice.

It means we do heavy divisions (div_u64 + do_div) twice to get
utime and stime which can be obtained at same time by one set
of divisions.

This patch introduces a function task_times(*tsk, *utime,
*stime) to retrieve utime and stime at once in better, optimized
way.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Spencer Candland <spencer@bluehost.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
LKML-Reference: <4B0E16AE.906@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-26 12:59:19 +01:00
Ingo Molnar
16bc67edeb Merge branch 'sched/urgent' into sched/core
Merge reason: Pick up fixes that did not make it into .32.0

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-26 10:50:42 +01:00
Vivek Goyal
d9449ce35a Fix regression in direct writes performance due to WRITE_ODIRECT flag removal
There seems to be a regression in direct write path due to following
commit in for-2.6.33 branch of block tree.

commit 1af60fbd75
Author: Jeff Moyer <jmoyer@redhat.com>
Date:   Fri Oct 2 18:56:53 2009 -0400

    block: get rid of the WRITE_ODIRECT flag

Marking direct writes as WRITE_SYNC_PLUG instead of WRITE_ODIRECT, sets
the NOIDLE flag in bio and hence in request. This tells CFQ to not expect
more request from the queue and not idle on it (despite the fact that
queue's think time is less and it is not seeky).

So direct writers lose big time when competing with sequential readers.

Using fio, I have run one direct writer and two sequential readers and
following are the results with 2.6.32-rc7 kernel and with for-2.6.33
branch.

Test
====
1 direct writer and 2 sequential reader running simultaneously.

[global]
directory=/mnt/sdc/fio/
runtime=10

[seqwrite]
rw=write
size=4G
direct=1

[seqread]
rw=read
size=2G
numjobs=2

2.6.32-rc7
==========
direct writes: aggrb=2,968KB/s
readers	     : aggrb=101MB/s

for-2.6.33 branch
=================
direct write: aggrb=19KB/s
readers	      aggrb=137MB/s

This patch brings back the WRITE_ODIRECT flag, with the difference that we
don't set the BIO_RW_UNPLUG flag so that device is not unplugged after
submission of request and an explicit unplug from submitter is required.

That way we fix the jeff's issue of not enough merging taking place in aio
path as well as make sure direct writes get their fair share.

After the fix
=============
for-2.6.33 + fix
----------------
direct writes: aggrb=2,728KB/s
reads: aggrb=103MB/s

Thanks
Vivek

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-11-26 09:46:46 +01:00
Ilya Loginov
2d4dc890b5 block: add helpers to run flush_dcache_page() against a bio and a request's pages
Mtdblock driver doesn't call flush_dcache_page for pages in request.  So,
this causes problems on architectures where the icache doesn't fill from
the dcache or with dcache aliases.  The patch fixes this.

The ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE symbol was introduced to avoid
pointless empty cache-thrashing loops on architectures for which
flush_dcache_page() is a no-op.  Every architecture was provided with this
flush pages on architectires where ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE is
equal 1 or do nothing otherwise.

See "fix mtd_blkdevs problem with caches on some architectures" discussion
on LKML for more information.

Signed-off-by: Ilya Loginov <isloginov@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Peter Horton <phorton@bitbox.co.uk>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-11-26 09:16:19 +01:00
Steve French
2f81e752da [CIFS] Fix sparse warning
Also update CHANGES file

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-11-25 00:11:31 +00:00
Steve French
cea6234395 [CIFS] Duplicate data on appending to some Samba servers
SMB writes are sent with a starting offset and length. When the server
supports the newer SMB trans2 posix open (rather than using the SMB
NTCreateX) a file can be opened with SMB_O_APPEND flag, and for that
case Samba server assumes that the offset sent in SMBWriteX is unneeded
since the write should go to the end of the file - which can cause
problems if the write was cached (since the beginning part of a
page could be written twice by the client mm).  Jeff suggested that
masking the flag on posix open on the client is easiest for the time
being. Note that recent Samba server also had an unrelated problem with
SMB NTCreateX and append (see samba bugzilla bug number 6898) which
should not affect current Linux clients (unless cifs Unix Extensions
are disabled).

The cifs client did not send the O_APPEND flag on posix open
before 2.6.29 so the fix is unneeded on early kernels.

In the future, for the non-cached case (O_DIRECT, and forcedirectio mounts)
it would be possible and useful to send O_APPEND on posix open (for Windows
case: FILE_APPEND_DATA but not FILE_WRITE_DATA on SMB NTCreateX) but for
cached writes although the vfs sets the offset to end of file it
may fragment a write across pages - so we can't send O_APPEND on
open (could result in sending part of a page twice).

CC: Stable <stable@kernel.org>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-11-24 22:52:13 +00:00
Steve French
8e6c0332d5 [CIFS] fix oops in cifs_lookup during net boot
Fixes bugzilla.kernel.org bug number 14641

Lookup called during network boot (network root filesystem
for diskless workstation) has case where nd is null in
lookup.  This patch fixes that in cifs_lookup.

(Shirish noted that 2.6.30 and 2.6.31 stable need the same check)

Signed-off-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Tested-by:  Vladimir Stavrinov <vs@inist.ru>
CC: Stable <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-11-24 22:17:59 +00:00
Wu Fengguang
3f0ca30985 ext4: remove unused parameter wbc from __ext4_journalled_writepage()
CC: Jan Kara <jack@suse.cz> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-24 11:15:44 -05:00
Akira Fujita
ac48b0a1d0 ext4: move_extent_per_page() cleanup
Integrate duplicate lines (acquire/release semaphore and invalidate
extent cache in move_extent_per_page()) into mext_replace_branches(),
to reduce source and object code size.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-24 10:31:56 -05:00
Kazuya Mio
446aaa6e7e ext4: initialize moved_len before calling ext4_move_extents()
The move_extent.moved_len is used to pass back the number of exchanged
blocks count to user space.  Currently the caller must clear this
field; but we spend more code space checking for this requirement than
simply zeroing the field ourselves, so let's just make life easier for
everyone all around.

Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-24 10:28:48 -05:00
Akira Fujita
94d7c16cbb ext4: Fix double-free of blocks with EXT4_IOC_MOVE_EXT
At the beginning of ext4_move_extent(), we call
ext4_discard_preallocations() to discard inode PAs of orig and donor
inodes.  But in the following case, blocks can be double freed, so
move ext4_discard_preallocations() to the end of ext4_move_extents().

1. Discard inode PAs of orig and donor inodes with
   ext4_discard_preallocations() in ext4_move_extents().

   orig : [ DATA1 ]
   donor: [ DATA2 ]

2. While data blocks are exchanging between orig and donor inodes, new
   inode PAs is created to orig by other process's block allocation.
   (Since there are semaphore gaps in ext4_move_extents().)  And new
   inode PAs is used partially (2-1).

   2-1 Create new inode PAs to orig inode
   orig : [ DATA1 | used PA1 | free PA1 ]
   donor: [ DATA2 ]

3. Donor inode which has old orig inode's blocks is deleted after
   EXT4_IOC_MOVE_EXT finished (3-1, 3-2).  So the block bitmap
   corresponds to old orig inode's blocks are freed.

   3-1 After EXT4_IOC_MOVE_EXT finished
   orig : [ DATA2 |  free PA1 ]
   donor: [ DATA1 |  used PA1 ]

   3-2 Delete donor inode
   orig : [ DATA2 |  free PA1 ]
   donor: [ FREE SPACE(DATA1) | FREE SPACE(used PA1) ]

4. The double-free of blocks is occurred, when close() is called to
   orig inode.  Because ext4_discard_preallocations() for orig inode
   frees used PA1 and free PA1, though used PA1 is already freed in 3.

   4-1 Double-free of blocks is occurred
   orig : [ DATA2 |  FREE SPACE(free PA1) ]
   donor: [ FREE SPACE(DATA1) | DOUBLE FREE(used PA1) ]

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-24 10:19:57 -05:00
Christoph Hellwig
774888bcd6 UBIFS: remove manual O_SYNC handling
generic_file_aio_write already calls into ->fsync to handle O_SYNC/O_DSYNC.
Remove the duplicate call to ubifs_sync_wbufs_by_inode which is already
covered by ubifs_fsync.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-11-24 08:18:55 +02:00
Corentin Chary
9722324e65 UBIFS: support mounting of UBI volume character devices
This patch makes it possible to mount UBI character device
nodes, and use something like:

$ mount -t ubifs /dev/ubi_volume_name /mnt/ubifs

instead of the old restrictive 'nodev' semantics:

$ mount -t ubifs ubi0_0 /mnt/ubifs

[Comments and the patch were amended a bit by Artem]

Signed-off-by: Corentin Chary <corentincj@iksaif.net>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-11-24 08:18:54 +02:00
Tetsuo Handa
fe542cf59b LSM: Move security_path_chmod()/security_path_chown() to after mutex_lock().
We should call security_path_chmod()/security_path_chown() after mutex_lock()
in order to avoid races.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-11-24 08:49:26 +11:00
Karel Zak
87038c2d5b partitions: read whole sector with EFI GPT header
The size of EFI GPT header is not static, but whole sector is
allocated for the header. The HeaderSize field must be greater
than 92 (= sizeof(struct gpt_header) and must be less than or
equal to the logical block size.

It means we have to read whole sector with the header, because the
header crc32 checksum is calculated according to HeaderSize.

For more details see UEFI standard (version 2.3, May 2009):
  - 5.3.1 GUID Format overview, page 93
  - Table 13. GUID Partition Table Header, page 96

Signed-off-by: Karel Zak <kzak@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-11-23 09:29:58 +01:00
Karel Zak
7d13af3279 partitions: use sector size for EFI GPT
Currently, kernel uses strictly 512-byte sectors for EFI GPT parsing.
That's wrong.

UEFI standard (version 2.3, May 2009, 5.3.1 GUID Format overview, page
95) defines that LBA is always based on the logical block size. It
means bdev_logical_block_size() (aka BLKSSZGET) for Linux.

This patch removes static sector size from EFI GPT parser.

The problem is reproducible with the latest GNU Parted:

 # modprobe scsi_debug dev_size_mb=50 sector_size=4096

  # ./parted /dev/sdb print
  Model: Linux scsi_debug (scsi)
  Disk /dev/sdb: 52.4MB
  Sector size (logical/physical): 4096B/4096B
  Partition Table: gpt

  Number  Start   End     Size    File system  Name     Flags
   1      24.6kB  3002kB  2978kB               primary
   2      3002kB  6001kB  2998kB               primary
   3      6001kB  9003kB  3002kB               primary

  # blockdev --rereadpt /dev/sdb
  # dmesg | tail -1
   sdb: unknown partition table      <---- !!!

with this patch:

  # blockdev --rereadpt /dev/sdb
  # dmesg | tail -1
   sdb: sdb1 sdb2 sdb3

Signed-off-by: Karel Zak <kzak@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-11-23 09:29:13 +01:00
Theodore Ts'o
9084d47197 ext4: use ext4_data_block_valid() in ext4_free_blocks()
The block validity framework does a more comprehensive set of checks,
and it saves object code space to use the ext4_data_block_valid() than
the limited open-coded version that had been in ext4_free_blocks().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-22 20:48:42 -05:00
Theodore Ts'o
1585d8d89a ext4: add check for wraparound in ext4_data_block_valid()
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-22 20:48:34 -05:00
Theodore Ts'o
e6362609b6 ext4: call ext4_forget() from ext4_free_blocks()
Add the facility for ext4_forget() to be called from
ext4_free_blocks().  This simplifies the code in a large number of
places, and centralizes most of the work of calling ext4_forget() into
a single place.

Also fix a bug in the extents migration code; it wasn't calling
ext4_forget() when releasing the indirect blocks during the
conversion.  As a result, if the system cashed during or shortly after
the extents migration, and the released indirect blocks get reused as
data blocks, the journal replay would corrupt the data blocks.  With
this new patch, fixing this bug was as simple as adding the
EXT4_FREE_BLOCKS_FORGET flags to the call to ext4_free_blocks().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
2009-11-23 07:17:05 -05:00
Theodore Ts'o
4433871130 ext4: fold ext4_free_blocks() and ext4_mb_free_blocks()
ext4_mb_free_blocks() is only called by ext4_free_blocks(), and the
latter function doesn't really do much.  So merge the two functions
together, such that ext4_free_blocks() is now found in
fs/ext4/mballoc.c.  This saves about 200 bytes of compiled text space.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-22 07:44:56 -05:00
Theodore Ts'o
b7e57e7c2a ext4: fold ext4_journal_forget() into ext4_forget()
Convert the last two callers of ext4_journal_forget() to use
ext4_forget() instead, and then fold ext4_journal_forget() into
ext4_forget().  This reduces are code complexity and shortens our call
stack.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-22 21:00:13 -05:00
Theodore Ts'o
e4684b3fbb ext4: fold ext4_journal_revoke() into ext4_forget()
The only caller of ext4_journal_revoke() is ext4_forget(), so we can
fold ext4_journal_revoke() into ext4_forget() to simplify the code and
shorten the call stack.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-24 11:05:59 -05:00
Theodore Ts'o
d6797d14b1 ext4: move ext4_forget() to ext4_jbd2.c
The ext4_forget() function better belongs in ext4_jbd2.c.  This will
allow us to do some cleanup of the ext4_journal_revoke() and
ext4_journal_forget() functions, as well as giving us better error
reporting since we can report the caller of ext4_forget() when things
go wrong.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-22 20:52:12 -05:00
David Howells
4fa9f4ede8 FS-Cache: Provide nop fscache_stat_d() if CONFIG_FSCACHE_STATS=n
Provide nop fscache_stat_d() macro if CONFIG_FSCACHE_STATS=n lest errors like
the following occur:

	fs/fscache/cache.c: In function 'fscache_withdraw_cache':
	fs/fscache/cache.c:386: error: implicit declaration of function 'fscache_stat_d'
	fs/fscache/cache.c:386: error: 'fscache_n_cop_sync_cache' undeclared (first use in this function)
	fs/fscache/cache.c:386: error: (Each undeclared identifier is reported only once
	fs/fscache/cache.c:386: error: for each function it appears in.)
	fs/fscache/cache.c:392: error: 'fscache_n_cop_dissociate_pages' undeclared (first use in this function)

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-20 21:50:44 +00:00
David Howells
1c2ea8a2c0 SLOW_WORK: Fix GFS2 to #include <linux/module.h> before using THIS_MODULE
GFS2 has been altered to pass THIS_MODULE to slow_work_register_user(), but
hasn't been altered to #include <linux/module.h> to provide it, resulting in
the following error:

	fs/gfs2/recovery.c:596: error: 'THIS_MODULE' undeclared here (not in a function)

Add the missing #include.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-20 21:50:40 +00:00
David Howells
0109d7e614 SLOW_WORK: Fix CIFS to pass THIS_MODULE to slow_work_register_user()
As of the patch:

	SLOW_WORK: Wait for outstanding work items belonging to a module to clear

	Wait for outstanding slow work items belonging to a module to clear
	when unregistering that module as a user of the facility.  This
	prevents the put_ref code of a work item from being taken away before
	it returns.

slow_work_register_user() takes a module pointer as an argument.  CIFS must now
pass THIS_MODULE as that argument, lest the following error be observed:

	fs/cifs/cifsfs.c: In function 'init_cifs':
	fs/cifs/cifsfs.c:1040: error: too few arguments to function 'slow_work_register_user'

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-20 21:50:36 +00:00
Frederic Weisbecker
1d2c6cfd40 kill-the-bkl/reiserfs: turn GFP_ATOMIC flag to GFP_NOFS in reiserfs_get_block()
GFP_ATOMIC was used in reiserfs_get_block to not lose the Bkl so that
nobody can modify the tree in the middle of its work. Now that we
kicked out the bkl, we can use a more friendly flag. We use GFP_NOFS
here because we already hold the reiserfs lock.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Laurent Riffard <laurent.riffard@free.fr>
Cc: Thomas Gleixner <tglx@linutronix.de>
2009-11-20 18:25:02 +01:00
Linus Torvalds
931ed94430 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  ocfs2: Trivial cleanup of jbd compatibility layer removal
  ocfs2: Refresh documentation
  ocfs2: return f_fsid info in ocfs2_statfs()
  ocfs2: duplicate inline data properly during reflink.
  ocfs2: Move ocfs2_complete_reflink to the right place.
  ocfs2: Return -EINVAL when a device is not ocfs2.
2009-11-19 20:29:05 -08:00
Ryusuke Konishi
0234576d04 nilfs2: add norecovery mount option
This adds "norecovery" mount option which disables temporal write
access to read-only mounts or snapshots during mount/recovery.
Without this option, write access will be even performed for those
types of mounts; the temporal write access is needed to mount root
file system read-only after an unclean shutdown.

This option will be helpful when user wants to prevent any write
access to the device.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Eric Sandeen <sandeen@redhat.com>
2009-11-20 10:05:52 +09:00
Ryusuke Konishi
a057d2c011 nilfs2: add helper to get if volume is in a valid state
This adds a helper function, nilfs_valid_fs() which returns if nilfs
is in a valid state or not.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:52 +09:00
Ryusuke Konishi
f50a4c8149 nilfs2: move recovery completion into load_nilfs function
Although mount recovery of nilfs is integrated in load_nilfs()
procedure, the completion of recovery was isolated from the procedure
and performed at the end of the fill_super routine.

This was somewhat confusing since the recovery is needed for the nilfs
object, not for a super block instance.

To resolve the inconsistency, this will integrate the recovery
completion into load_nilfs().

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:52 +09:00
Ryusuke Konishi
050b4142c9 nilfs2: apply readahead for recovery on mount
This inserts readahead in the recovery code.  The readahead request is
issued per segment while searching the latest super root block.

This will shorten mount time after unclean unmount.  A measurement
shows the recovery time was reduced by more than 60 percent:

 e.g. real  0m11.586s -> 0m3.918s  (x 2.96)

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:51 +09:00
Ryusuke Konishi
f021759d74 nilfs2: clean up get/put function of a segment usage
This eliminates obsolete nilfs_get_sufile_get_segment_usage() and
nilfs_set_sufile_segment_usage() from sufile.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:51 +09:00
Ryusuke Konishi
071ec54dd7 nilfs2: move routine to set segment usage into sufile
This adds nilfs_sufile_set_segment_usage() function in sufile to
replace direct access to the sufile metadata in log writer code.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:51 +09:00
Ryusuke Konishi
61a189e9c6 nilfs2: move routine marking segment usage dirty into sufile
This adds nilfs_sufile_mark_dirty() function in sufile to replace
nilfs_touch_segusage() function in log writer code.  This is a
preparation for the further cleanup which will move out low level
sufile operations in the log writer.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:51 +09:00
Ryusuke Konishi
70622a2091 nilfs2: insert cache operation in palloc get block routines
This implements cache operation in get block routines of palloc code:
nilfs_palloc_get_desc_block(), nilfs_palloc_get_bitmap_block(), and
nilfs_palloc_get_entry_block().

This will complete the palloc cache.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:51 +09:00
Ryusuke Konishi
49fa7a5902 nilfs2: add palloc cache to ifile
This adds the palloc cache to ifile.  The palloc cache is allocated on
the extended region of nilfs_mdt_info struct.  The struct
nilfs_ifile_info defines the extended on memory structure of ifile.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:50 +09:00
Ryusuke Konishi
c3ea56c800 nilfs2: flush palloc cache before manipulating data pages of GC dat
Data pages in gcdat metadata file (i.e. the secondary DAT for GC), are
cleared or even moved back to the normal DAT when a shot of garbage
collection was done.

Buffer heads held by the palloc cache of gcdat must be cleared before
these page cache manipulation.  This adds nilfs_palloc_clear_cache()
to ensure this.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:50 +09:00
Ryusuke Konishi
8908b2f70b nilfs2: add palloc cache to dat
This adds the palloc cache to DAT file.  The palloc cache is allocated
on the extended region of nilfs_mdt_info struct.  The struct
nilfs_dat_info defines the extended on memory structure of DAT.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:50 +09:00
Ryusuke Konishi
db38d5ad32 nilfs2: add cache framework for persistent object allocator
This adds setup and cleanup routines of the persistent object
allocator cache.

According to ftrace analyses, accessing buffers of the DAT file
suffers indispensable overhead many times.  To mitigate the overhead,
This introduce cache framework for the persistent object allocator
(palloc) which the DAT file and ifile are using.

struct nilfs_palloc_cache represents the cache object per metadata
file using palloc.

The cache is initialized through nilfs_palloc_setup_cache() and
destroyed by nilfs_palloc_destroy_cache(); callers of the former
function will be added to individual allocators of DAT and ifile on
successive patches.

nilfs_palloc_destroy_cache() will be called from nilfs_mdt_destroy()
if the cache is attached to a metadata file.  A companion function
nilfs_palloc_clear_cache() is provided to allow releasing buffer head
references independently with the cleanup task.  This adjunctive
function will be used before invalidating pages of metadata file with
the cache.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:50 +09:00
Ryusuke Konishi
141bbdba9c nilfs2: unfold nilfs_palloc_block_get_bitmap function
This expands a trivial address calculation in the function into its
every callsite. This expansion improves readability of the callers.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:50 +09:00
Ryusuke Konishi
1376e931b7 nilfs2: eliminate nilfs_btnode_get function
This removes the obsolete nilfs_btnode_get() function and makes
nilfs_btree_get_block() directly call nilfs_btnode_submit_block().

This expansion will provide better opportunity for code optimization.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:50 +09:00
Ryusuke Konishi
75f65edfcc nilfs2: remove newblk argument from nilfs_btnode_submit_block
This removes the obsolete argument from nilfs_btnode_submit_block().
This will complete separating a create function of btree node.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:50 +09:00
Ryusuke Konishi
45f4910bc0 nilfs2: use nilfs_btnode_create_block function
This displaces nilfs_btnode_get() use to create new btree node block
with nilfs_btnode_create_block.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:49 +09:00
Ryusuke Konishi
d501d73689 nilfs2: separate function for creating new btree node block
Adds a separate routine for creating a btree node block.  This is a
preparation to reduce the depth of function calls during submitting
btree node buffer.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:49 +09:00
Ryusuke Konishi
b34a65069c nilfs2: avoid readahead on metadata file for create mode
This turns off readhead action of metadata file if nilfs_mdt_get_block
function was called with a create flag.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:49 +09:00
Ryusuke Konishi
ef7d4757a5 nilfs2: simplify nilfs_sufile_get_ncleansegs function
Previously, this function took an status code to return possible error
codes.  The ("nilfs2: add local variable to cache the number of clean
segments") patch removed the possibility to return errors.

So, this simplifies the function definition to make it directly return
the number of clean segments.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:49 +09:00
Ryusuke Konishi
aa474a2201 nilfs2: add local variable to cache the number of clean segments
This makes it possible for sufile to get the number of clean segments
faster.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:49 +09:00
Ryusuke Konishi
7b16c8a211 nilfs2: unfold nilfs_sufile_block_get_header function
This unfolds the nilfs_sufile_block_get_header() function for
simplicity.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:48 +09:00
Ryusuke Konishi
fd66c0d5c3 nilfs2: hide nilfs_mdt_clear calls in nilfs_mdt_destroy
This will hide a function call of nilfs_mdt_clear() in
nilfs_mdt_destroy().

This ensures nilfs_mdt_destroy() to do cleanup jobs included in
nilfs_mdt_clear().

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:48 +09:00
Ryusuke Konishi
3961f0e277 nilfs2: eliminate inlines to directly read/write inode of metadata files
Removes two inline functions: nilfs_mdt_read_inode_direct() and
nilfs_mdt_write_inode_direct().

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:48 +09:00
Ryusuke Konishi
8707df3847 nilfs2: separate read method of meta data files on super root block
Will displace nilfs_mdt_read_inode_direct function with an individual
read method: nilfs_dat_read, nilfs_sufile_read, nilfs_cpfile_read.

This provides the opportunity to initialize local variables of each
metadata file after reading the inode.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:48 +09:00
Ryusuke Konishi
79739565e1 nilfs2: separate constructor of metadata files
This will displace nilfs_mdt_new() constructor with individual
metadata file constructors like nilfs_dat_new(), new_sufile_new(),
nilfs_cpfile_new(), and nilfs_ifile_new().

This makes it possible for each metadata file to have own
intialization code.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:48 +09:00
Ryusuke Konishi
5731e191f2 nilfs2: add size option of private object to metadata file allocator
This adds an optional "object size" argument to nilfs_mdt_new_common()
function; the argument specifies the size of private object attached
to a newly allocated metadata file inode.

This will afford space to keep local variables for meta data files.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:48 +09:00
Ryusuke Konishi
9cb4e0d2b9 nilfs2: move out mark_inode_dirty calls from bmap routines
Previously, nilfs_bmap_add_blocks() and nilfs_bmap_sub_blocks() called
mark_inode_dirty() after they changed the number of data blocks.

This moves these calls outside bmap outermost functions like
nilfs_bmap_insert() or nilfs_bmap_truncate().

This will mitigate overhead for truncate or delete operation since
they repeatedly remove set of blocks.  Nearly 10 percent improvement
was observed for removal of a large file:

 # dd if=/dev/zero of=/test/aaa bs=1M count=512
 # time rm /test/aaa

  real  2.968s -> 2.705s

Further optimization may be possible by eliminating these
mark_inode_dirty() uses though I avoid mixing separate changes here.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:47 +09:00
Ryusuke Konishi
09bf4aae0a nilfs2: stop marking metadata inode dirty within btree operations
Since metadata file routines mark the inode dirty after they
successfully changed bmap objects, nilfs_mdt_mark_dirty() calls in
nilfs_bmap_add_blocks() and nilfs_bmap_sub_blocks() are redundant.

This removes these overlapping calls from the bmap routines.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:47 +09:00
Ryusuke Konishi
30db4e6c3d nilfs2: remove buffer locking from btree code
lock_buffer() and unlock_buffer() uses in btree.c are eliminable
because btree functions gain buffer heads through nilfs_btnode_get(),
which never returns an on-the-fly buffer.

Although nilfs_clear_dirty_page() and nilfs_copy_back_pages() in
nilfs_commit_gcdat_inode() juggle btree node buffers of DAT, this is
safe because these operations are protected by a log writer lock or
the metadata file semaphore of DAT.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:47 +09:00
Ryusuke Konishi
a49762fd11 nilfs2: remove buffer locking in nilfs_mark_inode_dirty
This lock is eliminable because inodes on the buffer can be updated
independently.  Although a log writer also fills in bmap data on the
on-disk inodes, this update is exclusively done by a log writer lock.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:47 +09:00
Jiro SEKIBA
e2073e7857 nilfs2: cleanup unused match_bool function
match_bool function is not used anymore.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:47 +09:00
Jiro SEKIBA
91f1953bf3 nilfs2: Using nobarrier option instead of barrier=off
Since most of fs using nofoobar style option,
modified barrier=off option as nobarrier.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:47 +09:00
Jiro SEKIBA
6600b9dd8e nilfs2: move definition of struct nilfs_btree_node
This is a trivial patch to expose struct nilfs_fs_btree_node.
The struct should be exposed outside of kernel, for it is disk format.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:46 +09:00
Ryusuke Konishi
9b945d537d nilfs2: get rid of BUG_ON use in btree lookup routines
The current btree lookup routines make a kernel oops when detected
inconsistency in btree blocks.  These routines should instead return a
proper error code because the inconsistency usually comes from
corruption of on-disk metadata.

This fixes the issue by converting BUG_ON calls to proper error
handlings.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-20 10:05:46 +09:00
Linus Torvalds
e6236f781c Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  SUNRPC: Address buffer overrun in rpc_uaddr2sockaddr()
  NFSv4: Fix a cache validation bug which causes getcwd() to return ENOENT
2009-11-19 13:43:19 -08:00
Eric Sandeen
e3bb52ae2b ext4: make "norecovery" an alias for "noload"
Users on the linux-ext4 list recently complained about differences
across filesystems w.r.t. how to mount without a journal replay.

In the discussion it was noted that xfs's "norecovery" option is
perhaps more descriptively accurate than "noload," so let's make
that an alias for ext4.

Also show this status in /proc/mounts

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-19 14:28:50 -05:00
Eric Sandeen
5328e63531 ext4: make trim/discard optional (and off by default)
It is anticipated that when sb_issue_discard starts doing
real work on trim-capable devices, we may see issues.  Make
this mount-time optional, and default it to off until we know
that things are working out OK.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-19 14:25:42 -05:00
Jan Kara
2bba702d4f ext4: fix error handling in ext4_ind_get_blocks()
When an error happened in ext4_splice_branch we failed to notice that
in ext4_ind_get_blocks and mapped the buffer anyway. Fix the problem
by checking for error properly.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-23 07:24:48 -05:00
Theodore Ts'o
6b17d902fd ext4: avoid issuing unnecessary barriers
We don't to issue an I/O barrier on an error or if we force commit
because we are doing data journaling.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
2009-11-23 07:24:57 -05:00
David Howells
14e69647c8 CacheFiles: Don't log lookup/create failing with ENOBUFS
Don't log the CacheFiles lookup/create object routined failing with ENOBUFS as
under high memory load or high cache load they can do this quite a lot.  This
error simply means that the requested object cannot be created on disk due to
lack of space, or due to failure of the backing filesystem to find sufficient
resources.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:12:08 +00:00
David Howells
fee096deb4 CacheFiles: Catch an overly long wait for an old active object
Catch an overly long wait for an old, dying active object when we want to
replace it with a new one.  The probability is that all the slow-work threads
are hogged, and the delete can't get a look in.

What we do instead is:

 (1) if there's nothing in the slow work queue, we sleep until either the dying
     object has finished dying or there is something in the slow work queue
     behind which we can queue our object.

 (2) if there is something in the slow work queue, we return ETIMEDOUT to
     fscache_lookup_object(), which then puts us back on the slow work queue,
     presumably behind the deletion that we're blocked by.  We are then
     deferred for a while until we work our way back through the queue -
     without blocking a slow-work thread unnecessarily.

A backtrace similar to the following may appear in the log without this patch:

	INFO: task kslowd004:5711 blocked for more than 120 seconds.
	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	kslowd004     D 0000000000000000     0  5711      2 0x00000080
	 ffff88000340bb80 0000000000000046 ffff88002550d000 0000000000000000
	 ffff88002550d000 0000000000000007 ffff88000340bfd8 ffff88002550d2a8
	 000000000000ddf0 00000000000118c0 00000000000118c0 ffff88002550d2a8
	Call Trace:
	 [<ffffffff81058e21>] ? trace_hardirqs_on+0xd/0xf
	 [<ffffffffa011c4d8>] ? cachefiles_wait_bit+0x0/0xd [cachefiles]
	 [<ffffffffa011c4e1>] cachefiles_wait_bit+0x9/0xd [cachefiles]
	 [<ffffffff81353153>] __wait_on_bit+0x43/0x76
	 [<ffffffff8111ae39>] ? ext3_xattr_get+0x1ec/0x270
	 [<ffffffff813531ef>] out_of_line_wait_on_bit+0x69/0x74
	 [<ffffffffa011c4d8>] ? cachefiles_wait_bit+0x0/0xd [cachefiles]
	 [<ffffffff8104c125>] ? wake_bit_function+0x0/0x2e
	 [<ffffffffa011bc79>] cachefiles_mark_object_active+0x203/0x23b [cachefiles]
	 [<ffffffffa011c209>] cachefiles_walk_to_object+0x558/0x827 [cachefiles]
	 [<ffffffffa011a429>] cachefiles_lookup_object+0xac/0x12a [cachefiles]
	 [<ffffffffa00aa1e9>] fscache_lookup_object+0x1c7/0x214 [fscache]
	 [<ffffffffa00aafc5>] fscache_object_state_machine+0xa5/0x52d [fscache]
	 [<ffffffffa00ab4ac>] fscache_object_slow_work_execute+0x5f/0xa0 [fscache]
	 [<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
	 [<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
	 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
	 [<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
	 [<ffffffff8104be91>] kthread+0x7a/0x82
	 [<ffffffff8100beda>] child_rip+0xa/0x20
	 [<ffffffff8100b87c>] ? restore_args+0x0/0x30
	 [<ffffffff8104be17>] ? kthread+0x0/0x82
	 [<ffffffff8100bed0>] ? child_rip+0x0/0x20
	1 lock held by kslowd004/5711:
	 #0:  (&sb->s_type->i_mutex_key#7/1){+.+.+.}, at: [<ffffffffa011be64>] cachefiles_walk_to_object+0x1b3/0x827 [cachefiles]

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:12:05 +00:00
David Howells
d0e27b7808 CacheFiles: Better showing of debugging information in active object problems
Show more debugging information if cachefiles_mark_object_active() is asked to
activate an active object.

This may happen, for instance, if the netfs tries to register an object with
the same key multiple times.

The code is changed to (a) get the appropriate object lock to protect the
cookie pointer whilst we dereference it, and (b) get and display the cookie key
if available.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:12:02 +00:00
David Howells
6511de33c8 CacheFiles: Mark parent directory locks as I_MUTEX_PARENT to keep lockdep happy
Mark parent directory locks as I_MUTEX_PARENT in the callers of
cachefiles_bury_object() so that lockdep doesn't complain when that invokes
vfs_unlink():

=============================================
[ INFO: possible recursive locking detected ]
2.6.32-rc6-cachefs #47
---------------------------------------------
kslowd002/3089 is trying to acquire lock:
 (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<ffffffff810bbf72>] vfs_unlink+0x8b/0x128

but task is already holding lock:
 (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<ffffffffa00e4e61>] cachefiles_walk_to_object+0x1b0/0x831 [cachefiles]

other info that might help us debug this:
1 lock held by kslowd002/3089:
 #0:  (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<ffffffffa00e4e61>] cachefiles_walk_to_object+0x1b0/0x831 [cachefiles]

stack backtrace:
Pid: 3089, comm: kslowd002 Not tainted 2.6.32-rc6-cachefs #47
Call Trace:
 [<ffffffff8105ad7b>] __lock_acquire+0x1649/0x16e3
 [<ffffffff8118170e>] ? inode_has_perm+0x5f/0x61
 [<ffffffff8105ae6c>] lock_acquire+0x57/0x6d
 [<ffffffff810bbf72>] ? vfs_unlink+0x8b/0x128
 [<ffffffff81353ac3>] mutex_lock_nested+0x54/0x292
 [<ffffffff810bbf72>] ? vfs_unlink+0x8b/0x128
 [<ffffffff8118179e>] ? selinux_inode_permission+0x8e/0x90
 [<ffffffff8117e271>] ? security_inode_permission+0x1c/0x1e
 [<ffffffff810bb4fb>] ? inode_permission+0x99/0xa5
 [<ffffffff810bbf72>] vfs_unlink+0x8b/0x128
 [<ffffffff810adb19>] ? kfree+0xed/0xf9
 [<ffffffffa00e3f00>] cachefiles_bury_object+0xb6/0x420 [cachefiles]
 [<ffffffff81058e21>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffffa00e7e24>] ? cachefiles_check_object_xattr+0x233/0x293 [cachefiles]
 [<ffffffffa00e51b0>] cachefiles_walk_to_object+0x4ff/0x831 [cachefiles]
 [<ffffffff81032238>] ? finish_task_switch+0x0/0xb2
 [<ffffffffa00e3429>] cachefiles_lookup_object+0xac/0x12a [cachefiles]
 [<ffffffffa00741e9>] fscache_lookup_object+0x1c7/0x214 [fscache]
 [<ffffffffa0074fc5>] fscache_object_state_machine+0xa5/0x52d [fscache]
 [<ffffffffa00754ac>] fscache_object_slow_work_execute+0x5f/0xa0 [fscache]
 [<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
 [<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
 [<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
 [<ffffffff8104be91>] kthread+0x7a/0x82
 [<ffffffff8100beda>] child_rip+0xa/0x20
 [<ffffffff8100b87c>] ? restore_args+0x0/0x30
 [<ffffffff8104be17>] ? kthread+0x0/0x82
 [<ffffffff8100bed0>] ? child_rip+0x0/0x20

Signed-off-by: Daivd Howells <dhowells@redhat.com>
2009-11-19 18:11:58 +00:00
David Howells
5e929b33c3 CacheFiles: Handle truncate unlocking the page we're reading
Handle truncate unlocking the page we're attempting to read from the backing
device before the read has completed.

This was causing reports like the following to occur:

	Pid: 4765, comm: kslowd Not tainted 2.6.30.1 #1
	Call Trace:
	 [<ffffffffa0331d7a>] ? cachefiles_read_waiter+0xd9/0x147 [cachefiles]
	 [<ffffffff804b74bd>] ? __wait_on_bit+0x60/0x6f
	 [<ffffffff8022bbbb>] ? __wake_up_common+0x3f/0x71
	 [<ffffffff8022cc32>] ? __wake_up+0x30/0x44
	 [<ffffffff8024a41f>] ? __wake_up_bit+0x28/0x2d
	 [<ffffffffa003a793>] ? ext3_truncate+0x4d7/0x8ed [ext3]
	 [<ffffffff80281f90>] ? pagevec_lookup+0x17/0x1f
	 [<ffffffff8028c2ff>] ? unmap_mapping_range+0x59/0x1ff
	 [<ffffffff8022cc32>] ? __wake_up+0x30/0x44
	 [<ffffffff8028e286>] ? vmtruncate+0xc2/0xe2
	 [<ffffffff802b82cf>] ? inode_setattr+0x22/0x10a
	 [<ffffffffa003baa5>] ? ext3_setattr+0x17b/0x1e6 [ext3]
	 [<ffffffff802b853d>] ? notify_change+0x186/0x2c9
	 [<ffffffffa032d9de>] ? cachefiles_attr_changed+0x133/0x1cd [cachefiles]
	 [<ffffffffa032df7f>] ? cachefiles_lookup_object+0xcf/0x12a [cachefiles]
	 [<ffffffffa0318165>] ? fscache_lookup_object+0x110/0x122 [fscache]
	 [<ffffffffa03188c3>] ? fscache_object_slow_work_execute+0x590/0x6bc
	[fscache]
	 [<ffffffff80278f82>] ? slow_work_thread+0x285/0x43a
	 [<ffffffff8024a446>] ? autoremove_wake_function+0x0/0x2e
	 [<ffffffff80278cfd>] ? slow_work_thread+0x0/0x43a
	 [<ffffffff8024a317>] ? kthread+0x54/0x81
	 [<ffffffff8020c93a>] ? child_rip+0xa/0x20
	 [<ffffffff8024a2c3>] ? kthread+0x0/0x81
	 [<ffffffff8020c930>] ? child_rip+0x0/0x20
	CacheFiles: I/O Error: Readpage failed on backing file 200000000000810
	FS-Cache: Cache cachefiles stopped due to I/O error

Reported-by: Christian Kujau <lists@nerdbynature.de>
Reported-by: Takashi Iwai <tiwai@suse.de>
Reported-by: Duc Le Minh <duclm.vn@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:55 +00:00
David Howells
a17754fb8c CacheFiles: Don't write a full page if there's only a partial page to cache
cachefiles_write_page() writes a full page to the backing file for the last
page of the netfs file, even if the netfs file's last page is only a partial
page.

This causes the EOF on the backing file to be extended beyond the EOF of the
netfs, and thus the backing file will be truncated by cachefiles_attr_changed()
called from cachefiles_lookup_object().

So we need to limit the write we make to the backing file on that last page
such that it doesn't push the EOF too far.

Also, if a backing file that has a partial page at the end is expanded, we
discard the partial page and refetch it on the basis that we then have a hole
in the file with invalid data, and should the power go out...  A better way to
deal with this could be to record a note that the partial page contains invalid
data until the correct data is written into it.

This isn't a problem for netfs's that discard the whole backing file if the
file size changes (such as NFS).

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:52 +00:00
David Howells
868411be3f FS-Cache: Actually requeue an object when requested
FS-Cache objects have an FSCACHE_OBJECT_EV_REQUEUE event that can theoretically
be raised to ask the state machine to requeue the object for further processing
before the work function returns to the slow-work facility.

However, fscache_object_work_execute() was clearing that bit before checking
the event mask to see whether the object has any pending events that require it
to be requeued immediately.

Instead, the bit should be cleared after the check and enqueue.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:48 +00:00
David Howells
60d543ca72 FS-Cache: Start processing an object's operations on that object's death
Start processing an object's operations when that object moves into the DYING
state as the object cannot be destroyed until all its outstanding operations
have completed.

Furthermore, make sure that read and allocation operations handle being woken
up on a dead object.  Such events are recorded in the Allocs.abt and
Retrvls.abt statistics as viewable through /proc/fs/fscache/stats.

The code for waiting for object activation for the read and allocation
operations is also extracted into its own function as it is much the same in
all cases, differing only in the stats incremented.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:45 +00:00
David Howells
d461d26dde FS-Cache: Make sure FSCACHE_COOKIE_LOOKING_UP cleared on lookup failure
We must make sure that FSCACHE_COOKIE_LOOKING_UP is cleared on lookup failure
(if an object reaches the LC_DYING state), and we should clear it before
clearing FSCACHE_COOKIE_CREATING.

If this doesn't happen then fscache_wait_for_deferred_lookup() may hold
allocation and retrieval operations indefinitely until they're interrupted by
signals - which in turn pins the dying object until they go away.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:41 +00:00
David Howells
2175bb06dc FS-Cache: Add a retirement stat counter
Add a stat counter to count retirement events rather than ordinary release
events (the retire argument to fscache_relinquish_cookie()).

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:38 +00:00
David Howells
201a15428b FS-Cache: Handle pages pending storage that get evicted under OOM conditions
Handle netfs pages that the vmscan algorithm wants to evict from the pagecache
under OOM conditions, but that are waiting for write to the cache.  Under these
conditions, vmscan calls the releasepage() function of the netfs, asking if a
page can be discarded.

The problem is typified by the following trace of a stuck process:

	kslowd005     D 0000000000000000     0  4253      2 0x00000080
	 ffff88001b14f370 0000000000000046 ffff880020d0d000 0000000000000007
	 0000000000000006 0000000000000001 ffff88001b14ffd8 ffff880020d0d2a8
	 000000000000ddf0 00000000000118c0 00000000000118c0 ffff880020d0d2a8
	Call Trace:
	 [<ffffffffa00782d8>] __fscache_wait_on_page_write+0x8b/0xa7 [fscache]
	 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
	 [<ffffffffa0078240>] ? __fscache_check_page_write+0x63/0x70 [fscache]
	 [<ffffffffa00b671d>] nfs_fscache_release_page+0x4e/0xc4 [nfs]
	 [<ffffffffa00927f0>] nfs_release_page+0x3c/0x41 [nfs]
	 [<ffffffff810885d3>] try_to_release_page+0x32/0x3b
	 [<ffffffff81093203>] shrink_page_list+0x316/0x4ac
	 [<ffffffff8109372b>] shrink_inactive_list+0x392/0x67c
	 [<ffffffff813532fa>] ? __mutex_unlock_slowpath+0x100/0x10b
	 [<ffffffff81058df0>] ? trace_hardirqs_on_caller+0x10c/0x130
	 [<ffffffff8135330e>] ? mutex_unlock+0x9/0xb
	 [<ffffffff81093aa2>] shrink_list+0x8d/0x8f
	 [<ffffffff81093d1c>] shrink_zone+0x278/0x33c
	 [<ffffffff81052d6c>] ? ktime_get_ts+0xad/0xba
	 [<ffffffff81094b13>] try_to_free_pages+0x22e/0x392
	 [<ffffffff81091e24>] ? isolate_pages_global+0x0/0x212
	 [<ffffffff8108e743>] __alloc_pages_nodemask+0x3dc/0x5cf
	 [<ffffffff81089529>] grab_cache_page_write_begin+0x65/0xaa
	 [<ffffffff8110f8c0>] ext3_write_begin+0x78/0x1eb
	 [<ffffffff81089ec5>] generic_file_buffered_write+0x109/0x28c
	 [<ffffffff8103cb69>] ? current_fs_time+0x22/0x29
	 [<ffffffff8108a509>] __generic_file_aio_write+0x350/0x385
	 [<ffffffff8108a588>] ? generic_file_aio_write+0x4a/0xae
	 [<ffffffff8108a59e>] generic_file_aio_write+0x60/0xae
	 [<ffffffff810b2e82>] do_sync_write+0xe3/0x120
	 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
	 [<ffffffff810b18e1>] ? __dentry_open+0x1a5/0x2b8
	 [<ffffffff810b1a76>] ? dentry_open+0x82/0x89
	 [<ffffffffa00e693c>] cachefiles_write_page+0x298/0x335 [cachefiles]
	 [<ffffffffa0077147>] fscache_write_op+0x178/0x2c2 [fscache]
	 [<ffffffffa0075656>] fscache_op_execute+0x7a/0xd1 [fscache]
	 [<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
	 [<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
	 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
	 [<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
	 [<ffffffff8104be91>] kthread+0x7a/0x82
	 [<ffffffff8100beda>] child_rip+0xa/0x20
	 [<ffffffff8100b87c>] ? restore_args+0x0/0x30
	 [<ffffffff8102ef83>] ? tg_shares_up+0x171/0x227
	 [<ffffffff8104be17>] ? kthread+0x0/0x82
	 [<ffffffff8100bed0>] ? child_rip+0x0/0x20

In the above backtrace, the following is happening:

 (1) A page storage operation is being executed by a slow-work thread
     (fscache_write_op()).

 (2) FS-Cache farms the operation out to the cache to perform
     (cachefiles_write_page()).

 (3) CacheFiles is then calling Ext3 to perform the actual write, using Ext3's
     standard write (do_sync_write()) under KERNEL_DS directly from the netfs
     page.

 (4) However, for Ext3 to perform the write, it must allocate some memory, in
     particular, it must allocate at least one page cache page into which it
     can copy the data from the netfs page.

 (5) Under OOM conditions, the memory allocator can't immediately come up with
     a page, so it uses vmscan to find something to discard
     (try_to_free_pages()).

 (6) vmscan finds a clean netfs page it might be able to discard (possibly the
     one it's trying to write out).

 (7) The netfs is called to throw the page away (nfs_release_page()) - but it's
     called with __GFP_WAIT, so the netfs decides to wait for the store to
     complete (__fscache_wait_on_page_write()).

 (8) This blocks a slow-work processing thread - possibly against itself.

The system ends up stuck because it can't write out any netfs pages to the
cache without allocating more memory.

To avoid this, we make FS-Cache cancel some writes that aren't in the middle of
actually being performed.  This means that some data won't make it into the
cache this time.  To support this, a new FS-Cache function is added
fscache_maybe_release_page() that replaces what the netfs releasepage()
functions used to do with respect to the cache.

The decisions fscache_maybe_release_page() makes are counted and displayed
through /proc/fs/fscache/stats on a line labelled "VmScan".  There are four
counters provided: "nos=N" - pages that weren't pending storage; "gon=N" -
pages that were pending storage when we first looked, but weren't by the time
we got the object lock; "bsy=N" - pages that we ignored as they were actively
being written when we looked; and "can=N" - pages that we cancelled the storage
of.

What I'd really like to do is alter the behaviour of the cancellation
heuristics, depending on how necessary it is to expel pages.  If there are
plenty of other pages that aren't waiting to be written to the cache that
could be ejected first, then it would be nice to hold up on immediate
cancellation of cache writes - but I don't see a way of doing that.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:35 +00:00
David Howells
e3d4d28b1c FS-Cache: Handle read request vs lookup, creation or other cache failure
FS-Cache doesn't correctly handle the netfs requesting a read from the cache
on an object that failed or was withdrawn by the cache.  A trace similar to
the following might be seen:

	CacheFiles: Lookup failed error -105
	[exe   ] unexpected submission OP165afe [OBJ6cac OBJECT_LC_DYING]
	[exe   ] objstate=OBJECT_LC_DYING [OBJECT_LC_DYING]
	[exe   ] objflags=0
	[exe   ] objevent=9 [fffffffffffffffb]
	[exe   ] ops=0 inp=0 exc=0
	Pid: 6970, comm: exe Not tainted 2.6.32-rc6-cachefs #50
	Call Trace:
	 [<ffffffffa0076477>] fscache_submit_op+0x3ff/0x45a [fscache]
	 [<ffffffffa0077997>] __fscache_read_or_alloc_pages+0x187/0x3c4 [fscache]
	 [<ffffffffa00b6480>] ? nfs_readpage_from_fscache_complete+0x0/0x66 [nfs]
	 [<ffffffffa00b6388>] __nfs_readpages_from_fscache+0x7e/0x176 [nfs]
	 [<ffffffff8108e483>] ? __alloc_pages_nodemask+0x11c/0x5cf
	 [<ffffffffa009d796>] nfs_readpages+0x114/0x1d7 [nfs]
	 [<ffffffff81090314>] __do_page_cache_readahead+0x15f/0x1ec
	 [<ffffffff81090228>] ? __do_page_cache_readahead+0x73/0x1ec
	 [<ffffffff810903bd>] ra_submit+0x1c/0x20
	 [<ffffffff810906bb>] ondemand_readahead+0x227/0x23a
	 [<ffffffff81090762>] page_cache_sync_readahead+0x17/0x19
	 [<ffffffff8108a99e>] generic_file_aio_read+0x236/0x5a0
	 [<ffffffffa00937bd>] nfs_file_read+0xe4/0xf3 [nfs]
	 [<ffffffff810b2fa2>] do_sync_read+0xe3/0x120
	 [<ffffffff81354cc3>] ? _spin_unlock_irq+0x2b/0x31
	 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
	 [<ffffffff811848e5>] ? selinux_file_permission+0x5d/0x10f
	 [<ffffffff81352bdb>] ? thread_return+0x3e/0x101
	 [<ffffffff8117d7b0>] ? security_file_permission+0x11/0x13
	 [<ffffffff810b3b06>] vfs_read+0xaa/0x16f
	 [<ffffffff81058df0>] ? trace_hardirqs_on_caller+0x10c/0x130
	 [<ffffffff810b3c84>] sys_read+0x45/0x6c
	 [<ffffffff8100ae2b>] system_call_fastpath+0x16/0x1b

The object state might also be OBJECT_DYING or OBJECT_WITHDRAWING.

This should be handled by simply rejecting the new operation with ENOBUFS.
There's no need to log an error for it.  Events of this type now appear in the
stats file under Ops:rej.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:32 +00:00
David Howells
285e728b0a FS-Cache: Don't delete pending pages from the page-store tracking tree
Don't delete pending pages from the page-store tracking tree, but rather send
them for another write as they've presumably been updated.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:29 +00:00
David Howells
1bccf513ac FS-Cache: Fix lock misorder in fscache_write_op()
FS-Cache has two structs internally for keeping track of the internal state of
a cached file: the fscache_cookie struct, which represents the netfs's state,
and fscache_object struct, which represents the cache's state.  Each has a
pointer that points to the other (when both are in existence), and each has a
spinlock for pointer maintenance.

Since netfs operations approach these structures from the cookie side, they get
the cookie lock first, then the object lock.  Cache operations, on the other
hand, approach from the object side, and get the object lock first.  It is not
then permitted for a cache operation to get the cookie lock whilst it is
holding the object lock lest deadlock occur; instead, it must do one of two
things:

 (1) increment the cookie usage counter, drop the object lock and then get both
     locks in order, or

 (2) simply hold the object lock as certain parts of the cookie may not be
     altered whilst the object lock is held.

It is also not permitted to follow either pointer without holding the lock at
the end you start with.  To break the pointers between the cookie and the
object, both locks must be held.

fscache_write_op(), however, violates the locking rules: It attempts to get the
cookie lock without (a) checking that the cookie pointer is a valid pointer,
and (b) holding the object lock to protect the cookie pointer whilst it follows
it.  This is so that it can access the pending page store tree without
interference from __fscache_write_page().

This is fixed by splitting the cookie lock, such that the page store tracking
tree is protected by its own lock, and checking that the cookie pointer is
non-NULL before we attempt to follow it whilst holding the object lock.

The new lock is subordinate to both the cookie lock and the object lock, and so
should be taken after those.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:25 +00:00
David Howells
6897e3df8f FS-Cache: The object-available state can't rely on the cookie to be available
The object-available state in the object processing state machine (as
processed by fscache_object_available()) can't rely on the cookie to be
available because the FSCACHE_COOKIE_CREATING bit may have been cleared by
fscache_obtained_object() prior to the object being put into the
FSCACHE_OBJECT_AVAILABLE state.

Clearing the FSCACHE_COOKIE_CREATING bit on a cookie permits
__fscache_relinquish_cookie() to proceed and detach the cookie from the
object.

To deal with this, we don't dereference object->cookie in
fscache_object_available() if the object has already been detached.

In addition, a couple of assertions are added into fscache_drop_object() to
make sure the object is unbound from the cookie before it gets there.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:22 +00:00
David Howells
5753c44188 FS-Cache: Permit cache retrieval ops to be interrupted in the initial wait phase
Permit the operations to retrieve data from the cache or to allocate space in
the cache for future writes to be interrupted whilst they're waiting for
permission for the operation to proceed.  Typically this wait occurs whilst the
cache object is being looked up on disk in the background.

If an interruption occurs, and the operation has not yet been given the
go-ahead to run, the operation is dequeued and cancelled, and control returns
to the read operation of the netfs routine with none of the requested pages
having been read or in any way marked as known by the cache.

This means that the initial wait is done interruptibly rather than
uninterruptibly.

In addition, extra stats values are made available to show the number of ops
cancelled and the number of cache space allocations interrupted.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:19 +00:00
David Howells
b34df792b4 FS-Cache: Use radix tree preload correctly in tracking of pages to be stored
__fscache_write_page() attempts to load the radix tree preallocation pool for
the CPU it is on before calling radix_tree_insert(), as the insertion must be
done inside a pair of spinlocks.

Use of the preallocation pool, however, is contingent on the radix tree being
initialised without __GFP_WAIT specified.  __fscache_acquire_cookie() was
passing GFP_NOFS to INIT_RADIX_TREE() - but that includes __GFP_WAIT.

The solution is to AND out __GFP_WAIT.

Additionally, the banner comment to radix_tree_preload() is altered to make
note of this prerequisite.  Possibly there should be a WARN_ON() too.

Without this fix, I have seen the following recursive deadlock caused by
radix_tree_insert() attempting to allocate memory inside the spinlocked
region, which resulted in FS-Cache being called back into to release memory -
which required the spinlock already held.

=============================================
[ INFO: possible recursive locking detected ]
2.6.32-rc6-cachefs #24
---------------------------------------------
nfsiod/7916 is trying to acquire lock:
 (&cookie->lock){+.+.-.}, at: [<ffffffffa0076872>] __fscache_uncache_page+0xdb/0x160 [fscache]

but task is already holding lock:
 (&cookie->lock){+.+.-.}, at: [<ffffffffa0076acc>] __fscache_write_page+0x15c/0x3f3 [fscache]

other info that might help us debug this:
5 locks held by nfsiod/7916:
 #0:  (nfsiod){+.+.+.}, at: [<ffffffff81048290>] worker_thread+0x19a/0x2e2
 #1:  (&task->u.tk_work#2){+.+.+.}, at: [<ffffffff81048290>] worker_thread+0x19a/0x2e2
 #2:  (&cookie->lock){+.+.-.}, at: [<ffffffffa0076acc>] __fscache_write_page+0x15c/0x3f3 [fscache]
 #3:  (&object->lock#2){+.+.-.}, at: [<ffffffffa0076b07>] __fscache_write_page+0x197/0x3f3 [fscache]
 #4:  (&cookie->stores_lock){+.+...}, at: [<ffffffffa0076b0f>] __fscache_write_page+0x19f/0x3f3 [fscache]

stack backtrace:
Pid: 7916, comm: nfsiod Not tainted 2.6.32-rc6-cachefs #24
Call Trace:
 [<ffffffff8105ac7f>] __lock_acquire+0x1649/0x16e3
 [<ffffffff81059ded>] ? __lock_acquire+0x7b7/0x16e3
 [<ffffffff8100e27d>] ? dump_trace+0x248/0x257
 [<ffffffff8105ad70>] lock_acquire+0x57/0x6d
 [<ffffffffa0076872>] ? __fscache_uncache_page+0xdb/0x160 [fscache]
 [<ffffffff8135467c>] _spin_lock+0x2c/0x3b
 [<ffffffffa0076872>] ? __fscache_uncache_page+0xdb/0x160 [fscache]
 [<ffffffffa0076872>] __fscache_uncache_page+0xdb/0x160 [fscache]
 [<ffffffffa0077eb7>] ? __fscache_check_page_write+0x0/0x71 [fscache]
 [<ffffffffa00b4755>] nfs_fscache_release_page+0x86/0xc4 [nfs]
 [<ffffffffa00907f0>] nfs_release_page+0x3c/0x41 [nfs]
 [<ffffffff81087ffb>] try_to_release_page+0x32/0x3b
 [<ffffffff81092c2b>] shrink_page_list+0x316/0x4ac
 [<ffffffff81058a9b>] ? mark_held_locks+0x52/0x70
 [<ffffffff8135451b>] ? _spin_unlock_irq+0x2b/0x31
 [<ffffffff81093153>] shrink_inactive_list+0x392/0x67c
 [<ffffffff81058a9b>] ? mark_held_locks+0x52/0x70
 [<ffffffff810934ca>] shrink_list+0x8d/0x8f
 [<ffffffff81093744>] shrink_zone+0x278/0x33c
 [<ffffffff81052c70>] ? ktime_get_ts+0xad/0xba
 [<ffffffff8109453b>] try_to_free_pages+0x22e/0x392
 [<ffffffff8109184c>] ? isolate_pages_global+0x0/0x212
 [<ffffffff8108e16b>] __alloc_pages_nodemask+0x3dc/0x5cf
 [<ffffffff810ae24a>] cache_alloc_refill+0x34d/0x6c1
 [<ffffffff811bcf74>] ? radix_tree_node_alloc+0x52/0x5c
 [<ffffffff810ae929>] kmem_cache_alloc+0xb2/0x118
 [<ffffffff811bcf74>] radix_tree_node_alloc+0x52/0x5c
 [<ffffffff811bcfd5>] radix_tree_insert+0x57/0x19c
 [<ffffffffa0076b53>] __fscache_write_page+0x1e3/0x3f3 [fscache]
 [<ffffffffa00b4248>] __nfs_readpage_to_fscache+0x58/0x11e [nfs]
 [<ffffffffa009bb77>] nfs_readpage_release+0x34/0x9b [nfs]
 [<ffffffffa009c0d9>] nfs_readpage_release_full+0x32/0x4b [nfs]
 [<ffffffffa0006cff>] rpc_release_calldata+0x12/0x14 [sunrpc]
 [<ffffffffa0006e2d>] rpc_free_task+0x59/0x61 [sunrpc]
 [<ffffffffa0006f03>] rpc_async_release+0x10/0x12 [sunrpc]
 [<ffffffff810482e5>] worker_thread+0x1ef/0x2e2
 [<ffffffff81048290>] ? worker_thread+0x19a/0x2e2
 [<ffffffff81352433>] ? thread_return+0x3e/0x101
 [<ffffffffa0006ef3>] ? rpc_async_release+0x0/0x12 [sunrpc]
 [<ffffffff8104bff5>] ? autoremove_wake_function+0x0/0x34
 [<ffffffff81058d25>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff810480f6>] ? worker_thread+0x0/0x2e2
 [<ffffffff8104bd21>] kthread+0x7a/0x82
 [<ffffffff8100beda>] child_rip+0xa/0x20
 [<ffffffff8100b87c>] ? restore_args+0x0/0x30
 [<ffffffff8104c2b9>] ? add_wait_queue+0x15/0x44
 [<ffffffff8104bca7>] ? kthread+0x0/0x82
 [<ffffffff8100bed0>] ? child_rip+0x0/0x20

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:14 +00:00
David Howells
7e311a207d FS-Cache: Clear netfs pointers in cookie after detaching object, not before
Clear the pointers from the fscache_cookie struct to netfs private data after
clearing the pointer to the cookie from the fscache_object struct and
releasing the object lock, rather than before.

This allows the netfs private data pointers to be relied on simply by holding
the object lock, rather than having to hold the cookie lock.  This is makes
things simpler as the cookie lock has to be taken before the object lock, but
sometimes the object pointer is all that the code has.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:11 +00:00
David Howells
52bd75fdb1 FS-Cache: Add counters for entry/exit to/from cache operation functions
Count entries to and exits from cache operation table functions.  Maintain
these as a single counter that's added to or removed from as appropriate.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:08 +00:00
David Howells
4fbf4291aa FS-Cache: Allow the current state of all objects to be dumped
Allow the current state of all fscache objects to be dumped by doing:

	cat /proc/fs/fscache/objects

By default, all objects and all fields will be shown.  This can be restricted
by adding a suitable key to one of the caller's keyrings (such as the session
keyring):

	keyctl add user fscache:objlist "<restrictions>" @s

The <restrictions> are:

	K	Show hexdump of object key (don't show if not given)
	A	Show hexdump of object aux data (don't show if not given)

And paired restrictions:

	C	Show objects that have a cookie
	c	Show objects that don't have a cookie
	B	Show objects that are busy
	b	Show objects that aren't busy
	W	Show objects that have pending writes
	w	Show objects that don't have pending writes
	R	Show objects that have outstanding reads
	r	Show objects that don't have outstanding reads
	S	Show objects that have slow work queued
	s	Show objects that don't have slow work queued

If neither side of a restriction pair is given, then both are implied.  For
example:

	keyctl add user fscache:objlist KB @s

shows objects that are busy, and lists their object keys, but does not dump
their auxiliary data.  It also implies "CcWwRrSs", but as 'B' is given, 'b' is
not implied.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:04 +00:00
David Howells
440f0affe2 FS-Cache: Annotate slow-work runqueue proc lines for FS-Cache work items
Annotate slow-work runqueue proc lines for FS-Cache work items.  Objects
include the object ID and the state.  Operations include the object ID, the
operation ID and the operation type and state.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:01 +00:00
David Howells
3d7a641e54 SLOW_WORK: Wait for outstanding work items belonging to a module to clear
Wait for outstanding slow work items belonging to a module to clear when
unregistering that module as a user of the facility.  This prevents the put_ref
code of a work item from being taken away before it returns.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:10:23 +00:00
David S. Miller
3505d1a9fd Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/sfc/sfe4001.c
	drivers/net/wireless/libertas/cmd.c
	drivers/staging/Kconfig
	drivers/staging/Makefile
	drivers/staging/rtl8187se/Kconfig
	drivers/staging/rtl8192e/Kconfig
2009-11-18 22:19:03 -08:00
Eric W. Biederman
6d4561110a sysctl: Drop & in front of every proc_handler.
For consistency drop & in front of every proc_handler.  Explicity
taking the address is unnecessary and it prevents optimizations
like stubbing the proc_handlers to NULL.

Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2009-11-18 08:37:40 -08:00
Peter Zijlstra
978b4053ae fcntl: rename F_OWNER_GID to F_OWNER_PGRP
This is for consistency with various ioctl() operations that include the
suffix "PGRP" in their names, and also for consistency with PRIO_PGRP,
used with setpriority() and getpriority().  Also, using PGRP instead of
GID avoids confusion with the common abbreviation of "group ID".

I'm fine with anything that makes it more consistent, and if PGRP is what
is the predominant abbreviation then I see no need to further confuse
matters by adding a third one.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-11-17 17:40:33 -08:00
Stefani Seibold
9ebd4eba76 procfs: fix /proc/<pid>/stat stack pointer for kernel threads
Fix a small issue for the stack pointer in /proc/<pid>/stat.  In case of a
kernel thread the value of the printed stack pointer should be 0.

Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-11-17 17:40:33 -08:00
Linus Torvalds
ac50e95078 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: copy li_lsn before dropping AIL lock
  XFS bug in log recover with quota (bugzilla id 855)
2009-11-17 09:42:35 -08:00
Linus Torvalds
8a1eaa6a56 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: clear server inode number flag while autodisabling
2009-11-17 09:20:38 -08:00
Linus Torvalds
82abc2a97a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: deleted inconsistent comment in nilfs_load_inode_block()
  nilfs2: deleted struct nilfs_dat_group_desc
  nilfs2: fix lock order reversal in chcp operation
2009-11-17 09:15:18 -08:00
Nathaniel W. Turner
6c06f072c2 xfs: copy li_lsn before dropping AIL lock
Access to log items on the AIL is generally protected by m_ail_lock;
this is particularly needed when we're getting or setting the 64-bit
li_lsn on a 32-bit platform.  This patch fixes a couple places where we
were accessing the log item after dropping the AIL lock on 32-bit
machines.

This can result in a partially-zeroed log->l_tail_lsn if
xfs_trans_ail_delete is racing with xfs_trans_ail_update, and in at
least some cases, this can leave the l_tail_lsn with a zero cycle
number, which means xlog_space_left will think the log is full (unless
CONFIG_XFS_DEBUG is set, in which case we'll trip an ASSERT), leading to
processes stuck forever in xlog_grant_log_space.

Thanks to Adrian VanderSpek for first spotting the race potential and to
Dave Chinner for debug assistance.

Signed-off-by: Nathaniel W. Turner <nate@houseofnate.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-11-17 10:26:49 -06:00
Jan Rekorajski
8ec6dba258 XFS bug in log recover with quota (bugzilla id 855)
Hi,
I was hit by a bug in linux 2.6.31 when XFS is not able to recover the
log after a crash if fs was mounted with quotas. Gory details in XFS
bugzilla: http://oss.sgi.com/bugzilla/show_bug.cgi?id=855.

It looks like wrong struct is used in buffer length check, and the following
patch should fix the problem.

xfs_dqblk_t has a size of 104+32 bytes, while xfs_disk_dquot_t is 104 bytes
long, and this is exactly what I see in system logs - "XFS: dquot too small
(104) in xlog_recover_do_dquot_trans."

Signed-off-by: Jan Rekorajski <baggins@sith.mimuw.edu.pl>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-11-17 10:26:38 -06:00
Eric W. Biederman
bb9074ff58 Merge commit 'v2.6.32-rc7'
Resolve the conflict between v2.6.32-rc7 where dn_def_dev_handler
gets a small bug fix and the sysctl tree where I am removing all
sysctl strategy routines.
2009-11-17 01:01:34 -08:00
Suresh Jayaraman
f534dc9943 cifs: clear server inode number flag while autodisabling
Fix the commit ec06aedd44 that intended to turn off querying for server inode
numbers when server doesn't consistently support inode numbers. Presumably
the commit didn't actually clear the CIFS_MOUNT_SERVER_INUM flag, perhaps a
typo.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Stable <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-11-16 15:24:03 +00:00
Theodore Ts'o
1032988c71 ext4: fix block validity checks so they work correctly with meta_bg
The block validity checks used by ext4_data_block_valid() wasn't
correctly written to check file systems with the meta_bg feature.  Fix
this.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-15 15:29:56 -05:00
Theodore Ts'o
8dadb198cb ext4: fix uninit block bitmap initialization when s_meta_first_bg is non-zero
The number of old-style block group descriptor blocks is
s_meta_first_bg when the meta_bg feature flag is set.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-23 07:24:38 -05:00
Theodore Ts'o
3f8fb9490e ext4: don't update the superblock in ext4_statfs()
commit a71ce8c6c9 updated ext4_statfs()
to update the on-disk superblock counters, but modified this buffer
directly without any journaling of the change.  This is one of the
accesses that was causing the crc errors in journal replay as seen in
kernel.org bugzilla #14354.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-23 07:24:52 -05:00
Eric Sandeen
86ebfd08a1 ext4: journal all modifications in ext4_xattr_set_handle
ext4_xattr_set_handle() was zeroing out an inode outside
of journaling constraints; this is one of the accesses that
was causing the crc errors in journal replay as seen in
kernel.org bugzilla #14354.

Reviewed-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-15 15:30:52 -05:00
Julia Lawall
30c6e07a92 ext4: fix i_flags access in ext4_da_writepages_trans_blocks()
We need to be testing the i_flags field in the ext4 specific portion
of the inode, instead of the (confusingly aliased) i_flags field in
the generic struct inode.

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-15 15:30:58 -05:00
Theodore Ts'o
5068969686 ext4: make sure directory and symlink blocks are revoked
When an inode gets unlinked, the functions ext4_clear_blocks() and
ext4_remove_blocks() call ext4_forget() for all the buffer heads
corresponding to the deleted inode's data blocks.  If the inode is a
directory or a symlink, the is_metadata parameter must be non-zero so
ext4_forget() will revoke them via jbd2_journal_revoke().  Otherwise,
if these blocks are reused for a data file, and the system crashes
before a journal checkpoint, the journal replay could end up
corrupting these data blocks.

Thanks to Curt Wohlgemuth for pointing out potential problems in this
area.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-23 07:17:34 -05:00
Theodore Ts'o
beac2da756 ext4: add tracepoint for ext4_forget()
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-23 07:25:08 -05:00
Theodore Ts'o
cf40db137c ext4: remove failed journal checksum check
Now that we are checking for failed journal checksums in the jbd2
layer, we don't need to check in the ext4 mount path --- since a
checksum fail will result in ext4_load_journal() returning an error,
causing the file system to refuse to be mounted until e2fsck can deal
with the problem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-22 21:00:01 -05:00
Theodore Ts'o
e6a47428de jbd2: don't wipe the journal on a failed journal checksum
If there is a failed journal checksum, don't reset the journal.  This
allows for userspace programs to decide how to recover from this
situation.  It may be that ignoring the journal checksum failure might
be a better way of recovering the file system.  Once we add per-block
checksums, we can definitely do better.  Until then, a system
administrator can try backing up the file system image (or taking a
snapshot) and and trying to determine experimentally whether ignoring
the checksum failure or aborting the journal replay results in less
data loss.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-15 15:31:37 -05:00
Jiro SEKIBA
18dafac1a4 nilfs2: deleted inconsistent comment in nilfs_load_inode_block()
The comment says, "Caller of this function MUST lock s_inode_lock",
however just above the comment, it locks s_inode_lock in the function.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-15 17:17:46 +09:00
Petr Vandrovec
479c2553af Fix memory corruption caused by nfsd readdir+
Commit 8177e6d6df ("nfsd: clean up
readdirplus encoding") introduced single character typo in nfs3 readdir+
implementation.  Unfortunately that typo has quite bad side effects:
random memory corruption, followed (on my box) with immediate
spontaneous box reboot.

Using 'p1' instead of 'p' fixes my Linux box rebooting whenever VMware
ESXi box tries to list contents of my home directory.

Signed-off-by: Petr Vandrovec <petr@vandrovec.name>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-11-14 12:55:55 -08:00
Theodore Ts'o
567f3e9a70 ext4: plug a buffer_head leak in an error path of ext4_iget()
One of the invalid error paths in ext4_iget() forgot to brelse() the
inode buffer head.  Fix it by adding a brelse() in the common error
return path, which also simplifies function.

Thanks to Andi Kleen <ak@linux.intel.com> reporting the problem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-14 08:19:05 -05:00
Akira Fujita
92c28159dc ext4: fix spelling typos in move_extent.c
Fix a few spelling typos in move_extent.c

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.co.jp>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-23 07:24:50 -05:00
Akira Fujita
49bd22bc4d ext4: fix possible recursive locking warning in EXT4_IOC_MOVE_EXT
If CONFIG_PROVE_LOCKING is enabled, the double_down_write_data_sem()
will trigger a false-positive warning of a recursive lock.  Since we
take i_data_sem for the two inodes ordered by their inode numbers,
this isn't a problem.  Use of down_write_nested() will notify the lock
dependency checker machinery that there is no problem here.

This problem was reported by Brian Rogers:

	http://marc.info/?l=linux-ext4&m=125115356928011&w=1

Reported-by: Brian Rogers <brian@xyzw.org>
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-23 07:24:41 -05:00
Akira Fujita
fc04cb49a8 ext4: fix lock order problem in ext4_move_extents()
ext4_move_extents() checks the logical block contiguousness
of original file with ext4_find_extent() and mext_next_extent().
Therefore the extent which ext4_ext_path structure indicates
must not be changed between above functions.

But in current implementation, there is no i_data_sem protection
between ext4_ext_find_extent() and mext_next_extent().  So the extent
which ext4_ext_path structure indicates may be overwritten by
delalloc.  As a result, ext4_move_extents() will exchange wrong blocks
between original and donor files.  I change the place where
acquire/release i_data_sem to solve this problem.

Moreover, I changed move_extent_per_page() to start transaction first,
and then acquire i_data_sem.  Without this change, there is a
possibility of the deadlock between mmap() and ext4_move_extents():

* NOTE: "A", "B" and "C" mean different processes

A-1: ext4_ext_move_extents() acquires i_data_sem of two inodes.

B:   do_page_fault() starts the transaction (T),
     and then tries to acquire i_data_sem.
     But process "A" is already holding it, so it is kept waiting.

C:   While "A" and "B" running, kjournald2 tries to commit transaction (T)
     but it is under updating, so kjournald2 waits for it.

A-2: Call ext4_journal_start with holding i_data_sem,
     but transaction (T) is locked.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-23 07:24:43 -05:00
Akira Fujita
f868a48d06 ext4: fix the returned block count if EXT4_IOC_MOVE_EXT fails
If the EXT4_IOC_MOVE_EXT ioctl fails, the number of blocks that were
exchanged before the failure should be returned to the userspace
caller.  Unfortunately, currently if the block size is not the same as
the page size, the returned block count that is returned is the
page-aligned block count instead of the actual block count.  This
commit addresses this bug.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-23 07:25:48 -05:00
Theodore Ts'o
503358ae01 ext4: avoid divide by zero when trying to mount a corrupted file system
If s_log_groups_per_flex is greater than 31, then groups_per_flex will
will overflow and cause a divide by zero error.  This can cause kernel
BUG if such a file system is mounted.

Thanks to Nageswara R Sastry for analyzing the failure and providing
an initial patch.

http://bugzilla.kernel.org/show_bug.cgi?id=14287

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-23 07:24:46 -05:00
Theodore Ts'o
2de770a406 ext4: fix potential buffer head leak when add_dirent_to_buf() returns ENOSPC
Previously add_dirent_to_buf() did not free its passed-in buffer head
in the case of ENOSPC, since in some cases the caller still needed it.
However, this led to potential buffer head leaks since not all callers
dealt with this correctly.  Fix this by making simplifying the freeing
convention; now add_dirent_to_buf() *never* frees the passed-in buffer
head, and leaves that to the responsibility of its caller.  This makes
things cleaner and easier to prove that the code is neither leaking
buffer heads or calling brelse() one time too many.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Curt Wohlgemuth <curtw@google.com>
Cc: stable@kernel.org
2009-11-23 07:25:49 -05:00
Sunil Mushran
7aee47b0bb ocfs2: Trivial cleanup of jbd compatibility layer removal
Mainline commit 53ef99cad9 removed the
JBD compatibility layer from OCFS2. This patch removes the last remaining
remnants of that.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-11-13 15:45:05 -08:00
Ryusuke Konishi
c1ea985c71 nilfs2: fix lock order reversal in chcp operation
Will fix the following lock order reversal lockdep detected:

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.32-rc6 #7
-------------------------------------------------------
chcp/30157 is trying to acquire lock:
 (&nilfs->ns_mount_mutex){+.+.+.}, at: [<fed7cfcc>] nilfs_cpfile_change_cpmode+0x46/0x752 [nilfs2]

but task is already holding lock:
 (&nilfs->ns_segctor_sem){++++.+}, at: [<fed7ca32>] nilfs_transaction_begin+0xba/0x110 [nilfs2]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&nilfs->ns_segctor_sem){++++.+}:
       [<c105799c>] __lock_acquire+0x109c/0x139d
       [<c1057d26>] lock_acquire+0x89/0xa0
       [<c14151e2>] down_read+0x31/0x45
       [<fed6d77b>] nilfs_attach_checkpoint+0x8f/0x16b [nilfs2]
       [<fed6e393>] nilfs_get_sb+0x3e7/0x653 [nilfs2]
       [<c10c0ccb>] vfs_kern_mount+0x8b/0x124
       [<c10c0db2>] do_kern_mount+0x37/0xc3
       [<c10d7517>] do_mount+0x64d/0x69d
       [<c10d75cd>] sys_mount+0x66/0x95
       [<c1002a14>] sysenter_do_call+0x12/0x32

-> #1 (&type->s_umount_key#31/1){+.+.+.}:
       [<c105799c>] __lock_acquire+0x109c/0x139d
       [<c1057d26>] lock_acquire+0x89/0xa0
       [<c104c0f3>] down_write_nested+0x34/0x52
       [<c10c08fe>] sget+0x22e/0x389
       [<fed6e133>] nilfs_get_sb+0x187/0x653 [nilfs2]
       [<c10c0ccb>] vfs_kern_mount+0x8b/0x124
       [<c10c0db2>] do_kern_mount+0x37/0xc3
       [<c10d7517>] do_mount+0x64d/0x69d
       [<c10d75cd>] sys_mount+0x66/0x95
       [<c1002a14>] sysenter_do_call+0x12/0x32

-> #0 (&nilfs->ns_mount_mutex){+.+.+.}:
       [<c1057727>] __lock_acquire+0xe27/0x139d
       [<c1057d26>] lock_acquire+0x89/0xa0
       [<c1414d63>] mutex_lock_nested+0x41/0x23e
       [<fed7cfcc>] nilfs_cpfile_change_cpmode+0x46/0x752 [nilfs2]
       [<fed801b2>] nilfs_ioctl+0x11a/0x7da [nilfs2]
       [<c10cca12>] vfs_ioctl+0x27/0x6e
       [<c10ccf93>] do_vfs_ioctl+0x491/0x4db
       [<c10cd022>] sys_ioctl+0x45/0x5f
       [<c1002a14>] sysenter_do_call+0x12/0x32

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-11-13 10:33:24 +09:00
Mike Hommey
e04b5ef8b4 __generic_block_fiemap(): fix for files bigger than 4GB
Because of an integer overflow on start_blk, various kind of wrong results
would be returned by the generic_block_fiemap() handler, such as no
extents when there is a 4GB+ hole at the beginning of the file, or wrong
fe_logical when an extent starts after the first 4GB.

Signed-off-by: Mike Hommey <mh@glandium.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Eric Sandeen <sandeen@sgi.com>
Cc: Josef Bacik <jbacik@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-11-12 07:26:01 -08:00
Anton Blanchard
fc63cf2370 exec: setup_arg_pages() fails to return errors
In setup_arg_pages we work hard to assign a value to ret, but on exit we
always return 0.

Also remove a now duplicated exit path and branch to out_unlock instead.

Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-11-12 07:25:58 -08:00
Heiko Carstens
7779d7bed9 fs: add missing compat_ptr handling for FS_IOC_RESVSP ioctl
For FS_IOC_RESVSP and FS_IOC_RESVSP64 compat_sys_ioctl() uses its
arg argument as a pointer to userspace. However it is missing a
a call to compat_ptr() which will do a proper pointer conversion.

This was introduced with 3e63cbb1 "fs: Add new pre-allocation ioctls
to vfs for compatibility with legacy xfs ioctls".

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ankit Jain <me@ankitjain.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Arnd Bergmann <arndbergmann@googlemail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: <stable@kernel.org>		[2.6.31.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-11-12 07:25:57 -08:00
Sukadev Bhattiprolu
29f12ca321 pidns: fix a leak in /proc dentries and inodes with pid namespaces.
Daniel Lezcano reported a leak in 'struct pid' and 'struct pid_namespace'
that is discussed in:

	http://lkml.org/lkml/2009/10/2/159.

To summarize the thread, when container-init is terminated, it sets the
PF_EXITING flag, zaps other processes in the container and waits to reap
them.  As a part of reaping, the container-init should flush any /proc
dentries associated with the processes.  But because the container-init is
itself exiting and the following PF_EXITING check, the dentries are not
flushed, resulting in leak in /proc inodes and dentries.

This fix reverts the commit 7766755a2f ("Fix /proc dcache deadlock
in do_exit") which introduced the check for PF_EXITING.  At the time of
the commit, shrink_dcache_parent() flushed dentries from other filesystems
also and could have caused a deadlock which the commit fixed.  But as
pointed out by Eric Biederman, after commit 0feae5c47a,
shrink_dcache_parent() no longer affects other filesystems.  So reverting
the commit is now safe.

As pointed out by Jan Kara, the leak is not as critical since the
unclaimed space will be reclaimed under memory pressure or by:

	echo 3 > /proc/sys/vm/drop_caches

But since this check is no longer required, its best to remove it.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Reported-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Jan Kara <jack@ucw.cz>
Cc: Andrea Arcangeli <andrea@cpushare.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-11-12 07:25:57 -08:00
Eric W. Biederman
ab09203e30 sysctl fs: Remove dead binary sysctl support
Now that sys_sysctl is a generic wrapper around /proc/sys  .ctl_name
and .strategy members of sysctl tables are dead code.  Remove them.

Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2009-11-12 02:04:55 -08:00
Stefan Schmidt
ff5e4b51a3 fs/jbd: Export log_start_commit to fix ext3 build.
This fixes:
ERROR: "log_start_commit" [fs/ext3/ext3.ko] undefined!

Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2009-11-12 10:24:12 +01:00
Linus Torvalds
aa021baa32 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix panic when trying to destroy a newly allocated
  Btrfs: allow more metadata chunk preallocation
  Btrfs: fallback on uncompressed io if compressed io fails
  Btrfs: find ideal block group for caching
  Btrfs: avoid null deref in unpin_extent_cache()
  Btrfs: skip btrfs_release_path in btrfs_update_root and btrfs_del_root
  Btrfs: fix some metadata enospc issues
  Btrfs: fix how we set max_size for free space clusters
  Btrfs: cleanup transaction starting and fix journal_info usage
  Btrfs: fix data allocation hint start
2009-11-11 13:38:59 -08:00
Josef Bacik
a6dbd429d8 Btrfs: fix panic when trying to destroy a newly allocated
There is a problem where iget5_locked will look for an inode, not find it, and
then subsequently try to allocate it.  Another CPU will have raced in and
allocated the inode instead, so when iget5_locked gets the inode spin lock again
and does a search, it finds the new inode.  So it goes ahead and calls
destroy_inode on the inode it just allocated.  The problem is we don't set
BTRFS_I(inode)->root until the new inode is completely initialized.  This patch
makes us set root to NULL when alloc'ing a new inode, so when we get to
btrfs_destroy_inode and we see that root is NULL we can just free up the memory
and continue on.  This fixes the panic

http://www.kerneloops.org/submitresult.php?number=812690

Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 15:53:34 -05:00
Linus Torvalds
fd801452a3 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  JBD/JBD2: free j_wbuf if journal init fails.
  ext3: Wait for proper transaction commit on fsync
  ext3: retry failed direct IO allocations
2009-11-11 11:52:22 -08:00
Linus Torvalds
16fe4101ae Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: partial revert to fix double brelse WARNING()
  ext4: Fix return value of ext4_split_unwritten_extents() to fix direct I/O
  ext4: code clean up for dio fallocate handling
  ext4: skip conversion of uninit extents after direct IO if there isn't any
  ext4: fix ext4_ext_direct_IO()'s return value after converting uninit extents
  ext4: discard preallocation when restarting a transaction during truncate
2009-11-11 11:28:11 -08:00
Chris Mason
33b2580864 Btrfs: allow more metadata chunk preallocation
On an FS where all of the space has not been allocated into chunks yet,
the enospc can return enospc just because the existing metadata chunks
are full.

We get around this by allowing more metadata chunks to be allocated up
to a certain limit, and finding the right limit is a little fuzzy.  The
problem is the reservations for delalloc would preallocate way too much
of the FS as metadata.  We need to start saying no and just force some
IO to happen.

But we also need to let a reasonable amount of the FS become metadata.
This bumps the hard limit up, later releases will have a better system.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 14:20:20 -05:00
Josef Bacik
f5a84ee3cd Btrfs: fallback on uncompressed io if compressed io fails
Currently compressed IO does not deal with not having its entire extent able to
be allocated.  So if we have enough free space to allocate for the extent, but
its not contiguous, it will fail spectacularly.  This patch fixes this by
falling back on uncompressed IO which lets us spread the delalloc extent across
multiple extents.  I tested this by making us randomly think the reservation had
failed to make it fallback on the uncompressed io way and it seemed to work
fine.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 14:20:20 -05:00
Josef Bacik
ccf0e72537 Btrfs: find ideal block group for caching
This patch changes a few things.  Hopefully the comments are helpfull, but
I'll try and be as verbose here.

Problem:

My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space.  The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system.  So

Solution:

1) Don't cache block groups willy-nilly at first.  Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.

2) Don't be so picky about cached block groups.  The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more.  So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.

There is one other tweak in here.  Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation.  This could make us end up with way too many chunks, so keep
track of this particular case.

With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 14:20:19 -05:00
Dan Carpenter
4eb3991c5d Btrfs: avoid null deref in unpin_extent_cache()
I re-orderred the checks to avoid dereferencing "em" if it was null.

Found by smatch static checker.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 14:20:18 -05:00
Li Dongyang
df66916e71 Btrfs: skip btrfs_release_path in btrfs_update_root and btrfs_del_root
We don't need to call btrfs_release_path because btrfs_free_path will do
that for us.

Signed-off-by: Li Dongyang <Jerry87905@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 14:20:18 -05:00
Josef Bacik
5df6a9f606 Btrfs: fix some metadata enospc issues
We weren't reserving metadata space for rename, rmdir and unlink, which could
cause problems.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 14:20:17 -05:00
Josef Bacik
01dea1efc2 Btrfs: fix how we set max_size for free space clusters
This patch fixes a problem where max_size can be set to 0 even though we
filled the cluster properly.  We set max_size to 0 if we restart the cluster
window, but if the new start entry is big enough to be our new cluster then we
could return with a max_size set to 0, which will mean the next time we try to
allocate from this cluster it will fail.  So set max_extent to the entry's
size.  Tested this on my box and now we actually allocate from the cluster
after we fill it.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 14:20:17 -05:00
Josef Bacik
249ac1e55c Btrfs: cleanup transaction starting and fix journal_info usage
We use journal_info to tell if we're in a nested transaction to make sure we
don't commit the transaction within a nested transaction.  We use another
method to see if there are any outstanding ioctl trans handles, so if we're
starting one do not set current->journal_info, since it will screw with other
filesystems.  This patch also cleans up the starting stuff so there aren't any
magic numbers.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 14:20:16 -05:00
Josef Bacik
6346c93988 Btrfs: fix data allocation hint start
Sometimes our start allocation hint when we cow a file can be either
EXTENT_HOLE or some other such place holder, which is not optimal.  So if we
find that our em->block_start is one of these special values, check to see
where the first block of the inode is stored, and use that as a hint.  If that
block is also a special value, just fallback on a hint of 0 and let the
allocator figure out a good place to put the data.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 14:20:16 -05:00
Tao Ma
7b02bec07e JBD/JBD2: free j_wbuf if journal init fails.
If journal init fails, we need to free j_wbuf.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-11-11 15:24:14 +01:00
Jan Kara
fe8bc91c4c ext3: Wait for proper transaction commit on fsync
We cannot rely on buffer dirty bits during fsync because pdflush can come
before fsync is called and clear dirty bits without forcing a transaction
commit. What we do is that we track which transaction has last changed
the inode and which transaction last changed allocation and force it to
disk on fsync.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2009-11-11 15:22:49 +01:00
Eric Sandeen
ea0174a713 ext3: retry failed direct IO allocations
On a 256M 4k block filesystem, doing this in a loop:

    dd if=/dev/zero of=test oflag=direct bs=1M count=64
    rm -f test

eventually leads to spurious ENOSPC:

    dd: writing `test': No space left on device

As with other block allocation callers, it looks like we need to
potentially retry the allocations on the initial ENOSPC.

A similar patch went into ext4 (commit
fbbf694566)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-11-11 15:22:49 +01:00
Eric W. Biederman
2315ffa0a9 sysctl: Don't look at ctl_name and strategy in the generic code
The ctl_name and strategy fields are unused, now that sys_sysctl
is a compatibility wrapper around /proc/sys.  No longer looking
at them in the generic code is effectively what we are doing
now and provides the guarantee that during further cleanups
we can just remove references to those fields and everything
will work ok.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2009-11-11 00:53:43 -08:00
Trond Myklebust
96d25e5322 NFSv4: Fix a cache validation bug which causes getcwd() to return ENOENT
Changeset a65318bf3a (NFSv4: Simplify some
cache consistency post-op GETATTRs) incorrectly changed the getattr
bitmap for readdir().
This causes the readdir() function to fail to return a
fileid/inode number, which again exposed a bug in the NFS readdir code that
causes spurious ENOENT errors to appear in applications (see
http://bugzilla.kernel.org/show_bug.cgi?id=14541).

The immediate band aid is to revert the incorrect bitmap change, but more
long term, we should change the NFS readdir code to cope with the
fact that NFSv4 servers are not required to support fileids/inode numbers.

Reported-by: Daniel J Blueman <daniel.blueman@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-11-11 16:15:42 +09:00
Martin K. Petersen
86b3728141 block: Expose discard granularity
While SSDs track block usage on a per-sector basis, RAID arrays often
have allocation blocks that are bigger.  Allow the discard granularity
and alignment to be set and teach the topology stacking logic how to
handle them.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-11-10 11:50:21 +01:00
Linus Torvalds
b7b69c7e97 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: fix missing cleanup of gc cache on error cases
  nilfs2: fix kernel oops in error case of nilfs_ioctl_move_blocks
2009-11-09 09:52:55 -08:00