aha/fs
David Howells 201a15428b FS-Cache: Handle pages pending storage that get evicted under OOM conditions
Handle netfs pages that the vmscan algorithm wants to evict from the pagecache
under OOM conditions, but that are waiting for write to the cache.  Under these
conditions, vmscan calls the releasepage() function of the netfs, asking if a
page can be discarded.

The problem is typified by the following trace of a stuck process:

	kslowd005     D 0000000000000000     0  4253      2 0x00000080
	 ffff88001b14f370 0000000000000046 ffff880020d0d000 0000000000000007
	 0000000000000006 0000000000000001 ffff88001b14ffd8 ffff880020d0d2a8
	 000000000000ddf0 00000000000118c0 00000000000118c0 ffff880020d0d2a8
	Call Trace:
	 [<ffffffffa00782d8>] __fscache_wait_on_page_write+0x8b/0xa7 [fscache]
	 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
	 [<ffffffffa0078240>] ? __fscache_check_page_write+0x63/0x70 [fscache]
	 [<ffffffffa00b671d>] nfs_fscache_release_page+0x4e/0xc4 [nfs]
	 [<ffffffffa00927f0>] nfs_release_page+0x3c/0x41 [nfs]
	 [<ffffffff810885d3>] try_to_release_page+0x32/0x3b
	 [<ffffffff81093203>] shrink_page_list+0x316/0x4ac
	 [<ffffffff8109372b>] shrink_inactive_list+0x392/0x67c
	 [<ffffffff813532fa>] ? __mutex_unlock_slowpath+0x100/0x10b
	 [<ffffffff81058df0>] ? trace_hardirqs_on_caller+0x10c/0x130
	 [<ffffffff8135330e>] ? mutex_unlock+0x9/0xb
	 [<ffffffff81093aa2>] shrink_list+0x8d/0x8f
	 [<ffffffff81093d1c>] shrink_zone+0x278/0x33c
	 [<ffffffff81052d6c>] ? ktime_get_ts+0xad/0xba
	 [<ffffffff81094b13>] try_to_free_pages+0x22e/0x392
	 [<ffffffff81091e24>] ? isolate_pages_global+0x0/0x212
	 [<ffffffff8108e743>] __alloc_pages_nodemask+0x3dc/0x5cf
	 [<ffffffff81089529>] grab_cache_page_write_begin+0x65/0xaa
	 [<ffffffff8110f8c0>] ext3_write_begin+0x78/0x1eb
	 [<ffffffff81089ec5>] generic_file_buffered_write+0x109/0x28c
	 [<ffffffff8103cb69>] ? current_fs_time+0x22/0x29
	 [<ffffffff8108a509>] __generic_file_aio_write+0x350/0x385
	 [<ffffffff8108a588>] ? generic_file_aio_write+0x4a/0xae
	 [<ffffffff8108a59e>] generic_file_aio_write+0x60/0xae
	 [<ffffffff810b2e82>] do_sync_write+0xe3/0x120
	 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
	 [<ffffffff810b18e1>] ? __dentry_open+0x1a5/0x2b8
	 [<ffffffff810b1a76>] ? dentry_open+0x82/0x89
	 [<ffffffffa00e693c>] cachefiles_write_page+0x298/0x335 [cachefiles]
	 [<ffffffffa0077147>] fscache_write_op+0x178/0x2c2 [fscache]
	 [<ffffffffa0075656>] fscache_op_execute+0x7a/0xd1 [fscache]
	 [<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
	 [<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
	 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
	 [<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
	 [<ffffffff8104be91>] kthread+0x7a/0x82
	 [<ffffffff8100beda>] child_rip+0xa/0x20
	 [<ffffffff8100b87c>] ? restore_args+0x0/0x30
	 [<ffffffff8102ef83>] ? tg_shares_up+0x171/0x227
	 [<ffffffff8104be17>] ? kthread+0x0/0x82
	 [<ffffffff8100bed0>] ? child_rip+0x0/0x20

In the above backtrace, the following is happening:

 (1) A page storage operation is being executed by a slow-work thread
     (fscache_write_op()).

 (2) FS-Cache farms the operation out to the cache to perform
     (cachefiles_write_page()).

 (3) CacheFiles is then calling Ext3 to perform the actual write, using Ext3's
     standard write (do_sync_write()) under KERNEL_DS directly from the netfs
     page.

 (4) However, for Ext3 to perform the write, it must allocate some memory, in
     particular, it must allocate at least one page cache page into which it
     can copy the data from the netfs page.

 (5) Under OOM conditions, the memory allocator can't immediately come up with
     a page, so it uses vmscan to find something to discard
     (try_to_free_pages()).

 (6) vmscan finds a clean netfs page it might be able to discard (possibly the
     one it's trying to write out).

 (7) The netfs is called to throw the page away (nfs_release_page()) - but it's
     called with __GFP_WAIT, so the netfs decides to wait for the store to
     complete (__fscache_wait_on_page_write()).

 (8) This blocks a slow-work processing thread - possibly against itself.

The system ends up stuck because it can't write out any netfs pages to the
cache without allocating more memory.

To avoid this, we make FS-Cache cancel some writes that aren't in the middle of
actually being performed.  This means that some data won't make it into the
cache this time.  To support this, a new FS-Cache function is added
fscache_maybe_release_page() that replaces what the netfs releasepage()
functions used to do with respect to the cache.

The decisions fscache_maybe_release_page() makes are counted and displayed
through /proc/fs/fscache/stats on a line labelled "VmScan".  There are four
counters provided: "nos=N" - pages that weren't pending storage; "gon=N" -
pages that were pending storage when we first looked, but weren't by the time
we got the object lock; "bsy=N" - pages that we ignored as they were actively
being written when we looked; and "can=N" - pages that we cancelled the storage
of.

What I'd really like to do is alter the behaviour of the cancellation
heuristics, depending on how necessary it is to expel pages.  If there are
plenty of other pages that aren't waiting to be written to the cache that
could be ejected first, then it would be nice to hold up on immediate
cancellation of cache writes - but I don't see a way of doing that.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:35 +00:00
..
9p FS-Cache: Handle pages pending storage that get evicted under OOM conditions 2009-11-19 18:11:35 +00:00
adfs adfs: remove redundant test on unsigned 2009-09-24 07:21:05 -07:00
affs
afs FS-Cache: Handle pages pending storage that get evicted under OOM conditions 2009-11-19 18:11:35 +00:00
autofs trivial: remove unnecessary semicolons 2009-09-21 15:14:58 +02:00
autofs4 autofs4 - fix missed case when changing to use struct path 2009-08-31 17:44:05 -10:00
befs fs: Make unload_nls() NULL pointer safe 2009-09-24 07:47:42 -04:00
bfs
btrfs Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable 2009-11-11 13:38:59 -08:00
cachefiles FS-Cache: Allow the current state of all objects to be dumped 2009-11-19 18:11:04 +00:00
cifs cifs: clear server inode number flag while autodisabling 2009-11-16 15:24:03 +00:00
coda headers: remove sched.h from poll.h 2009-10-04 15:05:10 -07:00
configfs writeback: add name to backing_dev_info 2009-09-11 09:20:26 +02:00
cramfs
debugfs
devpts Move magic numbers into magic.h 2009-09-23 07:39:28 -07:00
dlm dlm: fix socket fd translation 2009-09-30 12:19:44 -05:00
ecryptfs ima: ecryptfs fix imbalance message 2009-10-08 11:31:38 -05:00
efs
exofs exofs: remove BKL from super operations 2009-09-24 07:47:38 -04:00
exportfs
ext2 Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 2009-09-24 07:53:22 -07:00
ext3 ext3: Wait for proper transaction commit on fsync 2009-11-11 15:22:49 +01:00
ext4 ext4: partial revert to fix double brelse WARNING() 2009-11-08 15:45:44 -05:00
fat Merge git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6 2009-09-30 09:31:14 -07:00
freevxfs
fscache FS-Cache: Handle pages pending storage that get evicted under OOM conditions 2009-11-19 18:11:35 +00:00
fuse fuse: invalidate target of rename 2009-11-04 10:24:52 +01:00
gfs2 SLOW_WORK: Wait for outstanding work items belonging to a module to clear 2009-11-19 18:10:23 +00:00
hfs hfs: fix oops on mount with corrupted btree extent records 2009-10-29 07:39:29 -07:00
hfsplus hfsplus: refuse to mount volumes larger than 2TB 2009-10-29 07:39:27 -07:00
hostfs
hpfs
hppfs
hugetlbfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2009-09-24 08:32:11 -07:00
isofs fs: Make unload_nls() NULL pointer safe 2009-09-24 07:47:42 -04:00
jbd fs/jbd: Export log_start_commit to fix ext3 build. 2009-11-12 10:24:12 +01:00
jbd2 JBD/JBD2: free j_wbuf if journal init fails. 2009-11-11 15:24:14 +01:00
jffs2 Merge git://git.infradead.org/mtd-2.6 2009-09-23 10:07:49 -07:00
jfs fs: Make unload_nls() NULL pointer safe 2009-09-24 07:47:42 -04:00
lockd headers: utsname.h redux 2009-09-23 18:13:10 -07:00
minix V3 minixfs: add missing directory type checking 2009-09-23 07:39:57 -07:00
ncpfs const: mark struct vm_struct_operations 2009-09-27 11:39:25 -07:00
nfs FS-Cache: Handle pages pending storage that get evicted under OOM conditions 2009-11-19 18:11:35 +00:00
nfs_common
nfsd Fix memory corruption caused by nfsd readdir+ 2009-11-14 12:55:55 -08:00
nilfs2 nilfs2: deleted inconsistent comment in nilfs_load_inode_block() 2009-11-15 17:17:46 +09:00
nls Merge git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6 2009-09-30 09:31:14 -07:00
notify dnotify: ignore FS_EVENT_ON_CHILD 2009-10-20 18:02:33 -04:00
ntfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2009-09-24 08:32:11 -07:00
ocfs2 const: constify remaining file_operations 2009-10-01 16:11:11 -07:00
omfs const: constify remaining file_operations 2009-10-01 16:11:11 -07:00
openpromfs
partitions block: Seperate read and write statistics of in_flight requests v2 2009-10-06 20:16:55 +02:00
proc procfs: fix /proc/<pid>/stat stack pointer for kernel threads 2009-11-17 17:40:33 -08:00
qnx4 qnx4: remove write support 2009-09-23 07:39:30 -07:00
quota const: make struct super_block::s_qcop const 2009-09-22 07:17:24 -07:00
ramfs truncate: use new helpers 2009-09-24 08:41:47 -04:00
reiserfs const: make struct super_block::s_qcop const 2009-09-22 07:17:24 -07:00
romfs ROMFS: fix length used with romfs_dev_strnlen() function 2009-10-11 11:33:56 -07:00
smbfs fs: Make unload_nls() NULL pointer safe 2009-09-24 07:47:42 -04:00
squashfs const: mark remaining super_operations const 2009-09-22 07:17:24 -07:00
sysfs sysfs: Don't leak secdata when a sysfs_dirent is freed. 2009-11-05 08:19:18 +11:00
sysv
ubifs const: mark struct vm_struct_operations 2009-09-27 11:39:25 -07:00
udf udf: Fix possible corruption when close races with write 2009-09-14 19:13:01 +02:00
ufs
xfs xfs: copy li_lsn before dropping AIL lock 2009-11-17 10:26:49 -06:00
aio.c aio.c: move EXPORT* macros to line after function 2009-09-23 07:39:29 -07:00
anon_inodes.c headers: remove sched.h from poll.h 2009-10-04 15:05:10 -07:00
attr.c truncate: new helpers 2009-09-24 08:41:47 -04:00
bad_inode.c
binfmt_aout.c
binfmt_elf.c elf: clean up fill_note_info() 2009-09-24 07:21:01 -07:00
binfmt_elf_fdpic.c fdpic: ignore the loader's PT_GNU_STACK when calculating the stack size 2009-09-24 07:21:02 -07:00
binfmt_em86.c
binfmt_flat.c flat: use IS_ERR_VALUE() helper macro 2009-09-24 07:21:03 -07:00
binfmt_misc.c
binfmt_script.c
binfmt_som.c
bio-integrity.c
bio.c Fix bio_alloc() and bio_kmalloc() documentation 2009-11-02 11:41:13 +01:00
block_dev.c block: use after free bug in __blkdev_get 2009-10-26 15:27:11 +01:00
buffer.c Merge branch 'writeback' of git://git.kernel.dk/linux-2.6-block 2009-09-25 09:27:30 -07:00
char_dev.c fs/char_dev.c: remove useless loop 2009-09-24 07:21:03 -07:00
compat.c x86, fs: Fix x86 procfs stack information for threads on 64-bit 2009-11-04 13:25:03 +01:00
compat_binfmt_elf.c
compat_ioctl.c fs: add missing compat_ptr handling for FS_IOC_RESVSP ioctl 2009-11-12 07:25:57 -08:00
dcache.c
dcookies.c
direct-io.c
drop_caches.c sysctl: remove "struct file *" argument of ->proc_handler 2009-09-24 07:21:04 -07:00
eventfd.c anonfd: split interface into file creation and install 2009-09-23 07:39:29 -07:00
eventpoll.c
exec.c exec: setup_arg_pages() fails to return errors 2009-11-12 07:25:58 -08:00
fcntl.c fcntl: rename F_OWNER_GID to F_OWNER_PGRP 2009-11-17 17:40:33 -08:00
fifo.c
file.c headers: remove sched.h from interrupt.h 2009-10-11 11:20:58 -07:00
file_table.c sysctl: remove "struct file *" argument of ->proc_handler 2009-09-24 07:21:04 -07:00
filesystems.c
fs-writeback.c writeback: pass in super_block to bdi_start_writeback() 2009-09-26 00:10:40 +02:00
fs_struct.c
generic_acl.c
inode.c vfs: optimize touch_time() too 2009-09-24 07:47:27 -04:00
internal.h fs: fix overflow in sys_mount() for in-kernel calls 2009-09-24 08:40:15 -04:00
ioctl.c __generic_block_fiemap(): fix for files bigger than 4GB 2009-11-12 07:26:01 -08:00
ioprio.c
Kconfig powerpc: Cleanup Kconfig selection of hugetlbfs support 2009-10-30 15:03:54 +11:00
Kconfig.binfmt
libfs.c libfs: return error code on failed attr set 2009-09-24 07:47:30 -04:00
locks.c const: make lock_manager_operations const 2009-09-22 07:17:25 -07:00
Makefile
mbcache.c
mpage.c
namei.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 2009-09-11 08:55:49 -07:00
namespace.c fs: fix overflow in sys_mount() for in-kernel calls 2009-09-24 08:40:15 -04:00
nfsctl.c
no-block.c
open.c fs: change sys_truncate length parameter type 2009-09-23 09:21:05 -07:00
pipe.c fs: pipe.c null pointer dereference 2009-10-22 08:11:44 +09:00
pnode.c
pnode.h
posix_acl.c
read_write.c vfs: remove redundant position check in do_sendfile 2009-09-24 07:47:34 -04:00
read_write.h
readdir.c
select.c headers: remove sched.h from poll.h 2009-10-04 15:05:10 -07:00
seq_file.c vfs: seq_file: add helpers for data filling 2009-09-24 07:47:35 -04:00
signalfd.c
splice.c Merge branch 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block 2009-09-14 17:55:15 -07:00
stack.c
stat.c
super.c freeze_bdev: grab active reference to frozen superblocks 2009-09-24 07:47:41 -04:00
sync.c fs/buffer.c: clean up EXPORT* macros 2009-09-23 07:39:29 -07:00
timerfd.c
utimes.c
xattr.c VFS: Factor out part of vfs_setxattr so it can be called from the SELinux hook for inode_setsecctx. 2009-09-10 10:11:22 +10:00
xattr_acl.c