aha/Documentation/filesystems
David Howells 201a15428b FS-Cache: Handle pages pending storage that get evicted under OOM conditions
Handle netfs pages that the vmscan algorithm wants to evict from the pagecache
under OOM conditions, but that are waiting for write to the cache.  Under these
conditions, vmscan calls the releasepage() function of the netfs, asking if a
page can be discarded.

The problem is typified by the following trace of a stuck process:

	kslowd005     D 0000000000000000     0  4253      2 0x00000080
	 ffff88001b14f370 0000000000000046 ffff880020d0d000 0000000000000007
	 0000000000000006 0000000000000001 ffff88001b14ffd8 ffff880020d0d2a8
	 000000000000ddf0 00000000000118c0 00000000000118c0 ffff880020d0d2a8
	Call Trace:
	 [<ffffffffa00782d8>] __fscache_wait_on_page_write+0x8b/0xa7 [fscache]
	 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
	 [<ffffffffa0078240>] ? __fscache_check_page_write+0x63/0x70 [fscache]
	 [<ffffffffa00b671d>] nfs_fscache_release_page+0x4e/0xc4 [nfs]
	 [<ffffffffa00927f0>] nfs_release_page+0x3c/0x41 [nfs]
	 [<ffffffff810885d3>] try_to_release_page+0x32/0x3b
	 [<ffffffff81093203>] shrink_page_list+0x316/0x4ac
	 [<ffffffff8109372b>] shrink_inactive_list+0x392/0x67c
	 [<ffffffff813532fa>] ? __mutex_unlock_slowpath+0x100/0x10b
	 [<ffffffff81058df0>] ? trace_hardirqs_on_caller+0x10c/0x130
	 [<ffffffff8135330e>] ? mutex_unlock+0x9/0xb
	 [<ffffffff81093aa2>] shrink_list+0x8d/0x8f
	 [<ffffffff81093d1c>] shrink_zone+0x278/0x33c
	 [<ffffffff81052d6c>] ? ktime_get_ts+0xad/0xba
	 [<ffffffff81094b13>] try_to_free_pages+0x22e/0x392
	 [<ffffffff81091e24>] ? isolate_pages_global+0x0/0x212
	 [<ffffffff8108e743>] __alloc_pages_nodemask+0x3dc/0x5cf
	 [<ffffffff81089529>] grab_cache_page_write_begin+0x65/0xaa
	 [<ffffffff8110f8c0>] ext3_write_begin+0x78/0x1eb
	 [<ffffffff81089ec5>] generic_file_buffered_write+0x109/0x28c
	 [<ffffffff8103cb69>] ? current_fs_time+0x22/0x29
	 [<ffffffff8108a509>] __generic_file_aio_write+0x350/0x385
	 [<ffffffff8108a588>] ? generic_file_aio_write+0x4a/0xae
	 [<ffffffff8108a59e>] generic_file_aio_write+0x60/0xae
	 [<ffffffff810b2e82>] do_sync_write+0xe3/0x120
	 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
	 [<ffffffff810b18e1>] ? __dentry_open+0x1a5/0x2b8
	 [<ffffffff810b1a76>] ? dentry_open+0x82/0x89
	 [<ffffffffa00e693c>] cachefiles_write_page+0x298/0x335 [cachefiles]
	 [<ffffffffa0077147>] fscache_write_op+0x178/0x2c2 [fscache]
	 [<ffffffffa0075656>] fscache_op_execute+0x7a/0xd1 [fscache]
	 [<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
	 [<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
	 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
	 [<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
	 [<ffffffff8104be91>] kthread+0x7a/0x82
	 [<ffffffff8100beda>] child_rip+0xa/0x20
	 [<ffffffff8100b87c>] ? restore_args+0x0/0x30
	 [<ffffffff8102ef83>] ? tg_shares_up+0x171/0x227
	 [<ffffffff8104be17>] ? kthread+0x0/0x82
	 [<ffffffff8100bed0>] ? child_rip+0x0/0x20

In the above backtrace, the following is happening:

 (1) A page storage operation is being executed by a slow-work thread
     (fscache_write_op()).

 (2) FS-Cache farms the operation out to the cache to perform
     (cachefiles_write_page()).

 (3) CacheFiles is then calling Ext3 to perform the actual write, using Ext3's
     standard write (do_sync_write()) under KERNEL_DS directly from the netfs
     page.

 (4) However, for Ext3 to perform the write, it must allocate some memory, in
     particular, it must allocate at least one page cache page into which it
     can copy the data from the netfs page.

 (5) Under OOM conditions, the memory allocator can't immediately come up with
     a page, so it uses vmscan to find something to discard
     (try_to_free_pages()).

 (6) vmscan finds a clean netfs page it might be able to discard (possibly the
     one it's trying to write out).

 (7) The netfs is called to throw the page away (nfs_release_page()) - but it's
     called with __GFP_WAIT, so the netfs decides to wait for the store to
     complete (__fscache_wait_on_page_write()).

 (8) This blocks a slow-work processing thread - possibly against itself.

The system ends up stuck because it can't write out any netfs pages to the
cache without allocating more memory.

To avoid this, we make FS-Cache cancel some writes that aren't in the middle of
actually being performed.  This means that some data won't make it into the
cache this time.  To support this, a new FS-Cache function is added
fscache_maybe_release_page() that replaces what the netfs releasepage()
functions used to do with respect to the cache.

The decisions fscache_maybe_release_page() makes are counted and displayed
through /proc/fs/fscache/stats on a line labelled "VmScan".  There are four
counters provided: "nos=N" - pages that weren't pending storage; "gon=N" -
pages that were pending storage when we first looked, but weren't by the time
we got the object lock; "bsy=N" - pages that we ignored as they were actively
being written when we looked; and "can=N" - pages that we cancelled the storage
of.

What I'd really like to do is alter the behaviour of the cancellation
heuristics, depending on how necessary it is to expel pages.  If there are
plenty of other pages that aren't waiting to be written to the cache that
could be ejected first, then it would be nice to hold up on immediate
cancellation of cache writes - but I don't see a way of doing that.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:35 +00:00
..
caching FS-Cache: Handle pages pending storage that get evicted under OOM conditions 2009-11-19 18:11:35 +00:00
configfs docsrc: build Documentation/ sources 2008-08-12 16:07:30 -07:00
pohmelfs Staging: Pohmelfs: Added IO permissions and priorities. 2009-04-17 11:06:30 -07:00
00-INDEX update Documentation/filesystems/00-INDEX with new nfsd related docs. 2009-04-28 12:54:45 -04:00
9p.txt 9p: Update documentation to add fscache related bits 2009-09-23 13:03:46 -05:00
adfs.txt Fix typos in /Documentation : 'U-Z' 2006-11-30 04:58:40 +01:00
affs.txt
afs.txt AFS: Documentation updates 2009-08-19 10:40:13 -07:00
autofs4-mount-control.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
automount-support.txt
befs.txt Fix typos in Documentation/: 'Q'-'R' 2006-10-03 22:54:15 +02:00
bfs.txt remove mention of CONFIG_KMOD from documentation 2008-07-22 19:24:29 +10:00
btrfs.txt Btrfs: Add Documentation/filesystem/btrfs.txt, remove old COPYING 2009-01-07 09:54:24 -05:00
cifs.txt
coda.txt
cramfs.txt
debugfs.txt Document the debugfs API 2009-06-06 10:28:14 -06:00
dentry-locking.txt
devpts.txt Document usage of multiple-instances of devpts 2009-01-02 10:19:36 -08:00
directory-locking Documentation: Fix up docs still talking about i_sem 2007-05-24 10:16:17 -07:00
dlmfs.txt Fix typos in Documentation/: 'D'-'E' 2006-10-03 22:47:42 +02:00
dnotify.txt Documentation: move dnotify.txt to filesystems/ 2008-02-07 08:42:17 -08:00
ecryptfs.txt eCryptfs: Move ecryptfs docs into Documentation/filesystems/ 2007-07-17 10:23:08 -07:00
exofs.txt exofs: Documentation 2009-03-31 19:44:38 +03:00
Exporting exportfs: update documentation 2007-10-22 08:13:21 -07:00
ext2.txt Doc fix: ext2 can only have 32,000 subdirs, not 32,768 2009-06-18 13:03:44 -07:00
ext3.txt ext3: Update documentation about ext3 quota mount options 2009-10-13 00:06:43 +02:00
ext4.txt Revert "ext4: Remove journal_checksum mount option and enable it by default" 2009-11-02 10:15:27 -08:00
fiemap.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
files.txt fix f_count description in Documentation/filesystems/files.txt 2008-12-31 18:07:42 -05:00
fuse.txt [PATCH] fuse: fix typo 2006-12-30 10:56:45 -08:00
gfs2-glocks.txt GFS2: Update docs 2009-05-19 10:23:23 +01:00
gfs2-uevents.txt GFS2: Add a document explaining GFS2's uevents 2009-08-17 11:11:41 +01:00
gfs2.txt GFS2: Update docs 2009-05-19 10:23:23 +01:00
hfs.txt
hfsplus.txt Documentation: document HFSPlus 2007-07-31 15:39:38 -07:00
hpfs.txt misc doc and kconfig typos 2007-05-09 08:58:15 +02:00
inotify.txt
isofs.txt isofs: let mode and dmode mount options override rock ridge mode setting 2009-06-18 13:03:45 -07:00
jfs.txt JFS: document uid, gid, and umask mount options in jfs.txt 2007-03-09 10:27:31 -06:00
knfsd-stats.txt Document /proc/fs/nfsd/pool_stats 2009-03-27 19:24:27 -04:00
Locking update Documentation/filesystems/Locking 2009-06-24 08:15:25 -04:00
locks.txt Documentation: move locks.txt in filesystems/ 2007-10-09 18:32:45 -04:00
mandatory-locking.txt locks: add warning about mandatory locking races 2007-10-09 18:32:45 -04:00
ncpfs.txt ncpfs: remove dead URL from documentation 2009-09-23 07:39:42 -07:00
nfs-rdma.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
nfs.txt NFS: Add a dns resolver for use with NFSv4 referrals and migration 2009-08-19 18:22:15 -04:00
nfs41-server.txt nfsd: revise 4.1 status documentation 2009-09-21 11:13:45 -04:00
nfsroot.txt trivial: fix typo "for for" in multiple files 2009-09-21 15:14:54 +02:00
nilfs2.txt nilfs2: modify list of unsupported features in caveats 2009-06-10 23:41:11 +09:00
ntfs.txt NTFS: update homepage 2008-09-02 19:21:37 -07:00
ocfs2.txt ocfs2: add mount option and Kconfig option for acl 2009-01-05 08:36:52 -08:00
omfs.txt omfs: add filesystem documentation 2008-07-26 12:00:05 -07:00
porting iget: remove iget() and the read_inode() super op as being obsolete 2008-02-07 08:42:29 -08:00
proc.txt ext4: Use tracepoints for mb_history trace file 2009-09-30 00:32:42 -04:00
quota.txt quota: documentation for sending "below quota" messages via netlink and tiny doc update 2008-08-12 16:07:27 -07:00
ramfs-rootfs-initramfs.txt Trivial Documentation/filesystems/ramfs-rootfs-initramfs.txt fix 2008-11-30 11:40:56 -08:00
relay.txt relay: add buffer-only channels; useful for early logging 2008-07-26 12:00:04 -07:00
romfs.txt
rpc-cache.txt Documentation: move rpc-cache.txt to filesystems/ 2008-04-11 13:20:52 -06:00
seq_file.txt Doc: seq_file.txt fix wrong dd command example. 2009-09-10 14:33:35 -06:00
sharedsubtree.txt doc/filesystems: more mount cleanups 2009-09-24 07:20:57 -07:00
smbfs.txt
spufs.txt Fix typos in /Documentation : 'U-Z' 2006-11-30 04:58:40 +01:00
squashfs.txt Squashfs: fix documentation typo, Cramfs filesystem limit is 256 MiB 2009-03-05 00:40:13 +00:00
sysfs-pci.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
sysfs.txt driver core: documentation: make it clear that sysfs is optional 2009-07-28 13:45:23 -07:00
sysv-fs.txt [PATCH] fs/sysv/: doc cleanup 2006-12-07 08:39:44 -08:00
tmpfs.txt hugh: update email address 2009-05-21 13:14:32 -07:00
ubifs.txt UBIFS: remove fast unmounting 2009-01-29 16:34:30 +02:00
udf.txt udf: implement mode and dmode mounting options 2009-04-02 12:29:50 +02:00
ufs.txt [PATCH] ufs2 write: mount as rw 2007-02-12 09:48:40 -08:00
vfat.txt vfat: change the default from shortname=lower to shortname=mixed 2009-08-01 21:35:25 +09:00
vfs.txt HWPOISON: Define a new error_remove_page address space op for async truncation 2009-09-16 11:50:13 +02:00
xfs.txt [XFS] remove restricted chown parameter from xfs linux 2008-10-30 18:30:09 +11:00
xip.txt DOC: update xip method info 2008-11-12 17:17:17 -08:00