aha/fs/proc
Hidetoshi Seto 0cf55e1ec0 sched, cputime: Introduce thread_group_times()
This is a real fix for problem of utime/stime values decreasing
described in the thread:

   http://lkml.org/lkml/2009/11/3/522

Now cputime is accounted in the following way:

 - {u,s}time in task_struct are increased every time when the thread
   is interrupted by a tick (timer interrupt).

 - When a thread exits, its {u,s}time are added to signal->{u,s}time,
   after adjusted by task_times().

 - When all threads in a thread_group exits, accumulated {u,s}time
   (and also c{u,s}time) in signal struct are added to c{u,s}time
   in signal struct of the group's parent.

So {u,s}time in task struct are "raw" tick count, while
{u,s}time and c{u,s}time in signal struct are "adjusted" values.

And accounted values are used by:

 - task_times(), to get cputime of a thread:
   This function returns adjusted values that originates from raw
   {u,s}time and scaled by sum_exec_runtime that accounted by CFS.

 - thread_group_cputime(), to get cputime of a thread group:
   This function returns sum of all {u,s}time of living threads in
   the group, plus {u,s}time in the signal struct that is sum of
   adjusted cputimes of all exited threads belonged to the group.

The problem is the return value of thread_group_cputime(),
because it is mixed sum of "raw" value and "adjusted" value:

  group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

This misbehavior can break {u,s}time monotonicity.
Assume that if there is a thread that have raw values greater
than adjusted values (e.g. interrupted by 1000Hz ticks 50 times
but only runs 45ms) and if it exits, cputime will decrease (e.g.
-5ms).

To fix this, we could do:

  group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

But task_times() contains hard divisions, so applying it for
every thread should be avoided.

This patch fixes the above problem in the following way:

 - Modify thread's exit (= __exit_signal()) not to use task_times().
   It means {u,s}time in signal struct accumulates raw values instead
   of adjusted values.  As the result it makes thread_group_cputime()
   to return pure sum of "raw" values.

 - Introduce a new function thread_group_times(*task, *utime, *stime)
   that converts "raw" values of thread_group_cputime() to "adjusted"
   values, in same calculation procedure as task_times().

 - Modify group's exit (= wait_task_zombie()) to use this introduced
   thread_group_times().  It make c{u,s}time in signal struct to
   have adjusted values like before this patch.

 - Replace some thread_group_cputime() by thread_group_times().
   This replacements are only applied where conveys the "adjusted"
   cputime to users, and where already uses task_times() near by it.
   (i.e. sys_times(), getrusage(), and /proc/<PID>/stat.)

This patch have a positive side effect:

 - Before this patch, if a group contains many short-life threads
   (e.g. runs 0.9ms and not interrupted by ticks), the group's
   cputime could be invisible since thread's cputime was accumulated
   after adjusted: imagine adjustment function as adj(ticks, runtime),
     {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
   After this patch it will not happen because the adjustment is
   applied after accumulated.

v2:
 - remove if()s, put new variables into signal_struct.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Spencer Candland <spencer@bluehost.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
LKML-Reference: <4B162517.8040909@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-02 17:32:40 +01:00
..
array.c sched, cputime: Introduce thread_group_times() 2009-12-02 17:32:40 +01:00
base.c fs/proc/base.c: fix proc_fault_inject_write() input sanity check 2009-09-23 07:39:40 -07:00
cmdline.c proc: switch /proc/cmdline to seq_file 2008-10-23 14:29:04 +04:00
cpuinfo.c proc: move /proc/cpuinfo code to fs/proc/cpuinfo.c 2008-10-23 15:05:11 +04:00
devices.c proc: move /proc/devices code to fs/proc/devices.c 2008-10-23 15:02:18 +04:00
generic.c proc 1/2: do PDE usecounting even for ->read_proc, ->write_proc 2009-03-31 01:14:27 +04:00
inode.c proc 2/2: remove struct proc_dir_entry::owner 2009-03-31 01:14:44 +04:00
internal.h Move junk from proc_fs.h to fs/proc/internal.h 2009-06-11 21:36:01 -04:00
interrupts.c proc: move /proc/interrupts boilerplate code to fs/proc/interrupts.c 2008-10-23 15:15:46 +04:00
Kconfig proc: move PROC_PAGE_MONITOR to fs/proc/Kconfig 2008-10-10 04:18:57 +04:00
kcore.c fs: includecheck fix: proc, kcore.c 2009-10-08 07:36:38 -07:00
kmsg.c proc: move /proc/kmsg creation to fs/proc/kmsg.c 2008-10-23 14:35:08 +04:00
loadavg.c sched, timers: cleanup avenrun users 2009-05-15 15:32:45 +02:00
Makefile proc: export statistics for softirq to /proc 2009-06-18 13:03:41 -07:00
meminfo.c hwpoison: fix/proc/meminfo alignment 2009-10-29 07:39:25 -07:00
mmu.c fs/proc/mmu.c: headers butchery 2007-10-17 08:42:48 -07:00
nommu.c seq_file: constify seq_operations 2009-09-23 07:39:29 -07:00
page.c pagemap: export KPF_HWPOISON 2009-10-08 07:36:39 -07:00
proc_devtree.c procfs: remove sparse errors in proc_devtree.c 2009-06-18 13:03:41 -07:00
proc_net.c proc: stop using BKL 2009-01-05 12:27:44 +03:00
proc_sysctl.c sysctl: remove "struct file *" argument of ->proc_handler 2009-09-24 07:21:04 -07:00
proc_tty.c proc tty: remove struct tty_operations::read_proc 2009-04-01 08:59:10 -07:00
root.c Convert obvious places to deactivate_locked_super() 2009-05-09 10:49:40 -04:00
softirqs.c proc: export statistics for softirq to /proc 2009-06-18 13:03:41 -07:00
stat.c sched, cpuacct: Fix niced guest time accounting 2009-10-25 17:31:30 +01:00
task_mmu.c procfs: provide stack information for threads 2009-09-23 07:39:41 -07:00
task_nommu.c mm_for_maps: shift down_read(mmap_sem) to the caller 2009-08-10 20:48:32 +10:00
uptime.c [PATCH] Fix idle time field in /proc/uptime 2009-09-24 10:16:24 +02:00
version.c proc: switch /proc/version to seq_file 2008-10-23 14:19:58 +04:00
vmcore.c proc: vmcore - use kzalloc in get_new_element() 2009-06-18 13:03:41 -07:00