activate task_hot() only for fair-scheduled tasks (i.e. disable it
for RT tasks).
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
reintroduce a simplified version of cache-hot/cold scheduling
affinity. This improves performance with certain SMP workloads,
such as sysbench.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
speed up context-switches a bit by not clearing p->exec_start.
(as a side-effect, this also makes p->exec_start a universal timestamp
available to cache-hot estimations.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Generate uevents when a user is being created/destroyed. These events
can be used to configure cpu share of a new user.
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
do not normalize kernel threads via SysRq-N: the migration threads,
softlockup threads, etc. might be essential for the system to
function properly. So only zap user tasks.
pointed out by Andi Kleen.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
remove stale comment from sched_group_set_shares().
Function never returns -EINVAL.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Replace a particularly ugly ifdef with an inline and a new macro.
Also split up the function to be easier to read.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Refactor common code of sleep_on / wait_for_completion
These functions were largely cut'n'pasted. This moves
the common code into single helpers instead. Advantage
is about 1k less code on x86-64 and 91 lines of code removed.
It adds one function call to the non timeout version of
the functions; i don't expect this to be measurable.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Prevent wakeup over-scheduling. Once a task has been preempted by a
task of the same or lower priority, it becomes ineligible for repeated
preemption by same until it has been ticked, or slept. Instead, the
task is marked for preemption at the next tick. Tasks of higher
priority still preempt immediately.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Implement feature bit to disable forced preemption. This way
it can be checked whether a workload is overscheduling or not.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The following patch (sched: disable sleeper_fairness on SCHED_BATCH)
seems to break GROUP_SCHED. Although, it may be 'oops'-less due to the
possibility of 'p' being always a valid address.
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
cache_nice_tries and flags entry do not appear in proc fs sched_domain
directory, because ctl_table entry is skipped.
This patch fixes the issue.
Signed-off-by: Zou Nan hai <nanhai.zou@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
yield() in wait_task_inactive(), can cause a high priority thread to be
scheduled back in, and there by loop forever while it is waiting for some
lower priority thread which is unfortunately still on the runqueue.
Use schedule_timeout_uninterruptible(1) instead.
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Credit: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add tunables in sysfs to modify a user's cpu share.
A directory is created in sysfs for each new user in the system.
/sys/kernel/uids/<uid>/cpu_share
Reading this file returns the cpu shares granted for the user.
Writing into this file modifies the cpu share for the user. Only an
administrator is allowed to modify a user's cpu share.
Ex:
# cd /sys/kernel/uids/
# cat 512/cpu_share
1024
# echo 2048 > 512/cpu_share
# cat 512/cpu_share
2048
#
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
disable sleeper fairness for batch tasks - they are about
batch processing after all.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
unit mis-match: wakeup_gran was used against a vruntime
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
export cpu_clock() - the preferred API instead of sched_clock().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
noticed by Peter Zijlstra:
fix: move the CPU check into ->task_new_fair(), this way we
can call place_entity() and get child ->vruntime right at
initial wakeup time.
(without this there can be large latencies)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
kfree(NULL) is valid.
pointed out by checkpatch.pl.
the fix shrinks the code a bit:
text data bss dec hex filename
40024 3842 100 43966 abbe sched.o.before
40002 3842 100 43944 aba8 sched.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
run sched_domain_debug() if CONFIG_SCHED_DEBUG=y, instead
of relying on the hand-crafted SCHED_DOMAIN_DEBUG switch.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- make timeslices of SCHED_RR tasks constant and not
dependent on task's static_prio [1] ;
- remove obsolete code (timeslice related bits);
- make sched_rr_get_interval() return something more
meaningful [2] for SCHED_OTHER tasks.
[1] according to the following link, it's not compliant with SUSv3
(not sure though, what is the reference for us :-)
http://lkml.org/lkml/2007/3/7/656
[2] the interval is dynamic and can be depicted as follows "should a
task be one of the runnable tasks at this particular moment, it would
expect to run for this interval of time before being re-scheduled by the
scheduler tick".
(i.e. it's more precise if a task is runnable at the moment)
yeah, this seems to require task_rq_lock/unlock() but this is not a hot
path.
results:
(SCHED_FIFO)
dimm@earth:~/storage/prog$ sudo chrt -f 10 ./rr_interval
time_slice: 0 : 0
(SCHED_RR)
dimm@earth:~/storage/prog$ sudo chrt 10 ./rr_interval
time_slice: 0 : 99984800
(SCHED_NORMAL)
dimm@earth:~/storage/prog$ ./rr_interval
time_slice: 0 : 19996960
(SCHED_NORMAL + a cpu_hog of similar 'weight' on the same CPU --- so should be a half of the previous result)
dimm@earth:~/storage/prog$ taskset 1 ./rr_interval
time_slice: 0 : 9998480
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
optimize schedule() a bit on SMP, by moving the rq-clock update
outside the rq lock.
code size is the same:
text data bss dec hex filename
25725 2666 96 28487 6f47 sched.o.before
25725 2666 96 28487 6f47 sched.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
The thing is that __pick_next_entity() must never be called when
first_fair(cfs_rq) == NULL. It wouldn't be a problem, should 'run_node'
be the very first field of 'struct sched_entity' (and it's the second).
The 'nr_running != 0' check is _not_ enough, due to the fact that
'current' is not within the tree. Generic paths are ok (e.g. schedule()
as put_prev_task() is called previously)... I'm more worried about e.g.
migration_call() -> CPU_DEAD_FROZEN -> migrate_dead_tasks()... if
'current' == rq->idle, no problems.. if it's one of the SCHED_NORMAL
tasks (or imagine, some other use-cases in the future -- i.e. we should
not make outer world dependent on internal details of sched_fair class)
-- it may be "Houston, we've got a problem" case.
it's +16 bytes to the ".text". Another variant is to make 'run_node' the
first data member of 'struct sched_entity' but an additional check (se !
= NULL) is still needed in pick_next_entity().
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Make vslice accurate wrt nice levels, and add some comments
while we're at it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
mark scheduling classes as const. The speeds up the code
a bit and shrinks it:
text data bss dec hex filename
40027 4018 292 44337 ad31 sched.o.before
40190 3842 292 44324 ad24 sched.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
There is a possibility that because of task of a group moving from one
cpu to another, it may gain more cpu time that desired. See
http://marc.info/?l=linux-kernel&m=119073197730334 for details.
This is an attempt to fix that problem. Basically it simulates dequeue
of higher level entities as if they are going to sleep. Similarly it
simulate wakeup of higher level entities as if they are waking up from
sleep.
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
group scheduler SMP migration fix: use task_cfs_rq(p) to get
to the relevant fair-scheduling runqueue of a task, rq->cfs
is not the right one.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
rename all 'cnt' fields and variables to the less yucky 'count' name.
yuckage noticed by Andrew Morton.
no change in code, other than the /proc/sched_debug bkl_count string got
a bit larger:
text data bss dec hex filename
38236 3506 24 41766 a326 sched.o.before
38240 3506 24 41770 a32a sched.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
fix yield bugs due to the current-not-in-rbtree changes: the task is
not in the rbtree so rbtree-removal is a no-no.
[ From: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>: build fix. ]
also, nice code size reduction:
kernel/sched.o:
text data bss dec hex filename
38323 3506 24 41853 a37d sched.o.before
38236 3506 24 41766 a326 sched.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>