Commit e33a9bb

htejun authored and Ingo Molnar committed
sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler
For an interface to support blocking for IOs, it must call io_schedule() instead of schedule(). This makes it tedious to add IO blocking to existing interfaces as the switching between schedule() and io_schedule() is often buried deep.

As we already have a way to mark the task as IO scheduling, this can be made easier by separating out io_schedule() into multiple steps so that IO schedule preparation can be performed before invoking a blocking interface and the actual accounting happens inside the scheduler.

io_schedule_timeout() does the following three things prior to calling schedule_timeout().

 1. Mark the task as scheduling for IO.
 2. Flush out plugged IOs.
 3. Account the IO scheduling.

#1 and #2 can be performed in the preparation step while #3 must be done close to the actual scheduling. This patch moves #3 into the scheduler so that later patches can separate out preparation and finish steps from io_schedule().

Patch-originally-by: Peter Zijlstra <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
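The split described in the commit message can be illustrated with a small userspace model. This is only a sketch, not kernel code: `toy_rq`, `toy_task`, `toy_sched_out()`, `toy_wake_up()`, `toy_io_sleep()` and `toy_nr_iowait()` are hypothetical names invented for illustration, the delay-accounting calls are left out, and "sleeping" is modelled as an immediate sleep followed by a wakeup. It shows the caller doing only steps 1 and 2, while the counter update (step 3) lives in the scheduler-side hooks and is undone on the runqueue the task slept on, even if it wakes on another CPU.

```c
#include <assert.h>

#define TOY_NR_CPUS 2

/* Hypothetical stand-ins for the kernel's struct rq and struct task_struct. */
struct toy_rq {
	int nr_iowait;		/* tasks sleeping on IO, dequeued from this rq */
};

struct toy_task {
	int in_iowait;		/* step 1: set by the caller before blocking */
	int cpu;		/* runqueue the task currently belongs to */
};

static struct toy_rq toy_rqs[TOY_NR_CPUS];

/* Scheduler side: account when the task is dequeued to sleep. */
static void toy_sched_out(struct toy_task *p)
{
	if (p->in_iowait)
		toy_rqs[p->cpu].nr_iowait++;
}

/*
 * Scheduler side: undo the accounting on the runqueue the task slept on,
 * before a (possibly different) CPU is chosen for it.
 */
static void toy_wake_up(struct toy_task *p, int wake_cpu)
{
	if (p->in_iowait) {
		assert(toy_rqs[p->cpu].nr_iowait > 0);
		toy_rqs[p->cpu].nr_iowait--;
	}
	p->cpu = wake_cpu;
}

/*
 * Caller side: only marks the task and (notionally) flushes plugged IO.
 * Returns the nr_iowait value seen on the sleeping CPU while "asleep".
 */
int toy_io_sleep(struct toy_task *p, int wake_cpu)
{
	int old_iowait = p->in_iowait;
	int sleep_cpu = p->cpu;
	int seen;

	p->in_iowait = 1;	/* step 1: mark the task as waiting for IO */
	/* step 2: flushing plugged IO would happen here */
	toy_sched_out(p);	/* step 3 now happens inside the "scheduler" */
	seen = toy_rqs[sleep_cpu].nr_iowait;
	toy_wake_up(p, wake_cpu);
	p->in_iowait = old_iowait;
	return seen;
}

/* Sum of the per-runqueue counters, in the spirit of nr_iowait(). */
int toy_nr_iowait(void)
{
	int i, sum = 0;

	for (i = 0; i < TOY_NR_CPUS; i++)
		sum += toy_rqs[i].nr_iowait;
	return sum;
}
```

Calling `toy_io_sleep()` on a task that starts on CPU 0 and wakes on CPU 1 leaves every counter balanced, because the decrement uses the task's old CPU before it is reassigned.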
1 parent b8fd842 commit e33a9bb

1 file changed: +61 -7 lines changed

kernel/sched/core.c

Lines changed: 61 additions & 7 deletions
@@ -2089,11 +2089,24 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	p->sched_contributes_to_load = !!task_contributes_to_load(p);
 	p->state = TASK_WAKING;

+	if (p->in_iowait) {
+		delayacct_blkio_end();
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
+
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
 		set_task_cpu(p, cpu);
 	}
+
+#else /* CONFIG_SMP */
+
+	if (p->in_iowait) {
+		delayacct_blkio_end();
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
+
 #endif /* CONFIG_SMP */

 	ttwu_queue(p, cpu, wake_flags);
@@ -2143,8 +2156,13 @@ static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf)

 	trace_sched_waking(p);

-	if (!task_on_rq_queued(p))
+	if (!task_on_rq_queued(p)) {
+		if (p->in_iowait) {
+			delayacct_blkio_end();
+			atomic_dec(&rq->nr_iowait);
+		}
 		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
+	}

 	ttwu_do_wakeup(rq, p, 0, rf);
 	ttwu_stat(p, smp_processor_id(), 0);
@@ -2956,6 +2974,36 @@ unsigned long long nr_context_switches(void)
 	return sum;
 }

+/*
+ * IO-wait accounting, and how its mostly bollocks (on SMP).
+ *
+ * The idea behind IO-wait account is to account the idle time that we could
+ * have spend running if it were not for IO. That is, if we were to improve the
+ * storage performance, we'd have a proportional reduction in IO-wait time.
+ *
+ * This all works nicely on UP, where, when a task blocks on IO, we account
+ * idle time as IO-wait, because if the storage were faster, it could've been
+ * running and we'd not be idle.
+ *
+ * This has been extended to SMP, by doing the same for each CPU. This however
+ * is broken.
+ *
+ * Imagine for instance the case where two tasks block on one CPU, only the one
+ * CPU will have IO-wait accounted, while the other has regular idle. Even
+ * though, if the storage were faster, both could've ran at the same time,
+ * utilising both CPUs.
+ *
+ * This means, that when looking globally, the current IO-wait accounting on
+ * SMP is a lower bound, by reason of under accounting.
+ *
+ * Worse, since the numbers are provided per CPU, they are sometimes
+ * interpreted per CPU, and that is nonsensical. A blocked task isn't strictly
+ * associated with any one particular CPU, it can wake to another CPU than it
+ * blocked on. This means the per CPU IO-wait number is meaningless.
+ *
+ * Task CPU affinities can make all that even more 'interesting'.
+ */
+
 unsigned long nr_iowait(void)
 {
 	unsigned long i, sum = 0;
@@ -2966,6 +3014,13 @@ unsigned long nr_iowait(void)
 	return sum;
 }

+/*
+ * Consumers of these two interfaces, like for example the cpufreq menu
+ * governor are using nonsensical data. Boosting frequency for a CPU that has
+ * IO-wait which might not even end up running the task when it does become
+ * runnable.
+ */
+
 unsigned long nr_iowait_cpu(int cpu)
 {
 	struct rq *this = cpu_rq(cpu);
@@ -3377,6 +3432,11 @@ static void __sched notrace __schedule(bool preempt)
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 			prev->on_rq = 0;

+			if (prev->in_iowait) {
+				atomic_inc(&rq->nr_iowait);
+				delayacct_blkio_start();
+			}
+
 			/*
 			 * If a worker went to sleep, notify and ask workqueue
 			 * whether it wants to wake up a task to maintain
@@ -5075,19 +5135,13 @@ EXPORT_SYMBOL_GPL(yield_to);
 long __sched io_schedule_timeout(long timeout)
 {
 	int old_iowait = current->in_iowait;
-	struct rq *rq;
 	long ret;

 	current->in_iowait = 1;
 	blk_schedule_flush_plug(current);

-	delayacct_blkio_start();
-	rq = raw_rq();
-	atomic_inc(&rq->nr_iowait);
 	ret = schedule_timeout(timeout);
 	current->in_iowait = old_iowait;
-	atomic_dec(&rq->nr_iowait);
-	delayacct_blkio_end();

 	return ret;
 }
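
The "lower bound" argument in the new comment above nr_iowait() can be checked with a little arithmetic. The sketch below is a userspace illustration under simplifying assumptions that are not in the patch: every CPU is idle for the same number of ticks, a CPU's idle ticks count as IO-wait only if its own counter is non-zero, and "potential" runtime is taken to be one CPU's worth of ticks per blocked task, capped at the number of CPUs. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Idle ticks classified as IO-wait when each CPU looks only at its own counter. */
unsigned long toy_accounted_iowait(const int *nr_iowait, size_t ncpus,
				   unsigned long idle_ticks)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < ncpus; i++)
		if (nr_iowait[i] > 0)
			sum += idle_ticks;
	return sum;
}

/*
 * Ticks that could have been spent running had the IO completed instantly:
 * each blocked task could occupy one CPU, up to the number of CPUs.
 */
unsigned long toy_potential_runtime(const int *nr_iowait, size_t ncpus,
				    unsigned long idle_ticks)
{
	unsigned long blocked = 0;
	size_t i;

	for (i = 0; i < ncpus; i++)
		blocked += (unsigned long)nr_iowait[i];
	if (blocked > ncpus)
		blocked = ncpus;
	return blocked * idle_ticks;
}

/* The accounted figure never exceeds the potential one in this model. */
void toy_check_lower_bound(const int *nr_iowait, size_t ncpus,
			   unsigned long idle_ticks)
{
	assert(toy_accounted_iowait(nr_iowait, ncpus, idle_ticks) <=
	       toy_potential_runtime(nr_iowait, ncpus, idle_ticks));
}
```

With two tasks blocked on CPU 0 of a two-CPU system and 100 idle ticks per CPU, the per-CPU rule accounts 100 ticks of IO-wait while up to 200 ticks could have been used, which is the under-accounting the comment describes.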
