During the development of an embedded Linux based product it’s not uncommon to reach a point where the features of the product are seemingly demanding too much from the CPU. When this happens and features start to stutter, the first port of call is often ‘top’ – this often results in the dreaded realization that the system is at or near 0% idle time. At this point it’s very easy to be misled by the myriad of statistics output by performance measuring utilities such as ‘top’, which often leads to optimization efforts that have little effect.
This post explores the I/O wait statistic and demonstrates how a system with 0% idle time doesn’t mean you’re demanding too much CPU.
Let’s start with the assumption that a CPU core can exist in one of two states: executing an instruction (busy) or doing nothing useful (idle). This gives us the following starting point:
Time Busy + Time Idle = 100%
The statistics output from utilities such as ‘top’ generally provide additional detail for the ‘Time Busy’ and ‘Time Idle’ figures in the above equation. For example, consider the busybox ‘top’ output from an embedded board:
In this case, ‘top’ has broken down ‘Time Busy’ into the sum of time spent by user tasks ‘usr’, kernel tasks ‘sys’, user tasks that have been nice’d ‘nic’ and time spent handling interrupts ‘irq’/’sirq’. Most importantly ‘top’ has broken down the ‘Time Idle’ into the sum of time taken doing nothing ‘idle’ and time taken doing nothing whilst there were I/O transactions ongoing ‘io’. Let’s expand our equation:
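Time Busy (‘usr’ + ‘sys’ + ‘nic’ + ‘irq’ + ‘sirq’) + Time Idle (‘idle’ + ‘io’) = 100%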
The important thing to note here is that I/O wait time ‘io’ is idle time – the CPU has nothing useful to do. To establish a definition of I/O wait time and illustrate this point we will briefly explore the kernel sources.
Userspace tools such as ‘top’ obtain their figures for I/O wait and idle time by reading counter values from the kernel, one source of this information is the /proc/stat file. The implementation for this file is provided in fs/proc/stat.c – if you follow the source you’ll find that the kernel keeps counters for I/O wait and idle time in ‘tick_sched‘ structure fields named ‘iowait_sleeptime‘ and ‘idle_sleeptime‘. These fields are written in a very interesting function in the kernel/time/tick-sched.c file as follows:
static void
update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
{
    ktime_t delta;

    if (ts->idle_active) {
        delta = ktime_sub(now, ts->idle_entrytime);
        if (nr_iowait_cpu(cpu) > 0)
            ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
        else
            ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
        ts->idle_entrytime = now;
    }

    if (last_update_time)
        *last_update_time = ktime_to_us(now);
}
This function shows that when the processor is idle (ts->idle_active is set) the idle time is accounted for in either ‘iowait_sleeptime’ or ‘idle_sleeptime’ depending on the value returned from ‘nr_iowait_cpu’. In other words, I/O wait time is time that would otherwise have been accounted as idle time were it not for pending I/O transactions – in terms of our earlier equation, Time Idle = ‘idle’ + ‘io’.
The ‘nr_iowait_cpu’ function simply provides the value of the per-CPU ‘nr_iowait’ counter, which is incremented and decremented in the following function in the kernel/sched/core.c file:
/*
 * This task is about to go to sleep on IO. Increment rq->nr_iowait so
 * that process accounting knows that this is a task in IO wait state.
 */
long __sched io_schedule_timeout(long timeout)
{
    int old_iowait = current->in_iowait;
    struct rq *rq;
    long ret;

    current->in_iowait = 1;
    blk_schedule_flush_plug(current);

    delayacct_blkio_start();
    rq = raw_rq();
    atomic_inc(&rq->nr_iowait);
    ret = schedule_timeout(timeout);
    current->in_iowait = old_iowait;
    atomic_dec(&rq->nr_iowait);
    delayacct_blkio_end();

    return ret;
}
EXPORT_SYMBOL(io_schedule_timeout);
The function indicates that when a process can no longer run because it is waiting for I/O, the ‘nr_iowait’ count is incremented. Thus the ‘nr_iowait’ variable counts the number of processes on the current CPU that are not running because they are waiting for I/O to complete – for example waiting for a disk to provide data. This function is used throughout the kernel whenever a task needs to sleep whilst waiting for I/O.
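For illustration, here is a minimal sketch of how a hypothetical block driver might put a task to sleep in a way that is accounted as I/O wait – the ‘mydev’ names are made up for this example, but wait_for_completion_io() is a real kernel helper that ends up sleeping via io_schedule_timeout(), so any CPU idle time spent here shows up as ‘io’ rather than ‘idle’:

#include <linux/completion.h>

/* Hypothetical driver context – the names here are illustrative only */
struct mydev_request {
    struct completion done;    /* completed from the device's IRQ handler */
};

static int mydev_read_block(struct mydev_request *req)
{
    init_completion(&req->done);

    /* ...submit the request to the hardware here... */

    /*
     * Sleep until the hardware signals completion. The _io variant of the
     * wait causes rq->nr_iowait to be incremented while we sleep, so an
     * otherwise idle CPU accounts this period as I/O wait ('io') time.
     */
    wait_for_completion_io(&req->done);

    return 0;
}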
Our exploration of the kernel sources now allows us to provide a definition for I/O wait: I/O wait is the amount of CPU idle time during which there was at least one process blocked on I/O.
Now that we understand the theory we can illustrate I/O wait time and its relationship to idle and busy time through some experiments on an embedded board. We will start by booting a board and setting the init process to /bin/sh – this results in a system that isn’t doing very much. Let’s look at the output from ‘top’:
As expected the system is mostly doing nothing – this is illustrated by an idle time of 99%.
We can now put the system under load and will do this with the following command:
/ $ dd if=/dev/urandom of=/dev/null
Let’s take another look at our ‘top’ output:
As expected the CPU is now fully utilized; in this case it’s spending all of its time in the kernel – most likely generating random numbers for our ‘dd’ process.
Now that we’ve mastered the basics, let’s throw I/O into the mix. This time we will see the effect of reading data from an MMC card with the following command:
/ $ dd if=/dev/mmcblk0 of=/dev/null
Let’s take another look at our ‘top’ output:
This time we see that for 4% of the time the CPU is doing nothing whilst waiting for I/O to complete. There is no idle time (‘idle’) as the sole task in this system, ‘dd’, is using as much CPU as it can requesting data from the SD card and providing it to /dev/null; when it can’t do that it is because it’s blocked on I/O. As ‘dd’ is single threaded it can only do one thing at a time, so it reads and then copies (to /dev/null) sequentially.
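To make that blocking behaviour concrete, here is a rough userspace sketch of the pattern ‘dd’ follows – the buffer size is arbitrary and error handling has been trimmed, but the structure is the important part:

#include <fcntl.h>
#include <unistd.h>

/*
 * Roughly what a single-threaded copy does: each read() must complete
 * before the following write() can start, so while the read is blocked on
 * the SD card the process has nothing else to do and, if no other task is
 * runnable, the CPU sits in I/O wait.
 */
int main(void)
{
    char buf[4096];
    ssize_t n;
    int in = open("/dev/mmcblk0", O_RDONLY);
    int out = open("/dev/null", O_WRONLY);

    if (in < 0 || out < 0)
        return 1;

    while ((n = read(in, buf, sizeof(buf))) > 0)  /* blocks on I/O */
        write(out, buf, (size_t)n);               /* then copies */

    close(in);
    close(out);
    return 0;
}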
To make things a little more interesting we’ve modified our MMC driver to reduce its clock rate. This has the effect of I/O transactions taking longer; let’s repeat the same test and view the ‘top’ output:
As you may have expected, I/O wait time has increased – this is because the kernel is spending more of its time waiting for MMC I/O to complete.
You may be interested to know what would happen if the I/O transaction itself were a CPU-intensive operation – we can test this by adding a busy-wait delay in the MMC driver prior to its call to mmc_request_done. This has the effect of using CPU cycles during each I/O operation whilst the userspace process is blocked.
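The exact change depends on the host controller driver in use, but as a sketch it amounts to something like the following in the driver’s completion path – the function name, delay length and variable names are illustrative; only mdelay() and mmc_request_done() are real kernel APIs:

#include <linux/delay.h>
#include <linux/mmc/core.h>
#include <linux/mmc/host.h>

/*
 * Hypothetical completion path in an MMC host controller driver (typically
 * run from tasklet/softirq context). mdelay() busy-waits rather than sleeps,
 * so this time is reported as busy ('sirq') time, not I/O wait ('io') time.
 */
static void myhost_finish_request(struct mmc_host *mmc, struct mmc_request *mrq)
{
    mdelay(5);                   /* artificial busy-wait: ~5ms of pure CPU use */
    mmc_request_done(mmc, mrq);  /* hand the completed request back to the core */
}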
As you can see I/O wait time has reduced to 58%, idle time is still 0% and the ‘sirq’ time has increased to 32%. This is time where the CPU was busy executing instructions in the context of handling soft IRQs – the context where we added our delay. This illustrates that regardless of the reason CPU cycles are consumed, they are always shown as busy time and never included in I/O wait time – thus I/O wait time is idle time.
You may have noticed that the previous tests all show 0% idle time and some I/O wait time. Let’s see what happens if we add some more load to the CPU – this time we’ll run our original ‘dd‘ test as a low priority process:
/ $ nice -n 19 dd if=/dev/urandom of=/dev/null &
/ $ dd if=/dev/mmcblk0 of=/dev/null
And the ‘top’ output:
This time the low priority ‘dd’ task has used up the I/O wait idle time that we previously saw when ‘dd’ was waiting for I/O to complete. The CPU is now completely loaded and there is no idle time left as both ‘idle’ and ‘io’ are 0%.
There is one more test we can perform to demonstrate that I/O wait time is idle time. We’ll start by measuring the real-world time it takes to transfer data from our MMC card and the time it takes to generate a set amount of random numbers.
/ $ time nice -n 20 dd if=/dev/urandom of=/dev/null bs=1M count=5
5+0 records in
5+0 records out
real 0m 3.09s
user 0m 0.00s
sys 0m 3.08s
/ $ time dd if=/dev/mmcblk0 of=/dev/null bs=1M count=20
20+0 records in
20+0 records out
real 0m 19.78s
user 0m 0.01s
sys 0m 0.35s
As you can see we read 20MB of data from the SD card in ~19 seconds and generated 5MB of random numbers in ~3 seconds. Now let’s perform both operations at the same time and see what happens.
/ $ time dd if=/dev/mmcblk0 of=/dev/null bs=1M count=20 & time nice -n 20 dd if=/dev/urandom of=/dev/null bs=1M count=5 &
/ # 5+0 records in
5+0 records out
real 0m 18.13s
user 0m 0.00s
sys 0m 17.69s
20+0 records in
20+0 records out
real 0m 19.65s
user 0m 0.01s
sys 0m 0.36s
We can see that the SD card read operation took roughly the same amount of time, ~19 seconds. This is interesting as we know that when this is the sole task running, the system shows 0% ‘idle’ time and some ‘io’ time. However, in this case we also generated random numbers as a low priority task – this task started and stopped whilst the SD card operation was still running. We can see that the random number task was able to complete its workload whilst the SD task was running without having any impact on its performance – this demonstrates that there were enough CPU cycles free to perform this task, and we can deduce that those cycles were used whilst we were waiting for I/O. This shows again that I/O wait time is time that can be spent on the CPU.
The lesson here is that if you are debugging performance issues on a system that shows 0% idle time and a non-zero amount of I/O wait time, then the issue you are debugging is not a lack of free CPU cycles. If you are investigating CPU load then it may be more appropriate to consider idle time as ‘idle’ + ‘io’.
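As a minimal sketch of how you might monitor this yourself – assuming the /proc/stat field order documented in proc(5) (user, nice, system, idle, iowait, irq, softirq, …) – the following reads the aggregate counters and reports ‘idle’ + ‘io’ as the CPU time genuinely available for more work; a real monitor would sample twice and work with the deltas, since the counters are cumulative:

#include <stdio.h>

int main(void)
{
    unsigned long long user, nice, sys, idle, iowait, irq, softirq;
    FILE *f = fopen("/proc/stat", "r");

    if (!f)
        return 1;

    /* First line of /proc/stat: aggregate jiffies across all CPUs */
    if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
               &user, &nice, &sys, &idle, &iowait, &irq, &softirq) == 7) {
        unsigned long long busy = user + nice + sys + irq + softirq;
        unsigned long long available = idle + iowait;   /* 'idle' + 'io' */

        printf("busy=%llu available(idle+io)=%llu jiffies\n", busy, available);
    }

    fclose(f);
    return 0;
}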
First Publication date: 08/31/2015 by Embedded Bits