Building a dynamic QOS with eBPF and Cgroups

In my previous posts I showed how the SolidFire array implements QOS on a per-volume basis. While looking for a new project, I asked myself: could I write a similar QOS program that runs on a virtual machine?

I wondered whether I could dynamically change the Linux built-in QOS (throttling), which is implemented via cgroups, based on a SolidFire QOS curve.

To get this working I would need to look at the I/O rate entering the block layer in the Linux kernel. Writing a kernel module is not something I could do, given I’m not great at C programming. However, I had contributed to the Iovisor BCC project a few years ago, and I thought it was exactly the tool set to help me achieve this. For further information see https://github.com/iovisor/bcc

Methodology

  • Trace IOPS and bandwidth for each disk entering the block layer.
    • Using eBPF
  • Return these to user space and work out the average I/O size and QOS.
    • Using eBPF and Python
  • Apply QOS settings based on the read and write average I/O size.
    • Write "maj:min qos" to blkio.throttle.read_iops_device and blkio.throttle.write_iops_device
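
The user-space half of the second step can be sketched in plain Python. This is only an illustration, not the code from qos.py: the DiskSample class and the summarise function are hypothetical names, and the counters are assumed to arrive from the eBPF side as a request count and a byte total per disk and direction. Integer division is used so the figures truncate in the same way the tool output later in this post appears to.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    """Counters for one disk and one direction over an interval (hypothetical)."""
    ios: int      # number of I/O requests seen in the interval
    nbytes: int   # total bytes carried by those requests


def summarise(sample: DiskSample, interval_secs: int = 1) -> dict:
    """Turn raw counters into IOPS, MB/s and average I/O size in KB."""
    if sample.ios == 0:
        # Nothing was issued, so there is no average size to report.
        return {"iops": 0, "mb_s": 0, "avg_kb": 0}
    return {
        "iops": sample.ios // interval_secs,
        "mb_s": sample.nbytes // (1024 * 1024) // interval_secs,
        "avg_kb": sample.nbytes // sample.ios // 1024,
    }


# One second of 8KB reads at 25000 IOPS.
reads = DiskSample(ios=25000, nbytes=25000 * 8 * 1024)
print(summarise(reads))
```

The average I/O size is the only value the QOS curve needs; IOPS and MB/s are there for display.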

A Quick SolidFire QOS Overview

QOS on the SolidFire works by rate limiting each volume's I/O coming into the array. The limit is based on the average I/O size, normalised against the cost of a 4KB I/O.

If we set a disk's maximum QOS to 40K IOPS based on a 4KB I/O size, the SolidFire QOS curve gives us 25K IOPS at 8KB.

root:~/bcc/tools# ./table.py 40000 40000 40000 |more
IOSIZE   COST     MINIOPS  MAXIOPS  BIOPS    MINMB    MAXMB    BURSTMB
4        100      40000    40000    40000    156      156      156
5        115      34783    34783    34783    170      170      170
6        130      30769    30769    30769    180      180      180
7        145      27586    27586    27586    189      189      189
8        160      25000    25000    25000    195      195      195
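
The figures in these rows are consistent with dividing the 4KB limit by the cost and scaling by 100. The sketch below reproduces that arithmetic using only cost values that appear in the table.py output in this post. It does not attempt the rest of the curve, and the function names are my own rather than anything from table.py.

```python
# Cost per I/O size (KB), copied from the table.py rows shown in this post.
COST = {4: 100, 5: 115, 6: 130, 7: 145, 8: 160, 32: 500}


def scaled_iops(limit_at_4k: int, io_kb: int) -> int:
    """IOPS allowed at io_kb for a limit expressed at 4KB (cost 100)."""
    return round(limit_at_4k * 100 / COST[io_kb])


def throughput_mb(iops: int, io_kb: int) -> int:
    """Throughput in MB/s for a given IOPS figure and I/O size in KB."""
    return round(iops * io_kb / 1024)


for size in sorted(COST):
    iops = scaled_iops(40000, size)
    print(size, COST[size], iops, throughput_mb(iops, size))
```

For example, 8KB has a cost of 160, so a 40000 IOPS limit becomes 40000 × 100 / 160 = 25000 IOPS, or about 195MB/s.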

SolidFire uses a total IOPS figure for its QOS, whereas Linux cgroups allow read and write throttling to be set separately. This is much more flexible, so in my example I apply the QOS to reads and writes independently.

Testing

For testing QOS I used fio, which is available at https://github.com/axboe/fio, as it allows great flexibility. The first test I wanted to do was some random reads to see what I could expect from my VM.

Running the following fio random read test, we can see from the iostat data that we are getting over 40K IOPS at 8KB.

; fio-rand-read.job for fiotest

[global]
name=fio-rand-read
filename=fio-rand-read
rw=randread
bs=8K
direct=1
numjobs=1
time_based=1
runtime=30

[file1]
size=3G
ioengine=libaio
iodepth=16

Here is a sample of the iostat output, showing ~47K IOPS and ~366MB/sec.

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda           46805.00    0.00    365.66      0.00     0.00     0.00   0.00   0.00    0.32    0.00  14.95     8.00     0.00   0.02 100.00
sda           47065.00    0.00    367.70      0.00     0.00     0.00   0.00   0.00    0.31    0.00  14.82     8.00     0.00   0.02 100.00
sda           46958.00    0.00    366.86      0.00     0.00     0.00   0.00   0.00    0.32    0.00  14.82     8.00     0.00   0.02 100.00
sda           47319.00   17.00    369.68      0.07     0.00     0.00   0.00   0.00    0.32    0.00  15.02     8.00     4.00   0.02 100.00
sda           46877.00    0.00    366.22      0.00     0.00     0.00   0.00   0.00    0.32    0.00  14.84     8.00     0.00   0.02 100.00

As shown below, for a 40K IOPS maximum we get 25K max IOPS at 8KB.

root:~/bcc/tools# ./table.py 20000 40000 50000
IOSIZE   COST     MINIOPS  MAXIOPS  BIOPS    MINMB    MAXMB    BURSTMB
4        100      20000    40000    50000    78       156      195
8        160      12500    25000    31250    98       195      244

Setup

To set up our QOS for each disk we enter the major:minor numbers and max IOPS into a file named qos_setup. An example is below.

root:~/bcc/tools# cat qos_setup
8:0 40000
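
Reading this file is straightforward. The sketch below assumes one "major:minor max_iops" entry per line, which matches the single example above. Skipping blank lines and lines starting with "#" is my own assumption, and parse_qos_setup is a hypothetical name rather than a function from qos.py.

```python
def parse_qos_setup(text: str) -> dict:
    """Map (major, minor) device numbers to the max IOPS at 4KB."""
    limits = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and '#' comments are ignored.
        if not line or line.startswith("#"):
            continue
        device, max_iops = line.split()
        major, minor = device.split(":")
        limits[(int(major), int(minor))] = int(max_iops)
    return limits


print(parse_qos_setup("8:0 40000\n"))
```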

The first thing we need to do is make sure the process and its children can be controlled via cgroups, so let’s add the shell’s PID to the blkio tasks file.

root:/tmp# echo $$ > /sys/fs/cgroup/blkio/tasks
root:/tmp# fio ./randr

Looking at the QOS tooling output below, we can see that after 10 seconds the tool checks the current average I/O size and sets the QOS values.

root:~/bcc/tools# ./qos.py
Tracing... Output every 1 secs. Hit Ctrl-C to end
TIME     DISK  RIOPS    R MB/s   R_AvgIO  R_QOS    WIOPS    W_AvgIO  W MB/s   W_QOS
17:28:52 ???   0        0        0        0        0        0        0        0
17:28:53 ???   0        0        0        0        0        0        0        0
17:28:54 ???   0        0        0        0        0        0        0        0
17:28:55 sda   33612    262      8        25000    0        0        0        0
17:28:56 sda   44305    346      8        25000    0        0        0        0
17:28:57 sda   32150    251      8        25000    0        0        0        0
17:28:58 sda   8471     66       8        25000    0        0        0        0
17:28:59 sda   17276    134      8        25000    0        0        0        0
17:29:00 sda   20492    160      8        25000    0        0        0        0
17:29:01 sda   25000    195      8        25000    0        0        0        0
17:29:02 sda   25047    195      8        25000    0        0        0        0
17:29:03 sda   25045    195      8        25000    0        0        0        0
17:29:04 sda   25096    196      8        25000    0        0        0        0
17:29:05 sda   25043    195      8        25000    3        4        0        40000
17:29:06 sda   24987    195      8        25000    0        0        0        0
17:29:07 sda   25000    195      8        25000    0        0        0        0
17:29:08 sda   25000    195      8        25000    0        0        0        0
17:29:09 sda   25054    195      8        25000    0        0        0        0
17:29:10 sda   25066    195      8        25000    0        0        0        0
17:29:11 sda   25037    195      8        25000    0        0        0        0
17:29:12 sda   25064    195      8        25000    0        0        0        0
17:29:13 sda   24993    195      8        25000    0        0        0        0
17:29:14 sda   25000    195      8        25000    0        0        0        0
17:29:15 sda   25016    195      8        25000    0        0        0        0
17:29:16 sda   25068    195      8        25000    0        0        0        0
17:29:17 sda   25049    195      8        25000    0        0        0        0
17:29:18 sda   25069    195      8        25000    0        0        0        0
17:29:19 sda   25091    196      8        25000    0        0        0        0
17:29:20 sda   24984    195      8        25000    0        0        0        0
17:29:21 sda   25000    195      8        25000    0        0        0        0
17:29:22 sda   25040    195      8        25000    0        0        0        0
17:29:23 sda   25043    195      8        25000    0        0        0        0
17:29:24 sda   25077    195      8        25000    0        0        0        0
17:29:25 sda   4813     37       8        25000    0        0        0        0
17:29:26 sda   0        0        0        0        0        0        0        0

The output is fairly self-explanatory: RIOPS and WIOPS are the current IOPS reported from the kernel, and R_QOS and W_QOS show the maximum IOPS that the 40K IOPS limit allows at the current average I/O size.

We can see that after ten seconds of running, the tool looks at the average I/O size and applies the QOS values: from 17:29:01 the read rate settles at 25000 IOPS.
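
One way to picture the ten-second adjustment is a small accumulator that is fed once per second and only yields an average when the window is full. This is a sketch under assumptions, not the logic from qos.py: the class name is hypothetical, and I assume the average is taken over the whole window, whereas the real tool may simply use the latest sample.

```python
class IoSizeWindow:
    """Accumulate per-second counters and report the average I/O size per window."""

    def __init__(self, window_secs: int = 10):
        self.window_secs = window_secs
        self.ios = 0
        self.nbytes = 0
        self.samples = 0

    def add(self, ios: int, nbytes: int):
        """Record one second of counters.

        Returns the average I/O size in KB once the window is full,
        otherwise None. The window resets after reporting.
        """
        self.ios += ios
        self.nbytes += nbytes
        self.samples += 1
        if self.samples < self.window_secs:
            return None
        avg_kb = self.nbytes // self.ios // 1024 if self.ios else 0
        self.ios = self.nbytes = self.samples = 0
        return avg_kb


window = IoSizeWindow()
for second in range(10):
    result = window.add(ios=25000, nbytes=25000 * 8192)
print(result)
```

Whenever add returns a value, the caller would look up the matching IOPS limit on the QOS curve and write it to the cgroup throttle file.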

Let’s run another test. This time we copy the fio script and run the copy with a 64KB block size alongside the 8KB run.

root:~/bcc/tools# ./qos.py
Tracing... Output every 1 secs. Hit Ctrl-C to end
TIME     DISK  RIOPS    R MB/s   R_AvgIO  R_QOS    WIOPS    W_AvgIO  W MB/s   W_QOS
17:37:15 ???   0        0        0        0        0        0        0        0
17:37:16 ???   0        0        0        0        0        0        0        0
17:37:17 ???   0        0        0        0        0        0        0        0
17:37:18 sda   10000    78       8        25000    1        24       0        10389
17:37:19 sda   23157    180      8        25000    0        0        0        0
17:37:20 sda   14968    116      8        25000    0        0        0        0
17:37:21 sda   25000    195      8        25000    0        0        0        0
17:37:22 sda   24984    195      8        25000    6        4        0        40000
17:37:23 sda   25000    195      8        25000    0        0        0        0
17:37:24 sda   25016    195      8        25000    17       4        0        40000
17:37:25 sda   25063    195      8        25000    0        0        0        0
17:37:26 sda   25069    195      8        25000    0        0        0        0
17:37:27 sda   25069    195      8        25000    0        0        0        0
17:37:28 sda   25116    196      8        25000    0        0        0        0
17:37:29 sda   25000    195      8        25000    0        0        0        0
17:37:30 sda   25000    195      8        25000    0        0        0        0
17:37:31 sda   25016    195      8        25000    0        0        0        0
17:37:32 sda   25054    195      8        25000    0        0        0        0
17:37:33 sda   25102    196      8        25000    0        0        0        0
17:37:34 sda   25062    195      8        25000    1        4        0        40000
17:37:35 sda   25079    195      8        25000    0        0        0        0
17:37:36 sda   25000    195      8        25000    3        4        0        40000
17:37:37 sda   25000    195      8        25000    0        0        0        0
17:37:38 sda   25016    195      8        25000    0        0        0        0
17:37:39 sda   25064    195      8        25000    0        0        0        0
17:37:40 sda   25074    195      8        25000    0        0        0        0
17:37:41 sda   21844    263      12       18221    0        0        0        0
17:37:42 sda   14592    385      27       9336     0        0        0        0
17:37:43 sda   9327     243      26       9427     0        0        0        0
17:37:44 sda   9330     245      26       9368     0        0        0        0
17:37:45 sda   9362     247      27       9332     0        0        0        0
17:37:46 sda   9345     250      27       9207     3        4        0        40000
17:37:47 sda   9355     254      27       9092     0        0        0        0
17:37:48 sda   8038     320      40       6269     0        0        0        0
17:37:49 sda   6287     392      64       4000     0        0        0        0
17:37:50 sda   4000     250      64       4000     0        0        0        0
17:37:51 sda   4000     250      64       4000     0        0        0        0
17:37:52 sda   4000     250      64       4000     0        0        0        0
17:37:53 sda   4000     250      64       4000     0        0        0        0
17:37:54 sda   4032     211      53       4769     0        0        0        0
17:37:55 sda   4007     108      27       9110     0        0        0        0
17:37:56 sda   4019     110      28       8992     0        0        0        0
17:37:57 sda   8977     245      28       9038     0        0        0        0
17:37:58 sda   8990     243      27       9105     0        0        0        0
17:37:59 sda   9028     247      28       9003     0        0        0        0
17:38:00 sda   9002     239      27       9256     0        0        0        0
17:38:01 sda   9005     244      27       9085     0        0        0        0
17:38:02 sda   8995     246      28       9006     0        0        0        0
17:38:03 sda   9030     257      29       8709     0        0        0        0
17:38:04 sda   8698     251      29       8581     0        0        0        0
17:38:05 sda   8700     253      29       8545     0        0        0        0
17:38:06 sda   8700     251      29       8587     3        4        0        40000
17:38:07 sda   8700     248      29       8692     0        0        0        0
17:38:08 sda   8732     248      29       8726     0        0        0        0
17:38:09 sda   8713     249      29       8680     0        0        0        0
17:38:10 sda   8714     248      29       8714     2        4        0        40000
17:38:11 sda   8710     169      19       12243    0        0        0        0
17:38:12 sda   8710     68       8        25000    0        0        0        0
17:38:13 sda   8710     68       8        25000    0        0        0        0
17:38:14 sda   8710     68       8        25000    0        0        0        0
17:38:15 sda   8754     68       8        25000    0        0        0        0

We can see the tool running along at 8KB (R_AvgIO) until 17:37:41, when the 64KB workload kicks in. At 17:37:48 the original 8KB workload ends and we move to the 64KB I/O size QOS (4000 IOPS). We then restart the 8KB workload at 17:37:54, and the average moves back to 8KB at 17:38:12.

So we can see that the tool is adjusting the QOS every ten seconds based on the workloads.

We can also run mixed read and write workloads, let’s look at an example.

; fio-rand-RW.job for fiotest

[global]
name=fio-rand-RW
filename=fio-rand-read
rw=randrw
rwmixread=60
rwmixwrite=40
bs=8K,32K
direct=1
time_based=1
runtime=30

[file1]
size=3G
ioengine=libaio
iodepth=16

So we will run 8KB reads and 32KB writes. From our tooling we expect a maximum of 25K read IOPS and 8000 write IOPS.

root:~/bcc/tools# ./table.py 20000 40000 40000 |egrep "IO|^8 |^32 "
IOSIZE   COST     MINIOPS  MAXIOPS  BIOPS    MINMB    MAXMB    BURSTMB
8        160      12500    25000    25000    98       195      195
32       500      4000     8000     8000     125      250      250

Let’s baseline this and record the iostat data.

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda             77.54   19.95      0.87      0.46     0.01     0.17   0.01   0.85    0.46    1.97   0.07    11.52    23.42   0.08   0.76
sda           5626.00 3814.00     43.95    118.75     0.00     0.00   0.00   0.00    0.31    0.98   5.47     8.00    31.88   0.04  36.00
sda           15712.00 10595.00    122.75    331.09     0.00     0.00   0.00   0.00    0.30    1.00  15.30     8.00    32.00   0.04 100.00
sda           15607.00 10546.00    121.93    329.56     0.00     0.00   0.00   0.00    0.30    0.98  14.91     8.00    32.00   0.04  99.60
sda           15585.00 10519.00    121.76    328.72     0.00     0.00   0.00   0.00    0.30    0.98  15.00     8.00    32.00   0.04 100.00
sda           15546.00 10494.00    121.45    327.94     0.00     0.00   0.00   0.00    0.30    0.97  14.90     8.00    32.00   0.04  99.60
sda           15641.00 10589.00    122.20    330.82     0.00     0.00   0.00   0.00    0.29    0.98  14.98     8.00    31.99   0.04 100.00
sda           15803.00 10643.00    123.46    332.59     0.00     0.00   0.00   0.00    0.29    0.97  14.98     8.00    32.00   0.04 100.00
sda           15643.00 10566.00    122.21    330.19     0.00     0.00   0.00   0.00    0.29    0.98  14.97     8.00    32.00   0.04 100.00
sda           15523.00 10482.00    121.27    327.56     0.00     0.00   0.00   0.00    0.31    0.97  14.88     8.00    32.00   0.04  99.60
sda           15754.00 10644.00    123.08    332.62     0.00     0.00   0.00   0.00    0.29    0.98  15.04     8.00    32.00   0.04 100.00
sda           15758.00 10634.00    123.11    332.20     0.00     0.00   0.00   0.00    0.30    0.95  14.85     8.00    31.99   0.04  99.20
sda           15702.00 10614.00    122.67    331.69     0.00     0.00   0.00   0.00    0.30    0.96  14.93     8.00    32.00   0.04 100.00
sda           15733.00 10607.00    122.91    331.44     0.00     0.00   0.00   0.00    0.30    0.97  15.01     8.00    32.00   0.04 100.00
sda           15692.00 10595.00    122.59    331.09     0.00     0.00   0.00   0.00    0.27    0.98  14.70     8.00    32.00   0.04 100.00
sda           15617.00 10557.00    122.01    329.91     0.00     0.00   0.00   0.00    0.31    0.97  15.12     8.00    32.00   0.04  99.20
sda           15592.00 10532.00    121.81    329.04     0.00     0.00   0.00   0.00    0.30    0.98  15.09     8.00    31.99   0.04 100.00
sda           15383.17 10385.15    120.18    324.54     0.00     0.00   0.00   0.00    0.30    0.99  14.87     8.00    32.00   0.04  99.01
sda           15683.00 10582.00    122.52    330.69     0.00     0.00   0.00   0.00    0.30    0.98  15.02     8.00    32.00   0.04  99.60
sda           15774.00 10650.00    123.23    332.75     0.00     0.00   0.00   0.00    0.29    0.97  14.92     8.00    31.99   0.04 100.00
sda           15614.00 10549.00    121.98    329.72     0.00     0.00   0.00   0.00    0.30    0.98  15.04     8.00    32.01   0.04  99.60
sda           15678.00 10594.00    122.48    330.98     0.00     0.00   0.00   0.00    0.30    0.97  15.06     8.00    31.99   0.04 100.00
sda           15677.00 10562.00    122.48    330.06     0.00     0.00   0.00   0.00    0.31    0.96  14.93     8.00    32.00   0.04 100.00
sda           15733.00 10655.00    122.91    332.56     0.00     0.00   0.00   0.00    0.29    0.98  14.98     8.00    31.96   0.04  98.80
sda           15719.00 10607.00    122.80    331.47     0.00     0.00   0.00   0.00    0.29    0.98  14.96     8.00    32.00   0.04 100.00
sda           15436.00 10434.00    120.59    326.06     0.00     0.00   0.00   0.00    0.31    0.99  15.16     8.00    32.00   0.04 100.00
sda           15605.00 10534.00    121.91    329.11     0.00     0.00   0.00   0.00    0.31    0.98  15.16     8.00    31.99   0.04 100.00
sda           15624.00 10538.00    122.06    329.31     0.00     0.00   0.00   0.00    0.31    0.93  14.66     8.00    32.00   0.04  98.80
sda           15444.00 10444.00    120.66    326.37     0.00     0.00   0.00   0.00    0.30    0.99  14.88     8.00    32.00   0.04  99.60
sda           15723.00 10590.00    122.84    330.94     0.00     0.00   0.00   0.00    0.30    0.94  14.78     8.00    32.00   0.04  99.60
sda           15109.00 10198.00    118.04    318.69     0.00     0.00   0.00   0.00    0.34    0.99  15.22     8.00    32.00   0.04 100.00
sda           9866.00 6672.00     77.08    208.34     0.00     0.00   0.00   0.00    0.29    1.00   9.50     8.00    31.97   0.04  62.80

We can see ~15K read and ~10K write IOPS. So let’s apply QOS and re-run.

root:~/bcc/tools# ./qos.py
Tracing... Output every 1 secs. Hit Ctrl-C to end
TIME     DISK  RIOPS    R MB/s   R_AvgIO  R_QOS    WIOPS    W_AvgIO  W MB/s   W_QOS
17:45:47 ???   0        0        0        0        0        0        0        0
17:45:48 ???   0        0        0        0        0        0        0        0
17:45:49 ???   0        0        0        0        0        0        0        0
17:45:50 sda   49       0        8        25000    30       32       0        8000
17:45:51 sda   15299    119      8        25000    10343    32       323      8000
17:45:52 sda   15491    121      8        25000    10447    32       326      8000
17:45:53 sda   15240    119      8        25000    10305    32       322      8000
17:45:54 sda   11901    92       8        25000    8000     32       250      8000
17:45:55 sda   11845    92       8        25000    8000     31       249      8002
17:45:56 sda   11839    92       8        25000    8000     32       250      8000
17:45:57 sda   11842    92       8        25000    8000     32       250      8000
17:45:58 sda   11835    92       8        25000    8000     32       250      8000
17:45:59 sda   11846    92       8        25000    8016     32       250      8000
17:46:00 sda   11874    92       8        25000    8008     31       250      8002
17:46:01 sda   11884    92       8        25000    8000     32       250      8000
17:46:02 sda   11837    92       8        25000    8000     32       250      8000
17:46:03 sda   11844    92       8        25000    8016     32       250      8000
17:46:04 sda   11849    92       8        25000    8011     32       250      8000
17:46:05 sda   11848    92       8        25000    7995     31       249      8003
17:46:06 sda   10741    83       8        25000    7253     32       226      8000
17:46:07 sda   11473    89       8        25000    7740     31       241      8003
17:46:08 sda   11865    92       8        25000    7999     32       249      8000
17:46:09 sda   11848    92       8        25000    8000     32       250      8000
17:46:10 sda   11830    92       8        25000    8000     31       249      8002
17:46:11 sda   11819    92       8        25000    7985     32       249      8000
17:46:12 sda   11877    92       8        25000    8015     32       250      8000
17:46:13 sda   11882    92       8        25000    8015     32       250      8000
17:46:14 sda   11886    92       8        25000    8029     32       250      8000
17:46:15 sda   11849    92       8        25000    7999     31       249      8002
17:46:16 sda   11845    92       8        25000    8000     32       250      8000
17:46:17 sda   11853    92       8        25000    8000     32       250      8000
17:46:18 sda   11867    92       8        25000    8000     32       250      8000
17:46:19 sda   11859    92       8        25000    8016     32       250      8000
17:46:20 sda   11064    86       8        25000    7495     32       234      8000
17:46:21 sda   0        0        0        0        3        4        0        40000
17:46:22 sda   0        0        0        0        0        0        0        0
17:46:23 sda   0        0        0        0        0        0        0        0
17:46:24 sda   0        0        0        0        0        0        0        0
^C17:46:25 sda   0        0        0        0        0        0        0        0

iostat data:

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda           5158.00 3480.00     40.30    108.75     0.00     0.00   0.00   0.00    0.35    1.08   5.56     8.00    32.00   0.04  36.40
sda           15682.00 10576.00    122.52    330.50     0.00     0.00   0.00   0.00    0.30    0.95  14.80     8.00    32.00   0.04 100.00
sda           15261.00 10323.00    119.23    322.59     0.00     0.00   0.00   0.00    0.31    0.99  14.96     8.00    32.00   0.04 100.00
sda           14169.00 9566.00    110.70    298.94     0.00     0.00   0.00   0.00    0.32    0.96  13.73     8.00    32.00   0.04  92.00
sda           11856.00 8006.00     92.62    250.19     0.00     0.00   0.00   0.00    0.29    0.93  10.90     8.00    32.00   0.04  75.20
sda           11821.00 7978.00     92.35    249.23     0.00     0.00   0.00   0.00    0.30    0.96  11.20     8.00    31.99   0.04  76.00
sda           11859.00 7996.00     92.65    249.84     0.00     0.00   0.00   0.00    0.31    0.92  11.10     8.00    32.00   0.04  77.20
sda           11886.00 8026.00     92.86    250.84     0.00     0.00   0.00   0.00    0.31    0.94  11.18     8.00    32.00   0.04  77.20
sda           11681.00 7886.00     91.26    246.44     0.00     0.00   0.00   0.00    0.32    0.97  11.33     8.00    32.00   0.04  76.80
sda           11991.00 8112.00     93.68    253.50     0.00     0.00   0.00   0.00    0.30    0.94  11.22     8.00    32.00   0.04  77.20
sda           11762.00 7941.00     91.89    248.07     0.00     0.00   0.00   0.00    0.31    0.93  10.97     8.00    31.99   0.04  75.60
sda           11815.00 7980.00     92.30    249.38     0.00     0.00   0.00   0.00    0.31    0.94  11.16     8.00    32.00   0.04  76.40
sda           11781.00 7964.00     92.04    248.88     0.00     0.00   0.00   0.00    0.37    0.98  12.18     8.00    32.00   0.04  82.80
sda           11893.00 8010.00     92.91    250.31     0.00     0.00   0.00   0.00    0.47    0.95  13.21     8.00    32.00   0.05  89.60
sda           11831.00 7985.00     92.43    249.53     0.00     0.00   0.00   0.00    0.47    0.95  13.08     8.00    32.00   0.05  90.00
sda           10653.00 7214.00     83.23    225.33     0.00     0.00   0.00   0.00    0.46    1.21  13.66     8.00    31.98   0.05  92.00
sda           11691.00 7884.00     91.34    246.38     0.00     0.00   0.00   0.00    0.49    0.94  13.08     8.00    32.00   0.05  91.20
sda           11657.00 7860.00     91.07    245.52     0.00     0.00   0.00   0.00    0.35    1.00  11.88     8.00    31.99   0.04  81.20
sda           11834.00 7996.00     92.45    249.87     0.00     0.00   0.00   0.00    0.32    0.97  11.52     8.00    32.00   0.04  77.20
sda           11844.00 8002.00     92.53    250.06     0.00     0.00   0.00   0.00    0.34    0.94  11.52     8.00    32.00   0.04  78.00
sda           11802.00 7982.00     92.20    249.36     0.00     0.00   0.00   0.00    0.35    0.92  11.44     8.00    31.99   0.04  79.60
sda           11858.00 8008.00     92.64    250.25     0.00     0.00   0.00   0.00    0.30    0.93  10.99     8.00    32.00   0.04  77.60
sda           11866.00 8000.00     92.70    250.00     0.00     0.00   0.00   0.00    0.31    0.93  11.12     8.00    32.00   0.04  76.40
sda           11854.00 7993.00     92.61    249.78     0.00     0.00   0.00   0.00    0.30    0.95  11.15     8.00    32.00   0.04  76.40
sda           11736.00 7928.00     91.69    247.75     0.00     0.00   0.00   0.00    0.30    0.96  11.12     8.00    32.00   0.04  76.00
sda           11865.00 8008.00     92.70    250.17     0.00     0.00   0.00   0.00    0.31    0.94  11.21     8.00    31.99   0.04  77.20
sda           11839.00 7997.00     92.49    249.91     0.00     0.00   0.00   0.00    0.32    0.93  11.23     8.00    32.00   0.04  75.20
sda           11841.00 7997.00     92.51    249.91     0.00     0.00   0.00   0.00    0.31    0.96  11.38     8.00    32.00   0.04  77.20
sda           11836.00 8004.00     92.47    250.13     0.00     0.00   0.00   0.00    0.32    0.90  11.03     8.00    32.00   0.04  76.40
sda           11844.00 8001.00     92.53    250.03     0.00     0.00   0.00   0.00    0.31    0.95  11.23     8.00    32.00   0.04  76.80
sda           7501.00 5072.00     58.60    158.50     0.00     0.00   0.00   0.00    0.32    0.94   7.13     8.00    32.00   0.04  48.80

Here the OS is now pushing only ~11K read IOPS, and the writes have been brought down to the 8K IOPS limit. The read rate is below what iostat showed in the baseline test and well under the 25K read limit, so let’s run with QOS enabled at the cgroup but without the monitoring tool.

First we need to set the cgroup QOS to what we expect it to be.

root:/tmp# echo '8:0 8000' > /sys/fs/cgroup/blkio/blkio.throttle.write_iops_device
root:/tmp# cat /sys/fs/cgroup/blkio/blkio.throttle.*iops*
8:0 25000
8:0 8000
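
The same write can be done from Python, which is roughly what a monitoring tool has to do each time it adjusts a limit. This is a minimal sketch and not the code from qos.py: set_iops_limit is a hypothetical helper, and it assumes the cgroup v1 blkio hierarchy at the path used in this post. The demo at the end writes into a temporary directory so it can be run without root.

```python
import os
import tempfile

BLKIO_ROOT = "/sys/fs/cgroup/blkio"  # cgroup v1 path used in this post


def set_iops_limit(device: str, iops: int, direction: str, root: str = BLKIO_ROOT) -> str:
    """Write 'major:minor iops' to the read or write IOPS throttle file."""
    if direction not in ("read", "write"):
        raise ValueError("direction must be 'read' or 'write'")
    path = os.path.join(root, f"blkio.throttle.{direction}_iops_device")
    with open(path, "w") as f:
        f.write(f"{device} {iops}\n")
    return path


# Demo against a scratch directory instead of the real cgroup filesystem.
with tempfile.TemporaryDirectory() as demo_root:
    written = set_iops_limit("8:0", 8000, "write", root=demo_root)
    with open(written) as f:
        print(f.read().strip())
```

On a real system the default root would be used, and the call needs the same privileges as the echo above.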

From the iostat data below we can see that it’s the cgroup QOS which is limiting the read IOPS to 11K, not the tooling.

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda           8150.00 5491.00     63.67    171.59     0.00     0.00   0.00   0.00    0.29    0.99   7.82     8.00    32.00   0.04  52.40
sda           10814.00 7297.00     84.48    228.03     0.00     0.00   0.00   0.00    0.38    1.04  11.77     8.00    32.00   0.04  78.40
sda           11813.00 7982.00     92.29    249.44     0.00     0.00   0.00   0.00    0.28    0.95  10.94     8.00    32.00   0.04  74.00
sda           11872.00 8021.00     92.75    250.55     0.00     0.00   0.00   0.00    0.29    0.96  11.11     8.00    31.99   0.04  74.40
sda           11838.00 8006.00     92.48    250.19     0.00     0.00   0.00   0.00    0.28    0.94  10.91     8.00    32.00   0.04  74.00
sda           11832.00 7991.00     92.44    249.72     0.00     0.00   0.00   0.00    0.29    0.97  11.08     8.00    32.00   0.04  74.40
sda           11867.00 8004.00     92.71    250.10     0.00     0.00   0.00   0.00    0.30    0.93  10.92     8.00    32.00   0.04  74.80
sda           11806.00 7969.00     92.23    249.03     0.00     0.00   0.00   0.00    0.30    0.95  11.11     8.00    32.00   0.04  74.80
sda           11899.00 8032.00     92.96    250.89     0.00     0.00   0.00   0.00    0.27    0.94  10.75     8.00    31.99   0.04  74.80
sda           11827.00 7989.00     92.40    249.66     0.00     0.00   0.00   0.00    0.28    0.96  11.06     8.00    32.00   0.04  74.80
sda           11845.00 8006.00     92.54    250.19     0.00     0.00   0.00   0.00    0.28    0.97  11.13     8.00    32.00   0.04  74.80
sda           11842.00 8002.00     92.52    250.01     0.00     0.00   0.00   0.00    0.28    0.96  11.02     8.00    31.99   0.04  74.00
sda           11850.00 7998.00     92.58    249.94     0.00     0.00   0.00   0.00    0.28    0.96  10.97     8.00    32.00   0.04  74.80
sda           11871.00 8004.00     92.74    250.04     0.00     0.00   0.00   0.00    0.28    0.96  11.00     8.00    31.99   0.04  74.80
sda           11776.00 7951.00     92.00    248.47     0.00     0.00   0.00   0.00    0.28    0.98  11.08     8.00    32.00   0.04  75.60
sda           11928.00 8053.00     93.19    251.66     0.00     0.00   0.00   0.00    0.31    0.92  11.08     8.00    32.00   0.04  75.60
sda           11848.00 8005.00     92.56    250.16     0.00     0.00   0.00   0.00    0.30    0.96  11.15     8.00    32.00   0.04  74.80
sda           11849.00 8008.00     92.57    250.25     0.00     0.00   0.00   0.00    0.30    0.95  11.11     8.00    32.00   0.04  75.20
sda           11835.00 7997.00     92.46    249.82     0.00     0.00   0.00   0.00    0.29    0.95  10.97     8.00    31.99   0.04  74.40
sda           11850.00 7985.00     92.58    249.53     0.00     0.00   0.00   0.00    0.28    0.95  10.81     8.00    32.00   0.04  75.60
sda           11857.00 7998.00     92.63    249.94     0.00     0.00   0.00   0.00    0.28    0.96  11.04     8.00    32.00   0.04  76.00
sda           11873.00 8019.00     92.76    250.59     0.00     0.00   0.00   0.00    0.28    0.94  10.82     8.00    32.00   0.04  74.80
sda           11824.00 7983.00     92.38    249.47     0.00     0.00   0.00   0.00    0.30    0.92  10.94     8.00    32.00   0.04  76.00
sda           11843.00 8011.00     92.52    250.26     0.00     0.00   0.00   0.00    0.28    0.97  11.16     8.00    31.99   0.04  75.60
sda           11717.82 7921.78     91.55    247.56     0.00     0.00   0.00   0.00    0.28    0.94  10.73     8.00    32.00   0.04  73.66
sda           11860.00 7994.00     92.66    249.81     0.00     0.00   0.00   0.00    0.29    0.95  11.00     8.00    32.00   0.04  74.40
sda           11860.00 7996.00     92.66    249.87     0.00     0.00   0.00   0.00    0.28    0.95  10.89     8.00    32.00   0.04  74.80
sda           11838.00 7991.00     92.48    249.72     0.00     0.00   0.00   0.00    0.30    0.96  11.16     8.00    32.00   0.04  76.80
sda           11854.00 8013.00     92.61    250.32     0.00     0.00   0.00   0.00    0.29    0.94  10.98     8.00    31.99   0.04  75.20
sda           11828.00 7986.00     92.41    249.56     0.00     0.00   0.00   0.00    0.29    0.95  10.98     8.00    32.00   0.04  75.20
sda           3733.00 2528.00     29.16     79.00     0.00     0.00   0.00   0.00    0.29    0.93   3.44     8.00    32.00   0.04  24.00
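
A plausible explanation for the read figure, which I have not verified inside fio, is that the single job issues reads and writes at the configured 60/40 mix. Once writes are held at 8000 IOPS, reads can only run at about 60/40 of that rate, well short of the 25000 read limit. The helper name below is hypothetical.

```python
def coupled_read_iops(write_cap: int, read_pct: int) -> int:
    """Read IOPS implied by a write cap when the job keeps a fixed read/write mix."""
    write_pct = 100 - read_pct
    return write_cap * read_pct // write_pct


# rwmixread=60 with writes throttled to 8000 IOPS.
print(coupled_read_iops(8000, 60))
```

That gives 12000, close to the ~11.8K reads per second in the iostat output. The baseline run shows a similar ratio, with roughly 15.7K reads to 10.6K writes.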

What’s the overhead?

Looking at the fio CPU usage with just cgroup QOS and no monitoring:

cpu          : usr=5.81%, sys=17.43%, ctx=212268, majf=0, minf=13

And with QOS monitoring running.

cpu          : usr=5.81%, sys=21.99%, ctx=193375, majf=0, minf=11

With monitoring running, system CPU rises from 17.43% to 21.99%, while user CPU is unchanged at 5.81%.

Conclusion

As you can see, we can dynamically change the QOS limits based on the average I/O size whilst tracing the kernel, similar to what the SolidFire array does, and we can control reads and writes separately for better granularity. As with all monitoring there will be overheads.

If you are interested the source code is here: qos.py
