In my previous posts I showed how the SolidFire array implements QOS on a per-volume basis. As I was looking for a new project, I asked myself: could I write a similar QOS program that runs on a virtual machine?
I wondered whether I could dynamically change the Linux built-in QOS (throttling), which is implemented via cgroups, based on a SolidFire QOS curve.
To get this working I would need to look at the I/O rate entering the block layer in the Linux kernel, and writing a kernel module is not something I could do given I'm not great at C programming. However, I had contributed to the IO Visor BCC project a few years ago and thought this was exactly the tool set to help me achieve this. For further information see https://github.com/iovisor/bcc
Methodology
- Trace IOPS and BW for each disk entering the block layer.
  - Using eBPF
- Return these back to user space, then work out the average I/O size and QOS.
  - Using eBPF and Python
- Apply QOS settings based on the read and write average I/O size.
  - Write "maj:min qos" to blkio.throttle.read_iops_device and blkio.throttle.write_iops_device
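The second step, turning traced counters into an average I/O size, might be sketched as below. This is a minimal sketch under assumptions, not the code in qos.py: the function name `interval_stats` and its arguments are hypothetical, and whole-number truncation is assumed so that the figures resemble what the tool prints.

```python
def interval_stats(ios, nbytes, interval_secs=1):
    """Turn raw counters for one disk and one direction (read or write)
    into (IOPS, MB/s, average I/O size in KB).

    ios    -- number of I/Os seen entering the block layer (hypothetical input)
    nbytes -- total bytes carried by those I/Os (hypothetical input)
    """
    if ios == 0:
        # Nothing traced in this interval, so there is no average to report.
        return 0, 0, 0
    iops = ios // interval_secs
    mb_per_sec = nbytes // (1024 * 1024) // interval_secs
    avg_kb = nbytes // ios // 1024
    return iops, mb_per_sec, avg_kb


# 11901 reads of 8KB traced in one second
print(interval_stats(11901, 11901 * 8192))  # (11901, 92, 8)
```

The average I/O size is the only value the QOS curve needs; IOPS and MB/s are there for display.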
A Quick SolidFire QOS Overview
QOS on SolidFire works by rate limiting the I/O coming into the array for each volume. The limit is based on the average I/O size, normalised against the cost of a 4KB I/O.
If we were to set the disk's maximum QOS to 40K IOPS based on a 4KB I/O size, we would get 25K IOPS at 8KB based on the SolidFire QOS curve.
root:~/bcc/tools# ./table.py 40000 40000 40000 |more
IOSIZE COST MINIOPS MAXIOPS BIOPS MINMB MAXMB BURSTMB
4      100  40000   40000   40000 156   156   156
5      115  34783   34783   34783 170   170   170
6      130  30769   30769   30769 180   180   180
7      145  27586   27586   27586 189   189   189
8      160  25000   25000   25000 195   195   195
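The rows printed by table.py appear to follow a simple rule: the 4KB IOPS setting multiplied by 100 and divided by the COST value for the I/O size. The sketch below assumes that rule, which is inferred from the printed output rather than taken from the table.py source. The `COST` dictionary holds only the values that appear in this post, and the function names are hypothetical.

```python
# Cost per I/O size in KB, copied from the COST column printed by table.py
# in this post. Only sizes that appear in the post are listed.
COST = {4: 100, 5: 115, 6: 130, 7: 145, 8: 160, 32: 500}


def scaled_iops(iops_4k, io_size_kb):
    """Scale a 4KB IOPS setting to another I/O size (assumed rule)."""
    return round(iops_4k * COST[4] / COST[io_size_kb])


def bandwidth_mb(iops, io_size_kb):
    """Bandwidth in MB/s for a given IOPS figure and I/O size in KB."""
    return round(iops * io_size_kb / 1024)


for size in sorted(COST):
    iops = scaled_iops(40000, size)
    print(size, COST[size], iops, bandwidth_mb(iops, size))
```

For a 40000 IOPS setting this gives 25000 IOPS and 195 MB/s at 8KB, which matches the table above.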
SolidFire uses a total IOPS figure for its QOS, whereas Linux cgroups allow read and write throttling to be split. This is much more flexible, so in my example I will apply QOS to reads and writes separately.
Testing
For testing QOS I used fio, which is available at https://github.com/axboe/fio, as it allows great flexibility. The first test I wanted to do was some random reads to see what I could expect from my VM.
Running the following fio random read test, we can see from the iostat data that we are getting over 40K IOPS at 8KB.
; fio-rand-read.job for fiotest
[global]
name=fio-rand-read
filename=fio-rand-read
rw=randread
bs=64K
direct=1
numjobs=1
time_based=1
runtime=30
[file1]
size=3G
ioengine=libaio
iodepth=16
Here is a sample of the iostat output, which shows roughly 47K IOPS and 366MB/sec.
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sda 46805.00 0.00 365.66 0.00 0.00 0.00 0.00 0.00 0.32 0.00 14.95 8.00 0.00 0.02 100.00
sda 47065.00 0.00 367.70 0.00 0.00 0.00 0.00 0.00 0.31 0.00 14.82 8.00 0.00 0.02 100.00
sda 46958.00 0.00 366.86 0.00 0.00 0.00 0.00 0.00 0.32 0.00 14.82 8.00 0.00 0.02 100.00
sda 47319.00 17.00 369.68 0.07 0.00 0.00 0.00 0.00 0.32 0.00 15.02 8.00 4.00 0.02 100.00
sda 46877.00 0.00 366.22 0.00 0.00 0.00 0.00 0.00 0.32 0.00 14.84 8.00 0.00 0.02 100.00
As shown below, a 40K IOPS maximum gives us 25K max IOPS at 8KB.
root:~/bcc/tools# ./table.py 20000 40000 50000
IOSIZE COST MINIOPS MAXIOPS BIOPS MINMB MAXMB BURSTMB
4      100  20000   40000   50000 78    156   195
8      160  12500   25000   31250 98    195   244
Setup
To set up QOS for each disk we need to enter the major:minor device numbers and the max IOPS into a file named qos_setup. An example is below.
root:~/bcc/tools# cat qos_setup
8:0 40000
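Reading this file takes only a few lines of Python. The sketch below is one possible way to do it and is not taken from qos.py: the function name `parse_qos_setup` is hypothetical, and skipping blank lines and `#` comments is an assumption about what such a file might contain.

```python
def parse_qos_setup(text):
    """Parse lines of 'major:minor max_iops' into a {device: max_iops} dict."""
    limits = {}
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and '#' comments are skipped (assumed convenience).
        if not line or line.startswith("#"):
            continue
        device, max_iops = line.split()
        major, minor = device.split(":")
        # Normalise the key so it matches the 'maj:min' form cgroups expects.
        limits[f"{int(major)}:{int(minor)}"] = int(max_iops)
    return limits


print(parse_qos_setup("8:0 40000\n"))  # {'8:0': 40000}
```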
The first thing we need to do is make sure the process and its children can be controlled via cgroups, so let's add the PID to the blkio tasks file.
root:/tmp# echo $$ > /sys/fs/cgroup/blkio/tasks
root:/tmp# fio ./randr
Looking at the QOS tooling output below, we can see that after 10 seconds the tool checks the current average I/O size and sets the QOS values.
root:~/bcc/tools# ./qos.py
Tracing... Output every 1 secs. Hit Ctrl-C to end
TIME DISK RIOPS R MB/s R_AvgIO R_QOS WIOPS W_AvgIO W MB/s W_QOS
17:28:52 ??? 0 0 0 0 0 0 0 0
17:28:53 ??? 0 0 0 0 0 0 0 0
17:28:54 ??? 0 0 0 0 0 0 0 0
17:28:55 sda 33612 262 8 25000 0 0 0 0
17:28:56 sda 44305 346 8 25000 0 0 0 0
17:28:57 sda 32150 251 8 25000 0 0 0 0
17:28:58 sda 8471 66 8 25000 0 0 0 0
17:28:59 sda 17276 134 8 25000 0 0 0 0
17:29:00 sda 20492 160 8 25000 0 0 0 0
17:29:01 sda 25000 195 8 25000 0 0 0 0
17:29:02 sda 25047 195 8 25000 0 0 0 0
17:29:03 sda 25045 195 8 25000 0 0 0 0
17:29:04 sda 25096 196 8 25000 0 0 0 0
17:29:05 sda 25043 195 8 25000 3 4 0 40000
17:29:06 sda 24987 195 8 25000 0 0 0 0
17:29:07 sda 25000 195 8 25000 0 0 0 0
17:29:08 sda 25000 195 8 25000 0 0 0 0
17:29:09 sda 25054 195 8 25000 0 0 0 0
17:29:10 sda 25066 195 8 25000 0 0 0 0
17:29:11 sda 25037 195 8 25000 0 0 0 0
17:29:12 sda 25064 195 8 25000 0 0 0 0
17:29:13 sda 24993 195 8 25000 0 0 0 0
17:29:14 sda 25000 195 8 25000 0 0 0 0
17:29:15 sda 25016 195 8 25000 0 0 0 0
17:29:16 sda 25068 195 8 25000 0 0 0 0
17:29:17 sda 25049 195 8 25000 0 0 0 0
17:29:18 sda 25069 195 8 25000 0 0 0 0
17:29:19 sda 25091 196 8 25000 0 0 0 0
17:29:20 sda 24984 195 8 25000 0 0 0 0
17:29:21 sda 25000 195 8 25000 0 0 0 0
17:29:22 sda 25040 195 8 25000 0 0 0 0
17:29:23 sda 25043 195 8 25000 0 0 0 0
17:29:24 sda 25077 195 8 25000 0 0 0 0
17:29:25 sda 4813 37 8 25000 0 0 0 0
17:29:26 sda 0 0 0 0 0 0 0 0
The output is pretty self-explanatory: RIOPS and WIOPS are the current number of IOPS reported from the kernel, and R_QOS and W_QOS show the max IOPS allowed for the current average I/O size under the 40K IOPS limit.
We can see that after ten seconds of running, the tool looks at the average I/O size and then adjusts the QOS values: from 17:29:01 the read rate is held at the 25000 IOPS limit.
Let's run another test. This time we will copy the fio script and run the copy with 64KB I/O alongside the 8KB run.
root:~/bcc/tools# ./qos.py
Tracing... Output every 1 secs. Hit Ctrl-C to end
TIME DISK RIOPS R MB/s R_AvgIO R_QOS WIOPS W_AvgIO W MB/s W_QOS
17:37:15 ??? 0 0 0 0 0 0 0 0
17:37:16 ??? 0 0 0 0 0 0 0 0
17:37:17 ??? 0 0 0 0 0 0 0 0
17:37:18 sda 10000 78 8 25000 1 24 0 10389
17:37:19 sda 23157 180 8 25000 0 0 0 0
17:37:20 sda 14968 116 8 25000 0 0 0 0
17:37:21 sda 25000 195 8 25000 0 0 0 0
17:37:22 sda 24984 195 8 25000 6 4 0 40000
17:37:23 sda 25000 195 8 25000 0 0 0 0
17:37:24 sda 25016 195 8 25000 17 4 0 40000
17:37:25 sda 25063 195 8 25000 0 0 0 0
17:37:26 sda 25069 195 8 25000 0 0 0 0
17:37:27 sda 25069 195 8 25000 0 0 0 0
17:37:28 sda 25116 196 8 25000 0 0 0 0
17:37:29 sda 25000 195 8 25000 0 0 0 0
17:37:30 sda 25000 195 8 25000 0 0 0 0
17:37:31 sda 25016 195 8 25000 0 0 0 0
17:37:32 sda 25054 195 8 25000 0 0 0 0
17:37:33 sda 25102 196 8 25000 0 0 0 0
17:37:34 sda 25062 195 8 25000 1 4 0 40000
17:37:35 sda 25079 195 8 25000 0 0 0 0
17:37:36 sda 25000 195 8 25000 3 4 0 40000
17:37:37 sda 25000 195 8 25000 0 0 0 0
17:37:38 sda 25016 195 8 25000 0 0 0 0
17:37:39 sda 25064 195 8 25000 0 0 0 0
17:37:40 sda 25074 195 8 25000 0 0 0 0
17:37:41 sda 21844 263 12 18221 0 0 0 0
17:37:42 sda 14592 385 27 9336 0 0 0 0
17:37:43 sda 9327 243 26 9427 0 0 0 0
17:37:44 sda 9330 245 26 9368 0 0 0 0
17:37:45 sda 9362 247 27 9332 0 0 0 0
17:37:46 sda 9345 250 27 9207 3 4 0 40000
17:37:47 sda 9355 254 27 9092 0 0 0 0
17:37:48 sda 8038 320 40 6269 0 0 0 0
17:37:49 sda 6287 392 64 4000 0 0 0 0
17:37:50 sda 4000 250 64 4000 0 0 0 0
17:37:51 sda 4000 250 64 4000 0 0 0 0
17:37:52 sda 4000 250 64 4000 0 0 0 0
17:37:53 sda 4000 250 64 4000 0 0 0 0
17:37:54 sda 4032 211 53 4769 0 0 0 0
17:37:55 sda 4007 108 27 9110 0 0 0 0
17:37:56 sda 4019 110 28 8992 0 0 0 0
17:37:57 sda 8977 245 28 9038 0 0 0 0
17:37:58 sda 8990 243 27 9105 0 0 0 0
17:37:59 sda 9028 247 28 9003 0 0 0 0
17:38:00 sda 9002 239 27 9256 0 0 0 0
17:38:01 sda 9005 244 27 9085 0 0 0 0
17:38:02 sda 8995 246 28 9006 0 0 0 0
17:38:03 sda 9030 257 29 8709 0 0 0 0
17:38:04 sda 8698 251 29 8581 0 0 0 0
17:38:05 sda 8700 253 29 8545 0 0 0 0
17:38:06 sda 8700 251 29 8587 3 4 0 40000
17:38:07 sda 8700 248 29 8692 0 0 0 0
17:38:08 sda 8732 248 29 8726 0 0 0 0
17:38:09 sda 8713 249 29 8680 0 0 0 0
17:38:10 sda 8714 248 29 8714 2 4 0 40000
17:38:11 sda 8710 169 19 12243 0 0 0 0
17:38:12 sda 8710 68 8 25000 0 0 0 0
17:38:13 sda 8710 68 8 25000 0 0 0 0
17:38:14 sda 8710 68 8 25000 0 0 0 0
17:38:15 sda 8754 68 8 25000 0 0 0 0
So we can see the tool running along at 8KB (R_AvgIO) until 17:37:41, when the other workload kicks in. At 17:37:48 the original 8KB workload ends and we move to the 64KB I/O size QOS (4000 IOPS). We then restart the 8KB workload at 17:37:54, and the QOS moves back to 8KB at 17:38:12.
So we can see that the tool is adjusting the QOS every ten seconds based on the workloads.
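One way to base a ten-second adjustment on per-second samples is to keep a rolling window and weight each second by its I/O count, so a busy second counts for more than a quiet one. The sketch below illustrates that idea only. How qos.py actually averages is not shown in this post, and the class name `IoWindow` and its methods are hypothetical.

```python
from collections import deque


class IoWindow:
    """Rolling window of per-second (I/O count, bytes) samples."""

    def __init__(self, seconds=10):
        # Once the window is full, deque drops the oldest sample automatically.
        self.samples = deque(maxlen=seconds)

    def add(self, ios, nbytes):
        self.samples.append((ios, nbytes))

    def ready(self):
        """True once a full window of samples has been collected."""
        return len(self.samples) == self.samples.maxlen

    def avg_kb(self):
        """Average I/O size in KB across the window, weighted by I/O count."""
        total_ios = sum(ios for ios, _ in self.samples)
        total_bytes = sum(nbytes for _, nbytes in self.samples)
        return total_bytes / total_ios / 1024 if total_ios else 0.0


window = IoWindow()
for _ in range(10):
    window.add(25000, 25000 * 8 * 1024)  # ten seconds of 8KB reads
print(window.ready(), window.avg_kb())  # True 8.0
```

When a 64KB workload joins, the window average drifts upwards over the following seconds instead of jumping on a single sample.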
We can also run mixed read and write workloads, let’s look at an example.
; fio-rand-RW.job for fiotest
[global]
name=fio-rand-RW
filename=fio-rand-read
rw=randrw
rwmixread=60
rwmixwrite=40
bs=8K,32K
direct=1
time_based=1
runtime=30
[file1]
size=3G
ioengine=libaio
iodepth=16
So we will run 8KB reads and 32KB writes. From our tooling we expect a maximum of 25K read IOPS and 8000 write IOPS.
root:~/bcc/tools# ./table.py 20000 40000 40000 |egrep "IO|^8 |^32 "
IOSIZE COST MINIOPS MAXIOPS BIOPS MINMB MAXMB BURSTMB
8      160  12500   25000   25000 98    195   195
32     500  4000    8000    8000  125   250   250
Let's baseline this and record the iostat data.
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sda 77.54 19.95 0.87 0.46 0.01 0.17 0.01 0.85 0.46 1.97 0.07 11.52 23.42 0.08 0.76
sda 5626.00 3814.00 43.95 118.75 0.00 0.00 0.00 0.00 0.31 0.98 5.47 8.00 31.88 0.04 36.00
sda 15712.00 10595.00 122.75 331.09 0.00 0.00 0.00 0.00 0.30 1.00 15.30 8.00 32.00 0.04 100.00
sda 15607.00 10546.00 121.93 329.56 0.00 0.00 0.00 0.00 0.30 0.98 14.91 8.00 32.00 0.04 99.60
sda 15585.00 10519.00 121.76 328.72 0.00 0.00 0.00 0.00 0.30 0.98 15.00 8.00 32.00 0.04 100.00
sda 15546.00 10494.00 121.45 327.94 0.00 0.00 0.00 0.00 0.30 0.97 14.90 8.00 32.00 0.04 99.60
sda 15641.00 10589.00 122.20 330.82 0.00 0.00 0.00 0.00 0.29 0.98 14.98 8.00 31.99 0.04 100.00
sda 15803.00 10643.00 123.46 332.59 0.00 0.00 0.00 0.00 0.29 0.97 14.98 8.00 32.00 0.04 100.00
sda 15643.00 10566.00 122.21 330.19 0.00 0.00 0.00 0.00 0.29 0.98 14.97 8.00 32.00 0.04 100.00
sda 15523.00 10482.00 121.27 327.56 0.00 0.00 0.00 0.00 0.31 0.97 14.88 8.00 32.00 0.04 99.60
sda 15754.00 10644.00 123.08 332.62 0.00 0.00 0.00 0.00 0.29 0.98 15.04 8.00 32.00 0.04 100.00
sda 15758.00 10634.00 123.11 332.20 0.00 0.00 0.00 0.00 0.30 0.95 14.85 8.00 31.99 0.04 99.20
sda 15702.00 10614.00 122.67 331.69 0.00 0.00 0.00 0.00 0.30 0.96 14.93 8.00 32.00 0.04 100.00
sda 15733.00 10607.00 122.91 331.44 0.00 0.00 0.00 0.00 0.30 0.97 15.01 8.00 32.00 0.04 100.00
sda 15692.00 10595.00 122.59 331.09 0.00 0.00 0.00 0.00 0.27 0.98 14.70 8.00 32.00 0.04 100.00
sda 15617.00 10557.00 122.01 329.91 0.00 0.00 0.00 0.00 0.31 0.97 15.12 8.00 32.00 0.04 99.20
sda 15592.00 10532.00 121.81 329.04 0.00 0.00 0.00 0.00 0.30 0.98 15.09 8.00 31.99 0.04 100.00
sda 15383.17 10385.15 120.18 324.54 0.00 0.00 0.00 0.00 0.30 0.99 14.87 8.00 32.00 0.04 99.01
sda 15683.00 10582.00 122.52 330.69 0.00 0.00 0.00 0.00 0.30 0.98 15.02 8.00 32.00 0.04 99.60
sda 15774.00 10650.00 123.23 332.75 0.00 0.00 0.00 0.00 0.29 0.97 14.92 8.00 31.99 0.04 100.00
sda 15614.00 10549.00 121.98 329.72 0.00 0.00 0.00 0.00 0.30 0.98 15.04 8.00 32.01 0.04 99.60
sda 15678.00 10594.00 122.48 330.98 0.00 0.00 0.00 0.00 0.30 0.97 15.06 8.00 31.99 0.04 100.00
sda 15677.00 10562.00 122.48 330.06 0.00 0.00 0.00 0.00 0.31 0.96 14.93 8.00 32.00 0.04 100.00
sda 15733.00 10655.00 122.91 332.56 0.00 0.00 0.00 0.00 0.29 0.98 14.98 8.00 31.96 0.04 98.80
sda 15719.00 10607.00 122.80 331.47 0.00 0.00 0.00 0.00 0.29 0.98 14.96 8.00 32.00 0.04 100.00
sda 15436.00 10434.00 120.59 326.06 0.00 0.00 0.00 0.00 0.31 0.99 15.16 8.00 32.00 0.04 100.00
sda 15605.00 10534.00 121.91 329.11 0.00 0.00 0.00 0.00 0.31 0.98 15.16 8.00 31.99 0.04 100.00
sda 15624.00 10538.00 122.06 329.31 0.00 0.00 0.00 0.00 0.31 0.93 14.66 8.00 32.00 0.04 98.80
sda 15444.00 10444.00 120.66 326.37 0.00 0.00 0.00 0.00 0.30 0.99 14.88 8.00 32.00 0.04 99.60
sda 15723.00 10590.00 122.84 330.94 0.00 0.00 0.00 0.00 0.30 0.94 14.78 8.00 32.00 0.04 99.60
sda 15109.00 10198.00 118.04 318.69 0.00 0.00 0.00 0.00 0.34 0.99 15.22 8.00 32.00 0.04 100.00
sda 9866.00 6672.00 77.08 208.34 0.00 0.00 0.00 0.00 0.29 1.00 9.50 8.00 31.97 0.04 62.80
We can see ~15K read and ~10K write IOPS, so let's apply QOS and re-run.
root:~/bcc/tools# ./qos.py
Tracing... Output every 1 secs. Hit Ctrl-C to end
TIME DISK RIOPS R MB/s R_AvgIO R_QOS WIOPS W_AvgIO W MB/s W_QOS
17:45:47 ??? 0 0 0 0 0 0 0 0
17:45:48 ??? 0 0 0 0 0 0 0 0
17:45:49 ??? 0 0 0 0 0 0 0 0
17:45:50 sda 49 0 8 25000 30 32 0 8000
17:45:51 sda 15299 119 8 25000 10343 32 323 8000
17:45:52 sda 15491 121 8 25000 10447 32 326 8000
17:45:53 sda 15240 119 8 25000 10305 32 322 8000
17:45:54 sda 11901 92 8 25000 8000 32 250 8000
17:45:55 sda 11845 92 8 25000 8000 31 249 8002
17:45:56 sda 11839 92 8 25000 8000 32 250 8000
17:45:57 sda 11842 92 8 25000 8000 32 250 8000
17:45:58 sda 11835 92 8 25000 8000 32 250 8000
17:45:59 sda 11846 92 8 25000 8016 32 250 8000
17:46:00 sda 11874 92 8 25000 8008 31 250 8002
17:46:01 sda 11884 92 8 25000 8000 32 250 8000
17:46:02 sda 11837 92 8 25000 8000 32 250 8000
17:46:03 sda 11844 92 8 25000 8016 32 250 8000
17:46:04 sda 11849 92 8 25000 8011 32 250 8000
17:46:05 sda 11848 92 8 25000 7995 31 249 8003
17:46:06 sda 10741 83 8 25000 7253 32 226 8000
17:46:07 sda 11473 89 8 25000 7740 31 241 8003
17:46:08 sda 11865 92 8 25000 7999 32 249 8000
17:46:09 sda 11848 92 8 25000 8000 32 250 8000
17:46:10 sda 11830 92 8 25000 8000 31 249 8002
17:46:11 sda 11819 92 8 25000 7985 32 249 8000
17:46:12 sda 11877 92 8 25000 8015 32 250 8000
17:46:13 sda 11882 92 8 25000 8015 32 250 8000
17:46:14 sda 11886 92 8 25000 8029 32 250 8000
17:46:15 sda 11849 92 8 25000 7999 31 249 8002
17:46:16 sda 11845 92 8 25000 8000 32 250 8000
17:46:17 sda 11853 92 8 25000 8000 32 250 8000
17:46:18 sda 11867 92 8 25000 8000 32 250 8000
17:46:19 sda 11859 92 8 25000 8016 32 250 8000
17:46:20 sda 11064 86 8 25000 7495 32 234 8000
17:46:21 sda 0 0 0 0 3 4 0 40000
17:46:22 sda 0 0 0 0 0 0 0 0
17:46:23 sda 0 0 0 0 0 0 0 0
17:46:24 sda 0 0 0 0 0 0 0 0
^C17:46:25 sda 0 0 0 0 0 0 0 0
iostat data:
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sda 5158.00 3480.00 40.30 108.75 0.00 0.00 0.00 0.00 0.35 1.08 5.56 8.00 32.00 0.04 36.40
sda 15682.00 10576.00 122.52 330.50 0.00 0.00 0.00 0.00 0.30 0.95 14.80 8.00 32.00 0.04 100.00
sda 15261.00 10323.00 119.23 322.59 0.00 0.00 0.00 0.00 0.31 0.99 14.96 8.00 32.00 0.04 100.00
sda 14169.00 9566.00 110.70 298.94 0.00 0.00 0.00 0.00 0.32 0.96 13.73 8.00 32.00 0.04 92.00
sda 11856.00 8006.00 92.62 250.19 0.00 0.00 0.00 0.00 0.29 0.93 10.90 8.00 32.00 0.04 75.20
sda 11821.00 7978.00 92.35 249.23 0.00 0.00 0.00 0.00 0.30 0.96 11.20 8.00 31.99 0.04 76.00
sda 11859.00 7996.00 92.65 249.84 0.00 0.00 0.00 0.00 0.31 0.92 11.10 8.00 32.00 0.04 77.20
sda 11886.00 8026.00 92.86 250.84 0.00 0.00 0.00 0.00 0.31 0.94 11.18 8.00 32.00 0.04 77.20
sda 11681.00 7886.00 91.26 246.44 0.00 0.00 0.00 0.00 0.32 0.97 11.33 8.00 32.00 0.04 76.80
sda 11991.00 8112.00 93.68 253.50 0.00 0.00 0.00 0.00 0.30 0.94 11.22 8.00 32.00 0.04 77.20
sda 11762.00 7941.00 91.89 248.07 0.00 0.00 0.00 0.00 0.31 0.93 10.97 8.00 31.99 0.04 75.60
sda 11815.00 7980.00 92.30 249.38 0.00 0.00 0.00 0.00 0.31 0.94 11.16 8.00 32.00 0.04 76.40
sda 11781.00 7964.00 92.04 248.88 0.00 0.00 0.00 0.00 0.37 0.98 12.18 8.00 32.00 0.04 82.80
sda 11893.00 8010.00 92.91 250.31 0.00 0.00 0.00 0.00 0.47 0.95 13.21 8.00 32.00 0.05 89.60
sda 11831.00 7985.00 92.43 249.53 0.00 0.00 0.00 0.00 0.47 0.95 13.08 8.00 32.00 0.05 90.00
sda 10653.00 7214.00 83.23 225.33 0.00 0.00 0.00 0.00 0.46 1.21 13.66 8.00 31.98 0.05 92.00
sda 11691.00 7884.00 91.34 246.38 0.00 0.00 0.00 0.00 0.49 0.94 13.08 8.00 32.00 0.05 91.20
sda 11657.00 7860.00 91.07 245.52 0.00 0.00 0.00 0.00 0.35 1.00 11.88 8.00 31.99 0.04 81.20
sda 11834.00 7996.00 92.45 249.87 0.00 0.00 0.00 0.00 0.32 0.97 11.52 8.00 32.00 0.04 77.20
sda 11844.00 8002.00 92.53 250.06 0.00 0.00 0.00 0.00 0.34 0.94 11.52 8.00 32.00 0.04 78.00
sda 11802.00 7982.00 92.20 249.36 0.00 0.00 0.00 0.00 0.35 0.92 11.44 8.00 31.99 0.04 79.60
sda 11858.00 8008.00 92.64 250.25 0.00 0.00 0.00 0.00 0.30 0.93 10.99 8.00 32.00 0.04 77.60
sda 11866.00 8000.00 92.70 250.00 0.00 0.00 0.00 0.00 0.31 0.93 11.12 8.00 32.00 0.04 76.40
sda 11854.00 7993.00 92.61 249.78 0.00 0.00 0.00 0.00 0.30 0.95 11.15 8.00 32.00 0.04 76.40
sda 11736.00 7928.00 91.69 247.75 0.00 0.00 0.00 0.00 0.30 0.96 11.12 8.00 32.00 0.04 76.00
sda 11865.00 8008.00 92.70 250.17 0.00 0.00 0.00 0.00 0.31 0.94 11.21 8.00 31.99 0.04 77.20
sda 11839.00 7997.00 92.49 249.91 0.00 0.00 0.00 0.00 0.32 0.93 11.23 8.00 32.00 0.04 75.20
sda 11841.00 7997.00 92.51 249.91 0.00 0.00 0.00 0.00 0.31 0.96 11.38 8.00 32.00 0.04 77.20
sda 11836.00 8004.00 92.47 250.13 0.00 0.00 0.00 0.00 0.32 0.90 11.03 8.00 32.00 0.04 76.40
sda 11844.00 8001.00 92.53 250.03 0.00 0.00 0.00 0.00 0.31 0.95 11.23 8.00 32.00 0.04 76.80
sda 7501.00 5072.00 58.60 158.50 0.00 0.00 0.00 0.00 0.32 0.94 7.13 8.00 32.00 0.04 48.80
Here the OS is now pushing only ~11K read IOPS, and the writes have been brought down to the 8K IOPS limit. The read rate is below what iostat showed in the baseline test, so let's run with QOS set on the cgroup but without the monitoring tool.
First we need to set the cgroup QOS to the values we expect.
root:/tmp# echo '8:0 8000' > /sys/fs/cgroup/blkio/blkio.throttle.write_iops_device
root:/tmp# cat /sys/fs/cgroup/blkio/blkio.throttle.*iops*
8:0 25000
8:0 8000
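The same write can be done from Python, which is how a monitoring tool would apply a new limit. The sketch below is a minimal example and not the code in qos.py. The function name `set_iops_limit` is hypothetical, and the `cgroup_dir` argument exists only so the function can be pointed at a scratch directory for testing.

```python
import os


def set_iops_limit(device, iops, direction, cgroup_dir="/sys/fs/cgroup/blkio"):
    """Write 'maj:min iops' to the cgroup v1 blkio throttle file.

    device    -- 'major:minor' string, e.g. '8:0'
    iops      -- IOPS limit to apply
    direction -- 'read' or 'write'
    """
    if direction not in ("read", "write"):
        raise ValueError("direction must be 'read' or 'write'")
    path = os.path.join(cgroup_dir, f"blkio.throttle.{direction}_iops_device")
    with open(path, "w") as throttle_file:
        throttle_file.write(f"{device} {iops}\n")
    return path
```

Calling `set_iops_limit("8:0", 8000, "write")` as root is equivalent to the echo command above.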
From the iostat data below we can see that it is the cgroup QOS, not the tooling, that is limiting the read IOPS to around 11K.
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sda 8150.00 5491.00 63.67 171.59 0.00 0.00 0.00 0.00 0.29 0.99 7.82 8.00 32.00 0.04 52.40
sda 10814.00 7297.00 84.48 228.03 0.00 0.00 0.00 0.00 0.38 1.04 11.77 8.00 32.00 0.04 78.40
sda 11813.00 7982.00 92.29 249.44 0.00 0.00 0.00 0.00 0.28 0.95 10.94 8.00 32.00 0.04 74.00
sda 11872.00 8021.00 92.75 250.55 0.00 0.00 0.00 0.00 0.29 0.96 11.11 8.00 31.99 0.04 74.40
sda 11838.00 8006.00 92.48 250.19 0.00 0.00 0.00 0.00 0.28 0.94 10.91 8.00 32.00 0.04 74.00
sda 11832.00 7991.00 92.44 249.72 0.00 0.00 0.00 0.00 0.29 0.97 11.08 8.00 32.00 0.04 74.40
sda 11867.00 8004.00 92.71 250.10 0.00 0.00 0.00 0.00 0.30 0.93 10.92 8.00 32.00 0.04 74.80
sda 11806.00 7969.00 92.23 249.03 0.00 0.00 0.00 0.00 0.30 0.95 11.11 8.00 32.00 0.04 74.80
sda 11899.00 8032.00 92.96 250.89 0.00 0.00 0.00 0.00 0.27 0.94 10.75 8.00 31.99 0.04 74.80
sda 11827.00 7989.00 92.40 249.66 0.00 0.00 0.00 0.00 0.28 0.96 11.06 8.00 32.00 0.04 74.80
sda 11845.00 8006.00 92.54 250.19 0.00 0.00 0.00 0.00 0.28 0.97 11.13 8.00 32.00 0.04 74.80
sda 11842.00 8002.00 92.52 250.01 0.00 0.00 0.00 0.00 0.28 0.96 11.02 8.00 31.99 0.04 74.00
sda 11850.00 7998.00 92.58 249.94 0.00 0.00 0.00 0.00 0.28 0.96 10.97 8.00 32.00 0.04 74.80
sda 11871.00 8004.00 92.74 250.04 0.00 0.00 0.00 0.00 0.28 0.96 11.00 8.00 31.99 0.04 74.80
sda 11776.00 7951.00 92.00 248.47 0.00 0.00 0.00 0.00 0.28 0.98 11.08 8.00 32.00 0.04 75.60
sda 11928.00 8053.00 93.19 251.66 0.00 0.00 0.00 0.00 0.31 0.92 11.08 8.00 32.00 0.04 75.60
sda 11848.00 8005.00 92.56 250.16 0.00 0.00 0.00 0.00 0.30 0.96 11.15 8.00 32.00 0.04 74.80
sda 11849.00 8008.00 92.57 250.25 0.00 0.00 0.00 0.00 0.30 0.95 11.11 8.00 32.00 0.04 75.20
sda 11835.00 7997.00 92.46 249.82 0.00 0.00 0.00 0.00 0.29 0.95 10.97 8.00 31.99 0.04 74.40
sda 11850.00 7985.00 92.58 249.53 0.00 0.00 0.00 0.00 0.28 0.95 10.81 8.00 32.00 0.04 75.60
sda 11857.00 7998.00 92.63 249.94 0.00 0.00 0.00 0.00 0.28 0.96 11.04 8.00 32.00 0.04 76.00
sda 11873.00 8019.00 92.76 250.59 0.00 0.00 0.00 0.00 0.28 0.94 10.82 8.00 32.00 0.04 74.80
sda 11824.00 7983.00 92.38 249.47 0.00 0.00 0.00 0.00 0.30 0.92 10.94 8.00 32.00 0.04 76.00
sda 11843.00 8011.00 92.52 250.26 0.00 0.00 0.00 0.00 0.28 0.97 11.16 8.00 31.99 0.04 75.60
sda 11717.82 7921.78 91.55 247.56 0.00 0.00 0.00 0.00 0.28 0.94 10.73 8.00 32.00 0.04 73.66
sda 11860.00 7994.00 92.66 249.81 0.00 0.00 0.00 0.00 0.29 0.95 11.00 8.00 32.00 0.04 74.40
sda 11860.00 7996.00 92.66 249.87 0.00 0.00 0.00 0.00 0.28 0.95 10.89 8.00 32.00 0.04 74.80
sda 11838.00 7991.00 92.48 249.72 0.00 0.00 0.00 0.00 0.30 0.96 11.16 8.00 32.00 0.04 76.80
sda 11854.00 8013.00 92.61 250.32 0.00 0.00 0.00 0.00 0.29 0.94 10.98 8.00 31.99 0.04 75.20
sda 11828.00 7986.00 92.41 249.56 0.00 0.00 0.00 0.00 0.29 0.95 10.98 8.00 32.00 0.04 75.20
sda 3733.00 2528.00 29.16 79.00 0.00 0.00 0.00 0.00 0.29 0.93 3.44 8.00 32.00 0.04 24.00
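The read rate settling well below its own 25000 IOPS limit is consistent with the 60/40 mix in the fio job: if the job keeps reads and writes in that ratio, capping writes at 8000 IOPS also holds reads to about 12000. The sketch below works through that arithmetic. It is one plausible reading of the numbers rather than something confirmed from fio's internals, and the function name is hypothetical.

```python
def coupled_read_iops(write_cap, read_pct, read_cap):
    """Expected read IOPS when reads and writes are issued in a fixed
    read_pct/(100 - read_pct) mix and writes are capped at write_cap."""
    write_pct = 100 - read_pct
    reads_for_mix = write_cap * read_pct / write_pct
    # Reads can never exceed their own throttle limit either.
    return min(read_cap, reads_for_mix)


print(coupled_read_iops(8000, 60, 25000))  # 12000.0
```

That is close to the ~11.8K reads per second iostat reports in both throttled runs.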
What's the overhead?
Here is the fio CPU usage with just cgroup QOS and no monitoring:
cpu : usr=5.81%, sys=17.43%, ctx=212268, majf=0, minf=13
And with QOS monitoring running:
cpu : usr=5.81%, sys=21.99%, ctx=193375, majf=0, minf=11
Conclusion
As you can see, we can dynamically change the QOS limits based on the average I/O size whilst tracing the kernel, similar to what the SolidFire array does, and we can control reads and writes separately for better granularity. As with all monitoring there are overheads: in this test fio's sys CPU usage rose from 17.43% to 21.99% with the monitoring running.
If you are interested the source code is here: qos.py