Overview
In this post we will take a look at Hosts and Volumes. The same metrics are available for both, so to save duplicating the effort across two posts, they are combined here. We can gather details on hosts and volumes using the standard statvlun command, and we can also pass in a few options to give a consolidated view of the metrics.
- Cache: statcmp
- CPU: statcpu
- Hosts: statvlun / statvlun -hostsum / -vvsum
- Ports: statport -host
- Volumes: statvlun -vvsum
Getting started
How do I run the statvlun command? Below shows how to set up a password file so we don't need to input a password for each of our commands.
# PATH=$PATH:/opt/hp_3par_cli/bin/
# setpassword -saveonly -file mypass.your3par
system: your3par
user:
password:
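With the password file saved, commands can then authenticate non-interactively; a sketch assuming the CLI's -sys and -pwf global options (check the help output of your CLI version for the exact flags):

# statvlun -sys your3par -pwf mypass.your3par -rw -ni -iter 1 -d 5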
Hosts
Host performance metrics are very detailed. From the statvlun output we have the following header, which details what we can utilise:
20:01:01 12/07/2017 r/w I/O per second       KBytes per sec      Svt ms     IOSz KB
 Lun VVname Host    Port      Cur  Avg  Max     Cur    Avg    Max    Cur    Avg   Cur   Avg Qlen
Let's run through what these are:
- Lun: LUN number
- VVname: Volume name
- Host: Host name
- Port: Array port being accessed
r/w I/O per second
- Cur: Current I/Os per second
- Avg: Average I/Os per second during the sample period
- Max: Maximum I/Os per second during the sample period
KBytes per sec
- Cur: Current KBytes/sec
- Avg: Average KBytes/sec during the sample period
- Max: Maximum KBytes/sec during the sample period
Svt ms
- Cur: Current service time in milliseconds
- Avg: Average service time in milliseconds during the sample period
IOSz – I/O Size
- Cur: Current I/O size
- Avg: Average I/O size during sample period.
- Qlen: Length of the volume's I/O queue
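Since the parsing later in this post keys off awk field numbers, it is worth noting how awk's default whitespace splitting numbers these columns. A quick sanity-check one-liner that prints host, volume, current IOPS, KBytes/sec, service time and queue length from the totals rows:

# field positions: $1=Lun $2=VVname $3=Host $4=Port $5=r/w/t
# $6-$8=IOPS Cur/Avg/Max, $9-$11=KBytes/s Cur/Avg/Max,
# $12-$13=Svt Cur/Avg, $14-$15=IOSz Cur/Avg, $16($NF)=Qlen
$ awk '$4 ~ /[0-9]:[0-9]:[0-9]/ && $5 == "t" { print $3, $2, $6, $9, $12, $NF }' statvlun.out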
So which metrics are we interested in? Well, it depends, but I take the SUM of all the current IOPS, KBytes and Qlen values, and the MAX of the service time and I/O size from the current sample.
Why take the maximum? If we are looking at many volumes being accessed by many hosts, say in a VMware setup, I want to know who is having the worst experience rather than the average over them all.
statvlun command output
Below we have some example output which has been truncated to a single sample.
The command run was: statvlun -rw -ni -iter 320 -d 45
20:01:01 09/15/2017 r/w I/O per second       KBytes per sec      Svt ms     IOSz KB
 Lun VVname Host    Port      Cur  Avg  Max     Cur    Avg    Max    Cur    Avg   Cur   Avg Qlen
   0 VOL1   SERVERA 3:5:1 r   904  904  904  204147 204147 204147 157.14 157.14 225.9 225.9    -
   0 VOL1   SERVERA 3:5:1 w     0    0    0       2      2      2   8.72   8.72   8.2   8.2    -
   0 VOL1   SERVERA 3:5:1 t   904  904  904  204148 204148 204148 157.11 157.11 225.9 225.9  117
   1 VOL2   SERVERA 3:5:1 r     0    0    0       0      0      0   0.00   0.00   0.0   0.0    -
   1 VOL2   SERVERA 3:5:1 w     0    0    0       0      0      0  10.87  10.87   1.0   1.0    -
   1 VOL2   SERVERA 3:5:1 t     0    0    0       0      0      0  10.87  10.87   1.0   1.0    0
   2 VOL3   SERVERA 3:5:1 r     4    4    4      32     32     32 126.05 126.05   9.1   9.1    -
   2 VOL3   SERVERA 3:5:1 w     0    0    0       9      9      9  17.97  17.97  19.4  19.4    -
   2 VOL3   SERVERA 3:5:1 t     4    4    4      41     41     41 113.53 113.53  10.3  10.3    1
   3 VOL4   SERVERA 3:5:1 r    21   21   21     338    338    338 123.95 123.95  16.1  16.1    -
   3 VOL4   SERVERA 3:5:1 w    67   67   67    4478   4478   4478 372.72 372.72  67.3  67.3    -
   3 VOL4   SERVERA 3:5:1 t    88   88   88    4816   4816   4816 312.94 312.94  55.0  55.0    3
   4 VOL5   SERVERA 3:5:1 r     0    0    0       0      0      0   0.00   0.00   0.0   0.0    -
   4 VOL5   SERVERA 3:5:1 w     0    0    0       0      0      0   0.00   0.00   0.0   0.0    -
   4 VOL5   SERVERA 3:5:1 t     0    0    0       0      0      0   0.00   0.00   0.0   0.0    0
   5 VOL6   SERVERA 3:5:1 r     0    0    0       0      0      0   0.00   0.00   0.0   0.0    -
   5 VOL6   SERVERA 3:5:1 w     0    0    0       0      0      0   0.00   0.00   0.0   0.0    -
   5 VOL6   SERVERA 3:5:1 t     0    0    0       0      0      0   0.00   0.00   0.0   0.0    0
   6 VOL7   SERVERA 3:5:1 r     0    0    0       0      0      0   0.00   0.00   0.0   0.0    -
   6 VOL7   SERVERA 3:5:1 w     0    0    0       0      0      0   0.00   0.00   0.0   0.0    -
   6 VOL7   SERVERA 3:5:1 t     0    0    0       0      0      0   0.00   0.00   0.0   0.0    0
   0 VOL1   SERVERA 2:4:2 r   902  902  902  205696 205696 205696 201.94 201.94 228.1 228.1    -
   0 VOL1   SERVERA 2:4:2 w     0    0    0       2      2      2  28.24  28.24   9.2   9.2    -
   0 VOL1   SERVERA 2:4:2 t   902  902  902  205698 205698 205698 201.91 201.91 228.0 228.0  205
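Before eyeballing the raw output, a quick filter can surface the rows that matter; a sketch using an arbitrary queue-length threshold of 50 on the totals rows:

$ awk '$5 == "t" && $NF > 50 { print $2, $3, $4, "Qlen=" $NF, "Svt=" $12 "ms" }' statvlun.out
VOL1 SERVERA 3:5:1 Qlen=117 Svt=157.11ms
VOL1 SERVERA 2:4:2 Qlen=205 Svt=201.91ms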
In this example a few things jump out. One is that we are queuing: we have queue lengths of 117 and 205 down the two ports, which is pushing our service times up to around 200ms.
But why? The answer is pretty easy in this example: the server is reading around 200MB/sec from both ports, and we know that SERVERA is connected to the SAN at 2Gb, where roughly 200MB/sec is the maximum the HBA can sustain in one direction.
So why does the latency increase?
When the array sends data to the server, Fibre Channel uses a credit-based flow-control scheme called buffer-to-buffer credits. As Fibre Channel is a lossless protocol we don't drop frames; instead frames queue and wait for a free buffer before the next part of the data can be sent. When we issue a write we then need to wait for the acknowledgement from the array, and that acknowledgement joins the back of the queue on the read path from array to host, which increases latency.
OK, so why the queue?
The queue builds up because we can transmit data faster than the server can receive it. With most arrays connected at 8/16Gb, the array ports can transfer data 4-8x faster than an HBA connected at 2Gb; this is an oversubscription issue where one fast port feeds one slow port.
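As a rough back-of-the-envelope (1/2/4/8Gb FC use 8b/10b encoding, so 80% of the line rate carries data; 16Gb moved to 64b/66b so this formula does not apply there):

$ awk 'BEGIN {
    # usable one-way payload bandwidth = line rate (Gbaud) * 0.8 (8b/10b) / 8 bits per byte
    split("2.125 4.25 8.5", gbaud, " "); split("2Gb 4Gb 8Gb", name, " ")
    for (i = 1; i <= 3; i++)
        printf("%s FC: ~%.0f MB/s per direction\n", name[i], gbaud[i] * 1e9 * 0.8 / 8 / 1e6)
}'
2Gb FC: ~212 MB/s per direction
4Gb FC: ~425 MB/s per direction
8Gb FC: ~850 MB/s per direction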
**What to track: high queuing / service times / bandwidth (make sure you know the speed of your servers' HBAs and whether they are getting close to their maximum).**
An awk-based parsing script
We can parse this data pretty easily using awk, building associative arrays keyed on whatever we are looking for.
# awk_host: aggregate statvlun output per host, summing current IOPS, bandwidth
# and queue length, and keeping the worst (max) current service time and I/O size.
BEGIN {
    printf("%8s %-22s %1s \t %10s\n", "Time", "Host", "Type", "Value");
}
{
    # Header lines contain "KBytes"; grab the sample timestamp from them.
    if ($0 ~ /KBytes/) { t = $1 }

    # Reset per line (scratch values; not used in the output below).
    svct = 0
    iosz = 0

    # Data rows have a port (n:s:p) in field 4. Key the arrays on "host \t r/w/t".
    if ($4 ~ /[0-9]:[0-9]:[0-9]/) {
        iops[$3"\t"$5] += $6;                                         # sum current IOPS
        bw[$3"\t"$5]   += $9;                                         # sum current KBytes/sec
        if ($12 > svc[$3"\t"$5]) { svc[$3"\t"$5] = $12; svct = $12 }  # max service time
        if ($14 > ios[$3"\t"$5]) { ios[$3"\t"$5] = $14; iosz = $14 }  # max I/O size
        if ($5 == "t") { qu[$3"\tq"] += $NF; }                        # sum Qlen (totals rows)
    }

    # A blank line ends the sample: print everything and clear the arrays.
    if ($0 ~ /^$/) {
        for (io in iops) {
            split(io, iop, "\t")
            printf("%8s %-22s IOPS_%1s \t %10s\n", t, iop[1], iop[2], iops[io]);
            delete iops[io];
        }
        for (b in bw) {
            split(b, mb, "\t")
            printf("%8s %-22s BAND_%1s \t %10s\n", t, mb[1], mb[2], bw[b]/1024);   # KB/s -> MB/s
            delete bw[b]
        }
        for (q in qu) {
            split(q, qln, "\t")
            printf("%8s %-22s QLEN_%1s \t %10s\n", t, qln[1], qln[2], qu[q]);
            delete qu[q]
        }
        for (s in svc) {
            split(s, lat, "\t")
            printf("%8s %-22s SVCT_%1s \t %10s\n", t, lat[1], lat[2], svc[s]);
            delete svc[s]
        }
        for (i in ios) {
            split(i, isz, "\t")
            printf("%8s %-22s IOSZ_%1s \t %10s\n", t, isz[1], isz[2], ios[i]);
            delete ios[i]
        }
    }
}
We can also use this script with a file by doing the following.
# awk -f ./awk_host statvlun.out | more
    Time Host                   Type          Value
20:01:01 SERVERA                IOPS_r         1856
20:01:01 SERVERA                IOPS_t         1990
20:01:01 SERVERA                IOPS_w          134
20:01:01 SERVERA                BAND_r      400.967
20:01:01 SERVERA                BAND_t      409.705
20:01:01 SERVERA                BAND_w      8.73926
20:01:01 SERVERA                QLEN_q          329
20:01:01 SERVERA                SVCT_r       201.94
20:01:01 SERVERA                SVCT_t       350.34
20:01:01 SERVERA                SVCT_w       410.13
20:02:08 SERVERA                IOPS_r         2152
20:02:08 SERVERA                IOPS_t         2158
20:02:08 SERVERA                IOPS_w            6
20:02:08 SERVERA                BAND_r       398.93
20:02:08 SERVERA                BAND_t      399.664
20:02:08 SERVERA                BAND_w     0.733398
20:02:08 SERVERA                QLEN_q          230
20:02:08 SERVERA                SVCT_r       136.81
20:02:08 SERVERA                SVCT_t       136.81
20:02:08 SERVERA                SVCT_w         7.52
.
.
.
20:23:42 SERVERA                BAND_r      83.8418
20:23:42 SERVERA                BAND_t      127.925
20:23:42 SERVERA                BAND_w       44.082
20:23:42 SERVERA                QLEN_q            0
20:23:42 SERVERA                SVCT_r         6.39
20:23:42 SERVERA                SVCT_t         7.06
20:23:42 SERVERA                SVCT_w         9.88
I think it's good to look at the above and see what happens to the response times when we are not queuing or maxing out our HBAs at the server side. We can see that we have sub-10ms response times as a maximum across all the volumes allocated to the host.
For volumes we just replace $3 with $2 (keying on VVname instead of Host) in the awk script, as shown below.
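One quick way to generate the per-volume variant, assuming the host script above is saved as ./awk_host (the sed just swaps every field reference; the printed header will still say Host, which is cosmetic only):

$ sed 's/[$]3/$2/g' ./awk_host > ./awk_vol
$ awk -f ./awk_vol statvlun.out | more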
20:01:01 SERVERA-CLSTR-VOL5    IOPS_r            0
20:01:01 SERVERA-CLSTR-VOL5    IOPS_t            0
20:01:01 SERVERA-CLSTR-VOL5    IOPS_w            0
20:01:01 SERVERA-CLSTR-VOL7    IOPS_r            0
20:01:01 SERVERA-CLSTR-VOL7    IOPS_t            0
20:01:01 SERVERA-CLSTR-VOL7    IOPS_w            0
20:01:01 SERVERA-CLSTR-VOL2    IOPS_r            0
20:01:01 SERVERA-CLSTR-VOL2    IOPS_t            0
20:01:01 SERVERA-CLSTR-VOL2    IOPS_w            0
20:01:01 SERVERA-CLSTR-VOL4    IOPS_r           42
20:01:01 SERVERA-CLSTR-VOL4    IOPS_t          176
20:01:01 SERVERA-CLSTR-VOL4    IOPS_w          134
20:01:01 SERVERA-CLSTR-VOL6    IOPS_r            0
20:01:01 SERVERA-CLSTR-VOL6    IOPS_t            0
20:01:01 SERVERA-CLSTR-VOL6    IOPS_w            0
20:01:01 SERVERA-CLSTR-VOL1    IOPS_r         1806
20:01:01 SERVERA-CLSTR-VOL1    IOPS_t         1806
20:01:01 SERVERA-CLSTR-VOL1    IOPS_w            0
20:01:01 SERVERA-CLSTR-VOL3    IOPS_r            8
20:01:01 SERVERA-CLSTR-VOL3    IOPS_t            8
20:01:01 SERVERA-CLSTR-VOL3    IOPS_w            0
20:01:01 SERVERA-CLSTR-VOL5    BAND_r            0
20:01:01 SERVERA-CLSTR-VOL5    BAND_t            0
20:01:01 SERVERA-CLSTR-VOL5    BAND_w            0
20:01:01 SERVERA-CLSTR-VOL7    BAND_r            0
20:01:01 SERVERA-CLSTR-VOL7    BAND_t            0
20:01:01 SERVERA-CLSTR-VOL7    BAND_w            0
20:01:01 SERVERA-CLSTR-VOL2    BAND_r            0
20:01:01 SERVERA-CLSTR-VOL2    BAND_t            0
20:01:01 SERVERA-CLSTR-VOL2    BAND_w            0
20:01:01 SERVERA-CLSTR-VOL4    BAND_r     0.669922
20:01:01 SERVERA-CLSTR-VOL4    BAND_t       9.3877
20:01:01 SERVERA-CLSTR-VOL4    BAND_w      8.71777
20:01:01 SERVERA-CLSTR-VOL6    BAND_r            0
20:01:01 SERVERA-CLSTR-VOL6    BAND_t            0
20:01:01 SERVERA-CLSTR-VOL6    BAND_w            0
20:01:01 SERVERA-CLSTR-VOL1    BAND_r      400.237
20:01:01 SERVERA-CLSTR-VOL1    BAND_t       400.24
How long does it take to parse the data?
$ time awk -f ./awk_host statvlun.out > batch1

real    2m25.767s
user    1m34.406s
sys     0m50.563s

$ wc -l statvlun.out
13659252 statvlun.out
Not bad for parsing 13,659,252 lines of text. However, I would then need to post-process this output to convert it into a format for InfluxDB, unlike the Python version where it is all done in one pass.
$ grep SERVERA batch1
20:01:01 SERVERA IOPS_r 1856
20:01:01 SERVERA IOPS_t 1990
20:01:01 SERVERA IOPS_w 134
20:01:01 SERVERA BAND_r 400.967
20:01:01 SERVERA BAND_t 409.705
20:01:01 SERVERA BAND_w 8.73926
20:01:01 SERVERA QLEN_q 329
20:01:01 SERVERA SVCT_r 201.94
20:01:01 SERVERA SVCT_t 350.34
20:01:01 SERVERA SVCT_w 410.13

vs.

$ time ./3par_hosts.py > batch2

real    3m14.091s
user    3m6.891s
sys     0m6.313s

$ grep -i SERVERA batch2
HostIO,HostIO=servera,type=iops value=1990 1505502061
HostBW,HostBW=servera,type=mb value=409 1505502061
HostRT,HostRT=servera,type=ms value=350.34 1505502061
HostIOSZ,HostIOSZ=servera,type=iosz value=228.0 1505502061
HostQ,HostQ=servera,type=qls value=329 1505502061
HostIO,HostIO=servera,type=riops value=1856 1505502061
HostBW,HostBW=servera,type=rmb value=400 1505502061
HostRT,HostRT=servera,type=rms value=201.94 1505502061
HostIOSZ,HostIOSZ=servera,type=riosz value=228.1 1505502061
HostIO,HostIO=servera,type=wiops value=134 1505502061
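If we did want to push the awk output into InfluxDB without the Python version, the post-processing step could look something like this rough sketch (GNU awk for mktime; the capture date is hard-coded because the parsed output only keeps the time of day, and the measurement/tag names here are illustrative rather than the exact ones 3par_hosts.py emits):

$ awk '$1 ~ /^[0-9:]+$/ {
    split($1, hms, ":")
    ts = mktime("2017 09 15 " hms[1] " " hms[2] " " hms[3])   # GNU awk only
    split($3, m, "_")                                         # e.g. IOPS_r -> IOPS, r
    printf("Host%s,host=%s,type=%s value=%s %d\n", m[1], $2, m[2], $4, ts)
}' batch1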
For parsing historical data these tools do the job based on the standard output of the statvlun command, however for real-time use we need something faster. For that we can run statvlun against a specific host with -host, or use the -hostsum option to aggregate the volumes per host; if we are looking at volume performance, -vvsum gives the aggregated data. Some example invocations are shown below.
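For example (the single iteration and five-second interval here are illustrative, not a recommendation):

# statvlun -rw -ni -host SERVERA -iter 1 -d 5   # a single host's VLUNs
# statvlun -rw -ni -hostsum -iter 1 -d 5        # one summary row per host
# statvlun -rw -ni -vvsum -iter 1 -d 5          # one summary row per volume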
What does this look like when we have it in an InfluxDB database?
And below is the same view if we wanted to look at multi-volume stats.
In later posts I will make the Grafana dashboard available so you can create the same views. If you want a head start, the scripts are on my GitHub page.
3par Performance Scripts