3PAR Performance Monitoring – Hosts & Volumes

Overview

In this post we will take a look at hosts and volumes. The same metrics are available for both, so to save duplicating the effort across two posts they are combined here. We can gather details on hosts and volumes using the standard statvlun command, and we can also pass in a few options to give a consolidated view of the metrics.

  • Cache 
    • statcmp
  • Cpu
    • statcpu
  • Hosts
    • statvlun / statvlun -hostsum / -vvsum
  • Ports
    • statport -host
  • Volumes
    • statvlun -vvsum

Getting started

How do I run the statvlun command? Below shows how to set up a passfile so we don't need to input a password for each command.

# PATH=$PATH:/opt/hp_3par_cli/bin/
# setpassword -saveonly -file mypass.your3par 
system: your3par 
user:  
password:

Hosts

Host performance metrics are very detailed. From the statvlun output we have the following header, which details what we can utilise.

             20:01:01 12/07/2017 r/w I/O per second    KBytes per sec        Svt ms     IOSz KB
Lun        VVname     Host  Port      Cur  Avg  Max   Cur   Avg   Max    Cur    Avg   Cur   Avg Qlen

Let's run through what these are:

  • Lun: LUN number
  • VVname: Volume name
  • Host: Host name
  • Port: Array port being accessed

r/w I/O per second

  • Cur: Current IOPS
  • Avg: Average IOPS during the sample period
  • Max: Maximum IOPS during the sample period

KBytes per sec

  • Cur: Current KBytes/sec
  • Avg: Average KBytes/sec during the sample period
  • Max: Maximum KBytes/sec during the sample period

Svt ms 

  • Cur: Current service time in milliseconds
  • Avg: Average service time in milliseconds during the sample period

IOSz KB – I/O size

  • Cur: Current I/O size in KB
  • Avg: Average I/O size in KB during the sample period

Qlen

  • Qlen: Length of the volume's queue

So what metrics are we interested in? Well, it depends, but I take the SUM of all current IOPS/KBytes and Qlen values. I also take the MAX of the service time and I/O size from the current sample.

Why do I take the maximum? If we are looking at many volumes being accessed by many hosts, say in a VMware setup, I want to know who is having the worst experience rather than the average over them all.
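As a toy illustration of that aggregation policy (the numbers below are made up and the layout is simplified to name/IOPS/Svt), we sum the current IOPS but keep only the worst service time:

```shell
# Toy data: volume, Cur IOPS, Cur Svt (ms) - values are made up
printf '%s\n' 'VOL1 904 157.14' 'VOL3 4 126.05' 'VOL4 88 312.94' |
awk '{ iops += $2; if ($3 > svt) svt = $3 }   # sum IOPS, keep max Svt
     END { printf "total IOPS=%d worst Svt=%.2f ms\n", iops, svt }'
# -> total IOPS=996 worst Svt=312.94 ms
```

The 312.94ms is the figure that tells us someone is suffering; averaging it against the idle volumes would hide it.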

statvlun command output

Below we have some example output which has been truncated to a single sample.

Command run was statvlun -rw -ni -iter 320 -d 45
           		       20:01:01 09/15/2017 r/w I/O per second           KBytes per sec  Svt ms   IOSz KB
Lun         VVname     		   Host  Port           Cur  Avg   Max   Cur     Avg     Max   Cur    Avg    Cur   Avg   Qlen
 0           VOL1                  SERVERA 3:5:1   r    904    904  904  204147  204147 204147 157.14 157.14 225.9 225.9    -
 0           VOL1                  SERVERA 3:5:1   w      0      0    0       2       2      2   8.72   8.72   8.2   8.2    -
 0           VOL1                  SERVERA 3:5:1   t    904    904  904  204148  204148 204148 157.11 157.11 225.9 225.9  117
 1           VOL2                  SERVERA 3:5:1   r      0      0    0       0       0      0   0.00   0.00   0.0   0.0    -
 1           VOL2                  SERVERA 3:5:1   w      0      0    0       0       0      0  10.87  10.87   1.0   1.0    -
 1           VOL2                  SERVERA 3:5:1   t      0      0    0       0       0      0  10.87  10.87   1.0   1.0    0
 2           VOL3                  SERVERA 3:5:1   r      4      4    4      32      32     32 126.05 126.05   9.1   9.1    -
 2           VOL3                  SERVERA 3:5:1   w      0      0    0       9       9      9  17.97  17.97  19.4  19.4    -
 2           VOL3                  SERVERA 3:5:1   t      4      4    4      41      41     41 113.53 113.53  10.3  10.3    1
 3           VOL4                  SERVERA 3:5:1   r     21     21   21     338     338    338 123.95 123.95  16.1  16.1    -
 3           VOL4                  SERVERA 3:5:1   w     67     67   67    4478    4478   4478 372.72 372.72  67.3  67.3    -
 3           VOL4                  SERVERA 3:5:1   t     88     88   88    4816    4816   4816 312.94 312.94  55.0  55.0    3
 4           VOL5                  SERVERA 3:5:1   r      0      0    0       0       0      0   0.00   0.00   0.0   0.0    -
 4           VOL5                  SERVERA 3:5:1   w      0      0    0       0       0      0   0.00   0.00   0.0   0.0    -
 4           VOL5                  SERVERA 3:5:1   t      0      0    0       0       0      0   0.00   0.00   0.0   0.0    0
 5           VOL6                  SERVERA 3:5:1   r      0      0    0       0       0      0   0.00   0.00   0.0   0.0    -
 5           VOL6                  SERVERA 3:5:1   w      0      0    0       0       0      0   0.00   0.00   0.0   0.0    -
 5           VOL6                  SERVERA 3:5:1   t      0      0    0       0       0      0   0.00   0.00   0.0   0.0    0
 6           VOL7                  SERVERA 3:5:1   r      0      0    0       0       0      0   0.00   0.00   0.0   0.0    -
 6           VOL7                  SERVERA 3:5:1   w      0      0    0       0       0      0   0.00   0.00   0.0   0.0    -
 6           VOL7                  SERVERA 3:5:1   t      0      0    0       0       0      0   0.00   0.00   0.0   0.0    0
 0           VOL1                  SERVERA 2:4:2   r    902    902  902  205696  205696 205696 201.94 201.94 228.1 228.1    -
 0           VOL1                  SERVERA 2:4:2   w      0      0    0       2       2      2  28.24  28.24   9.2   9.2    -
 0           VOL1                  SERVERA 2:4:2   t    902    902  902  205698  205698 205698 201.91 201.91 228.0 228.0  205
 

In this example a few things jump out. One is that we are queuing: we have a queue length of 117 and 205 down the two ports, which is pushing our service time above 200ms.

But why? The answer is pretty easy in this example: the server is reading 200MB/sec from both ports, and we know that SERVERA is connected to the SAN at 2Gb, where 200MB/sec is the maximum the HBA can sustain in one direction.

So why does the latency increase?

When the array is sending data to the server, fibre channel uses a credit-based flow-control mechanism called buffer-to-buffer credits. As fibre channel is a lossless protocol we don't drop frames; instead we queue and wait for a free buffer before sending the next part of our data. Think about what happens when we issue a write: we need to wait for the ACK from the array, and that ACK joins the back of the queue on the read path from array to host, which increases latency.

OK, so why the queue?

The queue builds up because the array can transmit data faster than the server can receive it. Most arrays are connected at 8/16Gb, so they can transfer data 4-8x faster than an HBA connected at 2Gb; this is an oversubscription issue where we have one fast port and one slow port.
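The back-of-the-envelope numbers: 1/2/4/8Gb fibre channel uses 8b/10b encoding, so every byte costs 10 bits on the wire and each "Gb" of link speed buys roughly 100MB/sec of payload per direction. A quick sanity check:

```shell
# 8b/10b encoded FC: g Gb/s is roughly g * 100 MB/s of payload per direction
awk 'BEGIN { for (g = 2; g <= 8; g *= 2)
               printf "%dGb FC ~ %d MB/s per direction\n", g, g * 100 }'
# -> 2Gb FC ~ 200 MB/s per direction
#    4Gb FC ~ 400 MB/s per direction
#    8Gb FC ~ 800 MB/s per direction
```

(16Gb FC moved to 64b/66b encoding, so it lands nearer 1600MB/sec rather than following the 10-bits-per-byte rule.)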

**What to track: high queuing / service times / bandwidth (make sure you know the speed of your servers' HBAs and whether they are getting close to maximum).**

An awk-based parsing script

We can parse this data pretty easily using awk, creating associative arrays keyed on what we are looking for.

BEGIN {
  printf("%8s %-22s %1s \t %10s\n", "Time", "Host", "Type", "Value");
}
{
  # The header line of each sample carries the timestamp in column 1
  if ($0 ~ /KBytes/) {
    t = $1
  }

  # Data rows have the array port (n:s:p) in column 4
  if ($4 ~ /[0-9]:[0-9]:[0-9]/) {
    # Sum current IOPS ($6) and KBytes/sec ($9) per host and r/w/t type
    iops[$3"\t"$5] += $6;
    bw[$3"\t"$5]   += $9;

    # Track the worst (maximum) service time ($12) and I/O size ($14)
    if ($12 > svc[$3"\t"$5]) {
      svc[$3"\t"$5] = $12;
    }
    if ($14 > ios[$3"\t"$5]) {
      ios[$3"\t"$5] = $14;
    }

    # Queue length is only reported on the total (t) rows
    if ($5 == "t") {
      qu[$3"\tq"] += $NF;
    }
  }

  # A blank line ends the sample: print each array and reset it
  if ($0 ~ /^$/) {
    for (io in iops) {
      split(io, iop, "\t")
      printf("%8s %-22s IOPS_%1s \t %10s\n", t, iop[1], iop[2], iops[io]);
      delete iops[io];
    }
    for (b in bw) {
      split(b, mb, "\t")
      printf("%8s %-22s BAND_%1s \t %10s\n", t, mb[1], mb[2], bw[b]/1024);
      delete bw[b]
    }
    for (q in qu) {
      split(q, qln, "\t")
      printf("%8s %-22s QLEN_%1s \t %10s\n", t, qln[1], qln[2], qu[q]);
      delete qu[q]
    }
    for (s in svc) {
      split(s, lat, "\t")
      printf("%8s %-22s SVCT_%1s \t %10s\n", t, lat[1], lat[2], svc[s]);
      delete svc[s]
    }
    for (i in ios) {
      split(i, isz, "\t")
      printf("%8s %-22s IOSZ_%1s \t %10s\n", t, isz[1], isz[2], ios[i]);
      delete ios[i]
    }
  }
}

We can also use this script with a file by doing the following.

# awk -f ./awk_host statvlun.out |more
Time     Host                  Type           Value
20:01:01 SERVERA               IOPS_r          1856
20:01:01 SERVERA               IOPS_t          1990
20:01:01 SERVERA               IOPS_w           134
20:01:01 SERVERA               BAND_r       400.967
20:01:01 SERVERA               BAND_t       409.705
20:01:01 SERVERA               BAND_w       8.73926
20:01:01 SERVERA               QLEN_q           329
20:01:01 SERVERA               SVCT_r        201.94
20:01:01 SERVERA               SVCT_t        350.34
20:01:01 SERVERA               SVCT_w        410.13
20:02:08 SERVERA               IOPS_r          2152
20:02:08 SERVERA               IOPS_t          2158
20:02:08 SERVERA               IOPS_w             6
20:02:08 SERVERA               BAND_r        398.93
20:02:08 SERVERA               BAND_t       399.664
20:02:08 SERVERA               BAND_w      0.733398
20:02:08 SERVERA               QLEN_q           230
20:02:08 SERVERA               SVCT_r        136.81
20:02:08 SERVERA               SVCT_t        136.81
20:02:08 SERVERA               SVCT_w          7.52
.
.
.
20:23:42 SERVERA               BAND_r       83.8418
20:23:42 SERVERA               BAND_t       127.925
20:23:42 SERVERA               BAND_w        44.082
20:23:42 SERVERA               QLEN_q             0
20:23:42 SERVERA               SVCT_r          6.39
20:23:42 SERVERA               SVCT_t          7.06
20:23:42 SERVERA               SVCT_w          9.88

I think it's good to look at the above and see what happens to the response times when we are not queuing or maxing out our HBAs at the server side. We can see sub-10ms response times as a maximum across all the volumes allocated to the host.

For volumes we just replace $3 with $2 in the awk statements, keying on VVname rather than Host.
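The substitution itself is mechanical; applied to one line of the script as a demo:

```shell
# Swap the key from Host ($3) to VVname ($2) in one of the script's lines
echo 'iops[$3"\t"$5]+=$6;' | sed 's/\$3/\$2/'
# -> iops[$2"\t"$5]+=$6;
```

In practice something like `sed 's/\$3/\$2/g' ./awk_host > ./awk_vol` produces the whole volume variant in one shot (awk_vol is just a name chosen here, not from the original scripts).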

20:01:01                       SERVERA-CLSTR-VOL5 IOPS_r                  0
20:01:01                       SERVERA-CLSTR-VOL5 IOPS_t                  0
20:01:01                       SERVERA-CLSTR-VOL5 IOPS_w                  0
20:01:01                       SERVERA-CLSTR-VOL7 IOPS_r                  0
20:01:01                       SERVERA-CLSTR-VOL7 IOPS_t                  0
20:01:01                       SERVERA-CLSTR-VOL7 IOPS_w                  0
20:01:01                       SERVERA-CLSTR-VOL2 IOPS_r                  0
20:01:01                       SERVERA-CLSTR-VOL2 IOPS_t                  0
20:01:01                       SERVERA-CLSTR-VOL2 IOPS_w                  0
20:01:01                       SERVERA-CLSTR-VOL4 IOPS_r                 42
20:01:01                       SERVERA-CLSTR-VOL4 IOPS_t                176
20:01:01                       SERVERA-CLSTR-VOL4 IOPS_w                134
20:01:01                       SERVERA-CLSTR-VOL6 IOPS_r                  0
20:01:01                       SERVERA-CLSTR-VOL6 IOPS_t                  0
20:01:01                       SERVERA-CLSTR-VOL6 IOPS_w                  0
20:01:01                       SERVERA-CLSTR-VOL1 IOPS_r               1806
20:01:01                       SERVERA-CLSTR-VOL1 IOPS_t               1806
20:01:01                       SERVERA-CLSTR-VOL1 IOPS_w                  0
20:01:01                       SERVERA-CLSTR-VOL3 IOPS_r                  8
20:01:01                       SERVERA-CLSTR-VOL3 IOPS_t                  8
20:01:01                       SERVERA-CLSTR-VOL3 IOPS_w                  0
20:01:01                       SERVERA-CLSTR-VOL5 BAND_r                  0
20:01:01                       SERVERA-CLSTR-VOL5 BAND_t                  0
20:01:01                       SERVERA-CLSTR-VOL5 BAND_w                  0
20:01:01                       SERVERA-CLSTR-VOL7 BAND_r                  0
20:01:01                       SERVERA-CLSTR-VOL7 BAND_t                  0
20:01:01                       SERVERA-CLSTR-VOL7 BAND_w                  0
20:01:01                       SERVERA-CLSTR-VOL2 BAND_r                  0
20:01:01                       SERVERA-CLSTR-VOL2 BAND_t                  0
20:01:01                       SERVERA-CLSTR-VOL2 BAND_w                  0
20:01:01                       SERVERA-CLSTR-VOL4 BAND_r           0.669922
20:01:01                       SERVERA-CLSTR-VOL4 BAND_t             9.3877
20:01:01                       SERVERA-CLSTR-VOL4 BAND_w            8.71777
20:01:01                       SERVERA-CLSTR-VOL6 BAND_r                  0
20:01:01                       SERVERA-CLSTR-VOL6 BAND_t                  0
20:01:01                       SERVERA-CLSTR-VOL6 BAND_w                  0
20:01:01                       SERVERA-CLSTR-VOL1 BAND_r            400.237
20:01:01                       SERVERA-CLSTR-VOL1 BAND_t             400.24

How long does it take to parse the data?

$ time awk -f ./awk_host statvlun.out > batch1

real    2m25.767s
user    1m34.406s
sys     0m50.563s

$ wc -l statvlun.out
13659252 statvlun.out

Not bad for parsing 13,659,252 lines of text; however, I would then need to post-process this to convert it into a format for InfluxDB, unlike the python version where this is all done at once.

$ grep SERVERA batch1 
20:01:01 SERVERA               IOPS_r          1856
20:01:01 SERVERA               IOPS_t          1990
20:01:01 SERVERA               IOPS_w           134
20:01:01 SERVERA               BAND_r       400.967
20:01:01 SERVERA               BAND_t       409.705
20:01:01 SERVERA               BAND_w       8.73926
20:01:01 SERVERA               QLEN_q           329
20:01:01 SERVERA               SVCT_r        201.94
20:01:01 SERVERA               SVCT_t        350.34
20:01:01 SERVERA               SVCT_w        410.13

vs.
$ time ./3par_hosts.py >batch2

real    3m14.091s
user    3m6.891s
sys     0m6.313s

$ grep -i SERVERA batch2 
HostIO,HostIO=servera,type=iops value=1990 1505502061
HostBW,HostBW=servera,type=mb value=409 1505502061
HostRT,HostRT=servera,type=ms value=350.34 1505502061
HostIOSZ,HostIOSZ=servera,type=iosz value=228.0 1505502061
HostQ,HostQ=servera,type=qls value=329 1505502061
HostIO,HostIO=servera,type=riops value=1856 1505502061
HostBW,HostBW=servera,type=rmb value=400 1505502061
HostRT,HostRT=servera,type=rms value=201.94 1505502061
HostIOSZ,HostIOSZ=servera,type=riosz value=228.1 1505502061
HostIO,HostIO=servera,type=wiops value=134 1505502061
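For completeness, the awk output could be massaged into the same line protocol. A hedged sketch (measurement and tag names copied from the python output above; the epoch timestamp is left off because the summarised awk output only carries HH:MM:SS, so InfluxDB would stamp the points at write time instead):

```shell
# Convert "Time Host Type Value" rows into InfluxDB line protocol.
# Only IOPS rows are handled here; BAND/SVCT/QLEN follow the same pattern.
# Input is a sample of the awk output from above.
printf '%s\n' '20:01:01 SERVERA IOPS_r 1856' '20:01:01 SERVERA IOPS_t 1990' |
awk '{ sub(/^IOPS_/, "", $3);           # keep just the r/w/t suffix
       printf "HostIO,HostIO=%s,type=%siops value=%s\n",
              tolower($2), ($3 == "t" ? "" : $3), $4 }'
```

This maps the total (t) rows to type=iops and the read/write rows to riops/wiops, matching the python output shown above.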

Looking at parsing historical data, these tools do the job based on the standard output of the statvlun command; for real-time use, however, we need something faster. For that we can run statvlun against a specific host with -host, or use the -hostsum option to aggregate the volumes per host. If we are looking at volume performance, the -vvsum option gives aggregated data.

What does this look like when we have it in an InfluxDB database?

And below is the same view if we wanted to look at multi-volume stats.

In later posts I will make the Grafana dashboard available so you can create the same views. If you want a head start, the scripts are on my GitHub page.
3par Performance Scripts
