Using Grafana & InfluxDB to view XIV Host Performance Metrics – Part 3 Volumes & Host Ports

Overview

When I started writing this blog I realised that I had more than one parsing script, depending on the statistics I was looking to gather, so I have now merged these into four wrapper scripts and a single Perl script.

This post is largely a copy of the Hosts post but for Volumes, with new material on Host Ports towards the end.

Volumes

As with hosts, we can query the list of volumes from the storage array, iterate through it and gather our stats. We collect the same metrics as we do for hosts; the only change is that in names such as HostIO and HostBW, Host becomes Vol. An example can be seen in the IOPS section below.

There are a few components to the view, which I will detail below.

IOPS

This is how many input/output operations we are doing per second. It is broken down into three categories: reads, writes and total.

The following InfluxQL (Influx's SQL-like query language) is used throughout the dashboard.

SELECT "value" FROM "XIVVolIO" WHERE "type" = 'riops' AND "VolIO" =~ /$HOST/ AND $timeFilter GROUP BY VolIO
SELECT "value" FROM "XIVVolIO" WHERE "type" = 'wiops' AND "VolIO" =~ /$HOST/ AND $timeFilter GROUP BY VolIO
SELECT "value" FROM "XIVVolIO" WHERE "type" = 'tiops' AND "VolIO" =~ /$HOST/ AND $timeFilter GROUP BY VolIO

Breaking this down, we select value, which is the field name we used when writing the data, i.e.

XIVVolIO,VolIO=servera,type=tiops value=6 1508886060

We also specify the type, which here is riops (read IOPS), wiops (write IOPS) or tiops (total IOPS). We tag the volume as VolIO (HostIO in the Hosts post), which also allows us to use the same query syntax when looking at the whole array.
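To make the breakdown concrete, here is a minimal Python sketch (not part of the author's Perl script) that splits the sample point above into its measurement, tags, field and timestamp. The function name `parse_point` is hypothetical, and it assumes simple lines like the ones in this post: one field, no escaped commas or spaces.

```python
def parse_point(line):
    """Split one simple line-protocol line into its parts.

    Assumes a single field and no escaped commas or spaces, which holds
    for the points shown in this post but not for line protocol in general.
    """
    # Three space-separated sections: series key, field, timestamp.
    series, field, timestamp = line.split(" ")
    measurement, *tag_pairs = series.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    field_key, field_value = field.split("=", 1)
    return {
        "measurement": measurement,
        "tags": tags,
        "field": field_key,
        "value": float(field_value),
        "timestamp": int(timestamp),
    }


point = parse_point("XIVVolIO,VolIO=servera,type=tiops value=6 1508886060")
print(point["measurement"], point["tags"], point["value"], point["timestamp"])
```

The tags (VolIO and type) are what the WHERE and GROUP BY clauses of the queries work on, while value is the number that gets plotted.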

The regex match uses a Grafana template variable (written as $HOST in the queries above; you can name it $Vol as long as the query and the variable agree), which is populated by a query against Influx. This allows us to select multiple volumes and compare them.

BW

Bandwidth is the amount of data we are reading from or writing to the array. Again this is broken down into reads, writes and totals, and is measured in megabytes/sec.

You will notice the query is similar to the IOPS one, except that we select a type of rbandw, wbandw or tbandw.

SELECT "value" FROM "XIVVolBW" WHERE "type" = 'rbandw' AND "VolBW" =~ /$HOST/ AND $timeFilter GROUP BY VolBW
SELECT "value" FROM "XIVVolBW" WHERE "type" = 'wbandw' AND "VolBW" =~ /$HOST/ AND $timeFilter GROUP BY VolBW
SELECT "value" FROM "XIVVolBW" WHERE "type" = 'tbandw' AND "VolBW" =~ /$HOST/ AND $timeFilter GROUP BY VolBW

Latency

Latency is measured in milliseconds, and we break this down into read (rlat) and write (wlat) latency.

SELECT "value" FROM "XIVVolRT" WHERE "type" = 'rlat' AND "VolRT" =~ /$HOST/ AND $timeFilter GROUP BY VolRT
SELECT "value" FROM "XIVVolRT" WHERE "type" = 'wlat' AND "VolRT" =~ /$HOST/ AND $timeFilter GROUP BY VolRT

Hit & Miss Statistics

Please see the previous post for Hit and Miss stats. A quick overview as follows.

Read Hit – Read from Memory
Read Miss – Read must come from disk
Write Hit – Write is an overwrite of an array cache slot
Write Miss – We write data to a new cache slot

Read/Write Hit and Miss IOPS

SELECT "value" FROM "XIVVolIORH" WHERE "type" = 'rhiops' AND "VolIO" =~ /$HOST/ AND $timeFilter GROUP BY VolIO
SELECT "value" FROM "XIVVolIORH" WHERE "type" = 'rmiops' AND "VolIO" =~ /$HOST/ AND $timeFilter GROUP BY VolIO
SELECT "value" FROM "XIVVolIORH" WHERE "type" = 'whiops' AND "VolIO" =~ /$HOST$/ AND $timeFilter GROUP BY VolIO
SELECT "value" FROM "XIVVolIORH" WHERE "type" = 'wmiops' AND "VolIO" =~ /$HOST$/ AND $timeFilter GROUP BY VolIO

So we select data from the XIVVolIORH measurement where the type is rhiops (read hit IOPS), rmiops (read miss IOPS), whiops (write hit IOPS) or wmiops (write miss IOPS).

** Key metric here is rmiops, which is the number of reads we need to get from physical disk for that volume. You could compare this across all volumes, and against the same metric for the host, to identify volumes causing slower access to data.
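One way to compare volumes is to turn rhiops and rmiops into a read miss ratio and rank the volumes by it. This is a Python sketch of that idea, not something the dashboard does itself; the function name, the volume names and the sample numbers are all hypothetical.

```python
def read_miss_ratio(rhiops, rmiops):
    """Fraction of read IOPS that had to come from disk (0.0 to 1.0)."""
    total_reads = rhiops + rmiops
    if total_reads == 0:
        return 0.0  # no reads in this interval, so nothing missed
    return rmiops / total_reads


# Hypothetical samples: volume name -> (rhiops, rmiops)
samples = {"vol_a": (900, 100), "vol_b": (200, 600)}

# Worst offenders first.
ranked = sorted(samples, key=lambda v: read_miss_ratio(*samples[v]), reverse=True)
for vol in ranked:
    print(vol, round(read_miss_ratio(*samples[vol]), 2))
```

A volume with modest total IOPS can still top this list, which is the case the raw rmiops graph makes harder to spot.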

Read/Write Hit and Miss BW

As with IOPS, we do the same for bandwidth.

SELECT "value" FROM "XIVVolBWHM" WHERE "type" = 'rmbandw' AND "VolBW" =~ /$HOST/ AND $timeFilter GROUP BY VolBW
SELECT "value" FROM "XIVVolBWHM" WHERE "type" = 'rhbandw' AND "VolBW" =~ /$HOST/ AND $timeFilter GROUP BY VolBW
SELECT "value" FROM "XIVVolBWHM" WHERE "type" = 'wmbandw' AND "VolBW" =~ /$HOST/ AND $timeFilter GROUP BY VolBW
SELECT "value" FROM "XIVVolBWHM" WHERE "type" = 'whbandw' AND "VolBW" =~ /$HOST/ AND $timeFilter GROUP BY VolBW

So we select data from the XIVVolBWHM measurement where the type is rmbandw (read miss MB), rhbandw (read hit MB), wmbandw (write miss MB) or whbandw (write hit MB).

** Key metric here is the read miss BW, as this shows the amount of data we need to read from physical disk for each volume selected.

Read/Write Hit and Miss Latency

SELECT "value" FROM "XIVVolRTHM" WHERE "type" = 'rhit' AND "VolRT" =~ /$HOST/ AND $timeFilter GROUP BY VolRT
SELECT "value" FROM "XIVVolRTHM" WHERE "type" = 'rmiss' AND "VolRT" =~ /$HOST/ AND $timeFilter GROUP BY VolRT
SELECT "value" FROM "XIVVolRTHM" WHERE "type" = 'whit' AND "VolRT" =~ /$HOST/ AND $timeFilter GROUP BY VolRT
SELECT "value" FROM "XIVVolRTHM" WHERE "type" = 'wmiss' AND "VolRT" =~ /$HOST/ AND $timeFilter GROUP BY VolRT

Again we are reading from XIVVolRTHM, where we have types of rhit (read hit), rmiss (read miss), whit (write hit) and wmiss (write miss) latencies.

Note how we GROUP BY the tag (VolRT, VolBW, HostRT, HostIO and so on); this allows us to have multiple hosts/ports/volumes selected from our dropdown, each drawn as its own series.

** Key metric here is rmiss latency; this is where you will see the pain if you are experiencing poor performance.
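All of the volume queries in this post share one shape: a measurement, a tag to filter and group on, and a type. The Python sketch below builds that string from the three parts, which can help when adding panels by hand. The helper `panel_query` and its `variable` parameter are hypothetical; Grafana substitutes $HOST and $timeFilter itself, so they are left in the string untouched.

```python
# Shared shape of the panel queries; $timeFilter is filled in by Grafana.
TEMPLATE = (
    'SELECT "value" FROM "{measurement}" '
    "WHERE \"type\" = '{kind}' AND \"{tag_key}\" =~ /${variable}/ "
    "AND $timeFilter GROUP BY {tag_key}"
)


def panel_query(measurement, tag_key, kind, variable="HOST"):
    """Return the InfluxQL text for one series in a Grafana panel."""
    return TEMPLATE.format(
        measurement=measurement, tag_key=tag_key, kind=kind, variable=variable
    )


# The four hit/miss latency series for volumes.
for kind in ("rhit", "rmiss", "whit", "wmiss"):
    print(panel_query("XIVVolRTHM", "VolRT", kind))
```

Grouping by the same tag that the regex filters on is what keeps each selected volume on its own line in the graph.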

Templating

See previous post for setting up the Volume templates.

The Completed View

Scroll down and we get the rest of the view.

Volume Details

I will leave it up to the reader whether to add Hit and Miss statistics for the volumes. Above we can see an example from array XIVTEST with two volumes selected.

The JSON template for the Grafana XIV Host Dashboard is located in my GitHub repository.

Host Ports

What are host ports? These are the HBAs (Host Bus Adapters) that are used to connect to the storage area network (SAN) and allow data to be read from and written to storage arrays.

Host Ports in our context are Fibre Channel (FC) and come in different speeds: 2/4/8/16Gbit.

When we install HBAs into servers we always have two ports. This makes sure that if one fails we don't lose all access to our disk while a replacement is found; it is a simple n+1 strategy.

Speeds

When looking at host ports it is always good to have a rule of thumb about how much bandwidth we can send/receive. If we work on a minimum of two in each server we should know when we are maxing out.

Single HBA Port
2Gb = 200MB/sec
4Gb = 400MB/sec
8Gb = 800MB/sec
16Gb = 1.6GB/sec

Hold on… most people would say FC is measured in gigabits, so we should divide the speed by 8. Should we not get 250MB/sec for a single 2Gb port?

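Well, in practice we rarely see a 2Gb HBA being able to push more than 200MB/sec, partly because FC up to 8Gb uses 8b/10b encoding and so spends roughly 10 bits on the wire for every byte of data. That is why I have supplied these figures, and they are also very easy to remember.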

So with two ports we just double the MB to get the max, so a pair of 2Gb ports would be 400MB/sec. As with a network, FC has rx and tx and is full duplex, so we can push that amount in each direction, reads and writes, at the same time.
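The rule of thumb is simple enough to write down as arithmetic. This Python sketch reproduces the figures in the list above at roughly 100MB/sec per Gbit of port speed and doubles them for a two-port server; the constant and function names are hypothetical.

```python
# Rule of thumb from the list above: about 100 MB/sec per Gbit of port speed.
MB_PER_SEC_PER_GBIT = 100


def max_mb_per_sec(port_speed_gbit, ports=2):
    """Approximate ceiling in MB/sec, in one direction, across all HBA ports."""
    return port_speed_gbit * MB_PER_SEC_PER_GBIT * ports


for speed in (2, 4, 8, 16):
    print(f"{speed}Gb x 2 ports: {max_mb_per_sec(speed)} MB/sec")
```

If the total BW graph for a host sits near this figure, the HBAs rather than the array are the likely limit.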

Gathering the Host Port Stats

OK, so as with all the stats we just run the wrapper script against our array, i.e.

./xiv_hba <ARRAY> <DAY> <MONTH>

As above, we then start to input the following data into Influx.

print "XIVHPortIO,PortHIO=$port,type=tiops value=" . $total_iops . " $ts\n";
print "XIVHPortIO,PortHIO=$port,type=wiops value=" . $total_writes . " $ts\n";
print "XIVHPortIO,PortHIO=$port,type=riops value=" . $total_reads . " $ts\n";
print "XIVHPortBW,PortHBW=$port,type=tbandw value=$total_mb $ts\n";
print "XIVHPortBW,PortHBW=$port,type=wbandw value=$total_write_mb $ts\n";
print "XIVHPortBW,PortHBW=$port,type=rbandw value=$total_read_mb $ts\n";
print "XIVHPortRT,PortHRT=$port,type=wlat value=$whplusm $ts\n";
print "XIVHPortRT,PortHRT=$port,type=rlat value=$rhplusm $ts\n";
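The eight print statements follow one pattern, so they can also be driven from a small table. The sketch below shows that idea in Python rather than the author's Perl; `SERIES`, `port_points` and the sample port name and values are hypothetical, while the measurement, tag and type names are taken from the lines above.

```python
# (measurement, tag key, type) in the same order as the print statements.
SERIES = [
    ("XIVHPortIO", "PortHIO", "tiops"),
    ("XIVHPortIO", "PortHIO", "wiops"),
    ("XIVHPortIO", "PortHIO", "riops"),
    ("XIVHPortBW", "PortHBW", "tbandw"),
    ("XIVHPortBW", "PortHBW", "wbandw"),
    ("XIVHPortBW", "PortHBW", "rbandw"),
    ("XIVHPortRT", "PortHRT", "wlat"),
    ("XIVHPortRT", "PortHRT", "rlat"),
]


def port_points(port, ts, values):
    """Return one line-protocol line per metric type for a single port.

    `values` maps a type name such as "tiops" to its number.
    """
    return [
        f"{measurement},{tag_key}={port},type={kind} value={values[kind]} {ts}"
        for measurement, tag_key, kind in SERIES
    ]


# Hypothetical sample for one port at one timestamp.
sample = {"tiops": 6, "wiops": 2, "riops": 4, "tbandw": 3, "wbandw": 1,
          "rbandw": 2, "wlat": 0.4, "rlat": 1.2}
lines = port_points("port1", 1508886060, sample)
print("\n".join(lines))
```

Keeping the single space before the timestamp in one place means every line ends up with the same three space-separated sections.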

When reading the data out into Grafana we read as follows.

SELECT "value" FROM "XIVHPortBW" WHERE "type" = 'tbandw' AND "PortHBW" =~ /$HOST$/ AND $timeFilter GROUP BY PortHBW
SELECT "value" FROM "XIVHPortIO" WHERE "type" = 'tiops' AND "PortHIO" =~ /$HOST$/ AND $timeFilter GROUP BY PortHIO
SELECT "value" FROM "XIVHPortRT" WHERE "type" = 'wlat' AND "PortHRT" =~ /$HOST$/ AND $timeFilter GROUP BY WriteLat,PortHRT
SELECT "value" FROM "XIVHPortRT" WHERE "type" = 'rlat' AND "PortHRT" =~ /$HOST$/ AND $timeFilter GROUP BY ReadLat,PortHRT

You will notice I am only looking at the totals here rather than a breakdown of the IOPS or BW. My reasoning is that I want to make sure I have a balance, i.e. both HBAs should be pushing similar IOPS and BW.

We have two HBAs in the image, shown in green and yellow, and we can see they are balanced, which is what we are looking for.
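Balance can also be put into a number rather than judged by eye. The Python sketch below takes the total IOPS (or BW) per HBA and reports the gap between the busiest and quietest port as a percentage of the busiest. The function name, port names and sample values are hypothetical, and what counts as an acceptable percentage is left to the reader.

```python
def imbalance_pct(totals):
    """Gap between the busiest and quietest port, as a % of the busiest."""
    busiest = max(totals.values())
    if busiest == 0:
        return 0.0  # idle host, nothing to compare
    return (busiest - min(totals.values())) / busiest * 100


# Hypothetical average tiops per HBA over the dashboard time range.
balanced = {"hba0": 5100, "hba1": 4900}
lopsided = {"hba0": 9800, "hba1": 200}

print(round(imbalance_pct(balanced), 1))
print(round(imbalance_pct(lopsided), 1))
```

A figure close to 100 means one port is carrying almost everything, which is the situation the graphs are there to catch.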

Imbalance

For a balanced load the HBAs need to be connected to the SAN and zoned to the array, which is what makes the disk visible. So if we see an imbalance we have the following checklist to follow.

  1. Check the server HBA has not failed.
  2. Check the switch port is online for the server HBA.
  3. Check zoning: the HBA must be zoned to the array.
  4. Check multi-pathing at the OS (should be round-robin).
  5. If VMware, check that the Round Robin path selection policy is not left at the default of 1000 I/Os per path; set it to 1 I/O down each path.

In general that is about it for host ports. I will cover off some more details on how these can affect performance after my UKOUG Tech17 talk in December.

 
