Using Grafana & InfluxDB to view XIV Host Performance Metrics – Part 4 Array Stats

Overview

This post covers overall array stats. We will create one dashboard to show the stats for the whole array, then break this down into a second dashboard focusing on individual array ports. The port view is needed to identify the workloads driving IOPS and BW, and it also helps show whether we are maintaining balance across ports and making the best use of the components.

Array stats

The stats we gather are exactly the same as in the previous posts, but this time as an overall array view.

This means that instead of individual host IOPS we have the total IOPS for the storage array. This helps us identify busy periods and see whether we are close to maxing out the bandwidth of our ports.

Take an example of a server zoned to 4 array ports. With the server and array attached at 8Gb this allows up to roughly 3.2GB/sec (4 x 800MB/sec). If we were pushing 2.3GB/sec at peak workload and had a failure within our SAN, say a switch failure where we lost two ports, we would exceed the capability of the remaining two array ports and have a serious performance issue.
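The failure arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name is made up for the example, and the per-port figure of 800MB/sec is the 8Gb value from the Speeds section later in this post.

```python
def surviving_bandwidth_mb(ports: int, failed_ports: int, per_port_mb: float) -> float:
    """Aggregate bandwidth (MB/sec) left after losing some array ports."""
    if failed_ports > ports:
        raise ValueError("cannot lose more ports than are zoned")
    return (ports - failed_ports) * per_port_mb


# Assumed figures: 4 zoned ports at ~800 MB/sec each (8Gb), 2.3 GB/sec peak.
peak_mb = 2300.0
healthy = surviving_bandwidth_mb(4, 0, 800.0)
degraded = surviving_bandwidth_mb(4, 2, 800.0)

print(f"healthy:  {healthy:.0f} MB/sec, peak fits: {peak_mb <= healthy}")
print(f"degraded: {degraded:.0f} MB/sec, peak fits: {peak_mb <= degraded}")
```

With all four ports the peak fits; with two ports lost it does not, which is the situation the array view is meant to warn us about.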

We also need to remember that in normal operation array ports are shared by many hosts, so this view also helps us identify the combined workloads, spot noisy neighbours, and determine what scope we have at peak for adding more workload.

IOPS

This is the total IOPS for the storage array and is broken down as follows.

Reads / Writes / Total / Read & Write Hits / Read and Write Miss IOPS

The following SQL (InfluxQL) is used throughout the dashboard.

SELECT "value" FROM "XIVArrayIO" WHERE "type" = 'riops' AND ArrayIO =~ /$DS/ AND $timeFilter GROUP BY ArrayIO
SELECT "value" FROM "XIVArrayIO" WHERE "type" = 'wiops' AND ArrayIO =~ /$DS/ AND $timeFilter GROUP BY ArrayIO
SELECT "value" FROM "XIVArrayIO" WHERE "type" = 'tiops' AND ArrayIO =~ /$DS/ AND $timeFilter GROUP BY ArrayIO

For hits and misses we use the following SQL.

SELECT "value" FROM "XIVArrayIORH" WHERE "type" = 'rmiops' AND ArrayIO =~ /$DS/ AND $timeFilter GROUP BY ArrayIO,rmiss
SELECT "value" FROM "XIVArrayIORH" WHERE "type" = 'wmiops' AND ArrayIO =~ /$DS/ AND $timeFilter GROUP BY ArrayIO,wmiss
SELECT "value" FROM "XIVArrayIORH" WHERE "type" = 'rhiops' AND ArrayIO =~ /$DS/ AND $timeFilter GROUP BY ArrayIO,rhit
SELECT "value" FROM "XIVArrayIORH" WHERE "type" = 'whiops' AND ArrayIO =~ /$DS/ AND $timeFilter GROUP BY ArrayIO,whit

BW

For the whole array's bandwidth we have the same breakdown as for IOPS.

Read MB / Write MB / Total MB / Read & Write Hit MB / Read and Write Miss MB

The SQL we use to get the data is as follows.

SELECT "value" FROM "XIVArrayBW" WHERE "type" = 'rbandw' AND ArrayBW =~ /$DS/ AND $timeFilter GROUP BY ArrayBW
SELECT "value" FROM "XIVArrayBW" WHERE "type" = 'wbandw' AND ArrayBW =~ /$DS/ AND $timeFilter GROUP BY ArrayBW
SELECT "value" FROM "XIVArrayBW" WHERE "type" = 'tbandw' AND ArrayBW =~ /$DS/ AND $timeFilter GROUP BY ArrayBW

For hits and misses we use the following SQL.

SELECT "value" FROM "XIVArrayBWHM" WHERE "type" = 'rmbandw' AND ArrayBW =~ /$DS/ AND $timeFilter GROUP BY ArrayBW,rmiss_mb
SELECT "value" FROM "XIVArrayBWHM" WHERE "type" = 'wmbandw' AND ArrayBW =~ /$DS/ AND $timeFilter GROUP BY ArrayBW,wmiss_mb
SELECT "value" FROM "XIVArrayBWHM" WHERE "type" = 'rhbandw' AND ArrayBW =~ /$DS/ AND $timeFilter GROUP BY ArrayBW,rhit_mb
SELECT "value" FROM "XIVArrayBWHM" WHERE "type" = 'whbandw' AND ArrayBW =~ /$DS/ AND $timeFilter GROUP BY ArrayBW,whit_mb

Latency

Latency is measured in milliseconds, and we break this down into read (rlat) and write (wlat) latency, plus read/write miss and read/write hit latency.

SELECT "value" FROM "XIVArrayRT" WHERE "type" = 'rlat' AND ArrayRT =~ /$DS/ AND $timeFilter GROUP BY ArrayRT
SELECT "value" FROM "XIVArrayRT" WHERE "type" = 'wlat' AND ArrayRT =~ /$DS/ AND $timeFilter GROUP BY ArrayRT

Read/Write Hit and Miss Latency.

SELECT "value" FROM "XIVArrayRTHM" WHERE "type" = 'rmiss' AND ArrayRT =~ /$DS/ AND $timeFilter GROUP BY ArrayRT,rmiss_ms
SELECT "value" FROM "XIVArrayRTHM" WHERE "type" = 'wmiss' AND ArrayRT =~ /$DS/ AND $timeFilter GROUP BY ArrayRT,wmiss_ms
SELECT "value" FROM "XIVArrayRTHM" WHERE "type" = 'rhit' AND ArrayRT =~ /$DS/ AND $timeFilter GROUP BY ArrayRT,rhit_ms
SELECT "value" FROM "XIVArrayRTHM" WHERE "type" = 'whit' AND ArrayRT =~ /$DS/ AND $timeFilter GROUP BY ArrayRT,whit_ms
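All of the basic queries above share one shape, differing only in measurement, tag and type. As a rough sketch (the helper name `build_query` is made up for this example and is not part of the dashboard template), the query text could be generated like this, leaving `$DS` and `$timeFilter` untouched for Grafana to substitute:

```python
def build_query(measurement: str, tag: str, metric_type: str) -> str:
    """Return the InfluxQL text for one Grafana panel query.

    $DS and $timeFilter are left as-is; Grafana fills them in at render time.
    """
    return (
        f'SELECT "value" FROM "{measurement}" '
        f"WHERE \"type\" = '{metric_type}' AND {tag} =~ /$DS/ "
        f"AND $timeFilter GROUP BY {tag}"
    )


# Print the two array latency queries shown above.
for metric_type in ("rlat", "wlat"):
    print(build_query("XIVArrayRT", "ArrayRT", metric_type))
```

The hit/miss queries add a second GROUP BY key, so they would need an extra argument; that is left out here to keep the sketch short.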

Array View

I have omitted the legends on the graphs so as not to show the array names; however, when looking at hit/miss statistics, yellow is write and green is read.

The JSON template, updated wrapper and parser for Array Performance Overview are located in my GitHub repository.

To get the stats we run the xiv_array wrapper, which calls statistics_get without any options (no Host/Vol etc.), for example:

./xiv_array XIVTEST 07 11

Array Ports

Many hosts connect to array ports as part of zoning, which allows disk to be allocated to servers, so we normally have each host utilising, say, 2 or 4 array ports depending on the storage subsystem. On a Gen3 XIV this can be as many as 6 array ports depending on the module count.

Connectivity and Balance

Now is probably a good time to mention balance again. As with host ports, where we want to make sure I/O is being sent out via both HBAs in a round-robin fashion, we also want to make sure that we balance the use of the array ports evenly across modules.

If we zone array ports to servers without thinking of balance, we can dig ourselves into a pretty deep hole when services start to push higher amounts of bandwidth and IOPS, leading to increased latency and poor overall performance. There are many Redbooks which cover best practices for connectivity, which I will leave as an exercise for the reader.

What the array port view gives us over the overall array view is a breakdown of where the increases in IOPS/BW/latency are, rather than an average across all ports. With this view we also include an all-hosts view, which again lets us look at high usage and identify which host or hosts are driving the workload.

We can then use the other views to look at the workload of the hosts in more detail over a longer period of time to map the workload characteristics.

Speeds

Again, it is good to have a reminder of the speeds of array ports and to know what the limits are, as cumulative workloads can also be a cause of saturation.

Single array port speeds to BW (MB/sec):
4Gb = 400MB/sec (XIV)
8Gb = 800MB/sec (XIV)
16Gb = 1.6GB/sec (A9000R)
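To illustrate how cumulative workloads eat into these limits, here is a small Python sketch. The host names and MB/sec figures are hypothetical; only the per-port limits come from the list above.

```python
from typing import Mapping

# Per-port limits in MB/sec, taken from the list above.
PORT_LIMIT_MB = {"4Gb": 400.0, "8Gb": 800.0, "16Gb": 1600.0}


def port_utilisation(host_mb: Mapping[str, float], link_speed: str) -> float:
    """Percentage of one port's bandwidth used by all hosts sharing it."""
    total = sum(host_mb.values())
    return 100.0 * total / PORT_LIMIT_MB[link_speed]


# Hypothetical hosts sharing a single 8Gb array port (values are made up).
hosts_on_port = {"hostA": 310.0, "hostB": 270.0, "hostC": 140.0}
util = port_utilisation(hosts_on_port, "8Gb")
print(f"port utilisation: {util:.1f}%")
```

No single host here is anywhere near the port limit, yet together they leave very little headroom.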

Gathering the Array Port Stats

OK, so to gather the stats we just run the wrapper script against our array, for example:

./xiv_port XIVTEST 07 11

As with our other examples we then start to input the following data into Influx.

print "XIVPortIO,PortIO=$port,type=tiops value=" . $total_iops . " $ts\n";
print "XIVPortIO,PortIO=$port,type=wiops value=" . $total_writes . " $ts\n";
print "XIVPortIO,PortIO=$port,type=riops value=" . $total_reads . " $ts\n";
print "XIVPortBW,PortBW=$port,type=tbandw value=$total_mb $ts\n";
print "XIVPortBW,PortBW=$port,type=wbandw value=$total_write_mb $ts\n";
print "XIVPortBW,PortBW=$port,type=rbandw value=$total_read_mb $ts\n";
print "XIVPortRT,PortRT=$port,type=wlat value=$whplusm $ts\n";
print "XIVPortRT,PortRT=$port,type=rlat value=$rhplusm $ts\n";
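For readers who prefer Python to Perl, the same record layout can be sketched as below. This is not the parser from the repository; `format_point` is a made-up helper that simply reproduces the measurement, tag, value and timestamp order used in the print statements above.

```python
def format_point(measurement, tag_key, port, metric_type, value, ts):
    """Build one record in the same layout as the Perl print statements."""
    return f"{measurement},{tag_key}={port},type={metric_type} value={value} {ts}"


# Example values only; the real parser takes these from the array statistics.
ts = 1510704000
print(format_point("XIVPortIO", "PortIO", "4:4", "tiops", 2076, ts))
print(format_point("XIVPortBW", "PortBW", "4:4", "tbandw", 52.673, ts))
```

Note the single space before the timestamp: without it the value and timestamp run together and the record is no longer valid.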

The resulting data looks like the following prior to being inserted into the database

XIVPortIO,PortIO=4:4,type=tiops value=2076 1510704000
XIVPortIO,PortIO=4:4,type=riops value=843 1510704000
XIVPortIO,PortIO=4:4,type=wiops value=1233 1510704000
XIVPortIORH,PortIO=4:4,type=rmiops value=653 1510704000
XIVPortIORH,PortIO=4:4,type=wmiops value=90 1510704000
XIVPortIORH,PortIO=4:4,type=rhiops value=190 1510704000
XIVPortIORH,PortIO=4:4,type=whiops value=1143 1510704000
XIVPortBW,PortBW=4:4,type=tbandw value=52.673 1510704000
XIVPortBW,PortBW=4:4,type=rbandw value=21.964 1510704000
XIVPortBW,PortBW=4:4,type=wbandw value=30.709 1510704000
XIVPortBWHM,PortBW=4:4,type=rmbandw value=16.286 1510704000
XIVPortBWHM,PortBW=4:4,type=wmbandw value=1.348 1510704000
XIVPortBWHM,PortBW=4:4,type=rhbandw value=5.678 1510704000
XIVPortBWHM,PortBW=4:4,type=whbandw value=29.361 1510704000
XIVPortRT,PortRT=4:4,type=wlat value=1.38339416058394 1510704000
XIVPortRT,PortRT=4:4,type=rlat value=0.623431791221827 1510704000
XIVPortRTHM,PortRT=4:4,type=wmiss value=1.06836666666667 1510704000
XIVPortRTHM,PortRT=4:4,type=whit value=1.40819947506562 1510704000
XIVPortRTHM,PortRT=4:4,type=rmiss value=0.720883614088821 1510704000
XIVPortRTHM,PortRT=4:4,type=rhit value=0.288505263157895 1510704000
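A quick sanity check on records like these is that hits plus misses add up to the read and write totals, and reads plus writes add up to the total IOPS. The sketch below checks this for the port 4:4 sample above; `parse_point` is a hypothetical helper that only handles the simple record shape used in this post, not the full line protocol.

```python
def parse_point(line: str) -> dict:
    """Split one record of the shape 'measurement,tags value=N timestamp'."""
    series, field, ts = line.split(" ")
    measurement, *tag_pairs = series.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    value = float(field.split("=", 1)[1])
    return {"measurement": measurement, "tags": tags, "value": value, "ts": int(ts)}


sample = """\
XIVPortIO,PortIO=4:4,type=tiops value=2076 1510704000
XIVPortIO,PortIO=4:4,type=riops value=843 1510704000
XIVPortIO,PortIO=4:4,type=wiops value=1233 1510704000
XIVPortIORH,PortIO=4:4,type=rmiops value=653 1510704000
XIVPortIORH,PortIO=4:4,type=wmiops value=90 1510704000
XIVPortIORH,PortIO=4:4,type=rhiops value=190 1510704000
XIVPortIORH,PortIO=4:4,type=whiops value=1143 1510704000
"""

# Index the values by their "type" tag for easy comparison.
by_type = {p["tags"]["type"]: p["value"] for p in map(parse_point, sample.splitlines())}

print(by_type["rhiops"] + by_type["rmiops"] == by_type["riops"])
print(by_type["whiops"] + by_type["wmiops"] == by_type["wiops"])
print(by_type["riops"] + by_type["wiops"] == by_type["tiops"])
```

All three checks hold for the sample data, which is a cheap way to confirm the parser is not mixing up columns.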

I won't go into metric detail as it is the same as for the Array/Host etc., so let's take a look at the view.

In this image of the overall view we have:
Ports – total IOPS per Port
IOPS – Sum of all ports IOPS & Read/Write Breakdown per Port
Latency – Per Port Read/Write latency
BW – Total & Breakdown of ports BW Read/Write
Hosts – Breakdown of all hosts on the array IOPS/BW/Latency by Total/Read/Write.

Taking an example from the data in the above image, we can see that we are balanced across ports 1:1/2:2/3:3/4:4, which are four ports used in a specific pattern for host allocations. Looking at the other pattern, we also see a good balance in the amount of IOPS.

The key thing when viewing these charts is to look for all of the lines in the zoning patterns you use to be close together, which shows balance. An imbalance would point to hosts being incorrectly set up for multi-pathing or zoning.
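"Close together" can also be put into a number. The sketch below is one possible way to do that, assuming we already have a per-port IOPS figure for each port in a zoning pattern; the IOPS values and the 10% threshold are made up for illustration and are not recommendations.

```python
from statistics import mean


def imbalance_pct(port_iops: dict) -> float:
    """Largest deviation from the mean across ports, as a percentage of the mean."""
    avg = mean(port_iops.values())
    if avg == 0:
        return 0.0
    return 100.0 * max(abs(v - avg) for v in port_iops.values()) / avg


# Hypothetical per-port IOPS for one zoning pattern (values are made up).
balanced = {"1:1": 2050, "2:2": 2100, "3:3": 1980, "4:4": 2076}
skewed = {"1:1": 3900, "2:2": 400, "3:3": 3800, "4:4": 350}

THRESHOLD_PCT = 10.0  # arbitrary cut-off for this example
for name, ports in (("balanced", balanced), ("skewed", skewed)):
    pct = imbalance_pct(ports)
    status = "check zoning/multi-pathing" if pct > THRESHOLD_PCT else "ok"
    print(f"{name}: {pct:.1f}% from mean, {status}")
```

The skewed pattern is what we would expect to see if hosts were only sending I/O down two of their four paths.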

In the second image below we can again see we have balance in our overall throughput.

If we take the spikes in write latency seen on the ports, we can see that only one port is peaking every hour. This would point to a micro burst of activity down one path in a round-robin setup. So let's look at the overall hosts view and see if we can spot the offending host.

We can see that a host was bursting on writes up to 72MB/sec for a short period. The red line, where we are highlighting the host write spike, matches the port latency increase, and the pattern repeats.
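Matching the spikes by eye works well on a dashboard, but the same idea can be sketched in code. This is a simplified illustration, assuming we have the port write latency and each host's write MB/sec as timestamp-to-value maps; the host names, samples and thresholds are all made up.

```python
def spike_times(series, threshold):
    """Return the set of timestamps where the series exceeds the threshold."""
    return {ts for ts, value in series.items() if value > threshold}


def hosts_matching_spikes(port_latency, host_write_mb, lat_ms, write_mb):
    """Hosts whose write bursts cover every latency spike seen on the port."""
    port_spikes = spike_times(port_latency, lat_ms)
    if not port_spikes:
        return []
    return [
        host
        for host, series in host_write_mb.items()
        if port_spikes <= spike_times(series, write_mb)
    ]


# Hypothetical samples keyed by epoch seconds (values are made up).
port_wlat = {0: 1.2, 1800: 1.3, 3600: 6.5, 5400: 1.1, 7200: 6.8}
host_wbw = {
    "hostA": {0: 5.0, 1800: 6.0, 3600: 72.0, 5400: 4.0, 7200: 71.0},
    "hostB": {0: 10.0, 1800: 11.0, 3600: 12.0, 5400: 10.0, 7200: 11.0},
}

suspects = hosts_matching_spikes(port_wlat, host_wbw, lat_ms=5.0, write_mb=50.0)
print(suspects)
```

Only the host whose write bursts line up with every latency spike is reported, which mirrors what the red line shows on the graph.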

In general that is about it for array ports. I will cover off some more details on how these can affect performance at my UKOUG Tech17 talk in December.
The JSON template, updated wrapper and parser for Array Performance Overview are located in my GitHub repository.
