Using Grafana & InfluxDB to view XIV Host Performance Metrics – Part 1

Overview

When looking at performance data I like to get an overall view of what all of my hosts are doing on our arrays. Historically the XIV GUI only allowed up to 12 hosts to be displayed at any point in time, so finding busy hosts based on high IOPS or bandwidth was a cumbersome exercise.

The original versions of these scripts parsed all hosts and created a file which could be imported into Excel. This proved to be an issue with more than 100 hosts, and finding noisy hosts in stacked views was cumbersome, which is why this tooling was developed.

The other key thing to take from this is that we now have a tool with which we can look at all hosts and their performance in one view. We also have access to Influx and can report on hosts which are breaching IOPS limits or getting close to HBA saturation.

In this first post I will cover the initial install of Grafana and InfluxDB and how we can parse host data. Further posts will cover how we can parse overall array data and volume based metrics.

Installing Grafana and InfluxDB

The install of these tools is very easy. For Grafana and Influx we can download the RPMs from the websites, or you can install from your own RPM source.

Influx

wget https://dl.influxdata.com/influxdb/releases/influxdb-1.3.6.x86_64.rpm
sudo yum localinstall influxdb-1.3.6.x86_64.rpm

Grafana

$ wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-4.5.2-1.x86_64.rpm
$ sudo yum install initscripts fontconfig
$ sudo rpm -Uvh grafana-4.5.2-1.x86_64.rpm

Ok so once we have it installed we need to start the services, then log in to influx and create a database to store our data.

root@henky:~# service influxdb start
root@henky:~# service grafana-server start
root@henky:~# influx
Connected to http://localhost:8086 version 1.3.6
InfluxDB shell version: 1.3.6
> show databases
name: databases
name
----
_internal
> create database XIV

Once we have installed the services and made sure they are started you should be able to go to your server ip address on port 3000 and be greeted with the Grafana login page.

[ Image: grafana_login_page ]

The statistics_get command

The statistics_get command requires the xcli tool to be installed. I won't be covering the install as I assume that anyone wanting to use this for monitoring will already have access to this tool.

When running the statistics_get command we need to pass in the following:

  • start time
  • count
  • resolution_unit
  • interval
  • host/local_fc_port/volume

We then add in say a fc_port or a host etc depending on the data we require.

/usr/bin/xcli -m $xiv -s -u $user -p $pass statistics_get start=${year}-${month}-${day}.0:${stime}:00 count=1440 resolution_unit=minute interval=1 host=${i} > ${xiv}_port/${i}_${month}_${day}_${xiv}.csv

So let's parse the above command and the variables.

$xiv   = XIV Array 
$user  = Username 
$pass  = Password
$year  = Year 
$month = Month 
$day   = Day
$stime = start time 
count  = 1440 ( Number of minutes in a day ) 
resolution_unit=minute ( Sets the unit of measurement for the length of each bin.) 
interval=1 ( The length of time in each statistic’s time point )
host = $i ( host name )
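
As a rough illustration of how these pieces fit together, the sketch below assembles the same statistics_get argument list in Python. The helper name build_statistics_get and the array, user and host values are made up for the example; the timestamp layout follows the start= value in the command above. The command is only built here, not run.

```python
from datetime import datetime

def build_statistics_get(xiv, user, password, start, host, count=1440):
    """Return the xcli argument list for one host (hypothetical helper)."""
    # statistics_get takes timestamps as YYYY-MM-DD.HH:MM:SS
    stamp = start.strftime("%Y-%m-%d.%H:%M:00")
    return [
        "/usr/bin/xcli", "-m", xiv, "-s", "-u", user, "-p", password,
        "statistics_get",
        f"start={stamp}",
        f"count={count}",            # 1440 one-minute samples = one day
        "resolution_unit=minute",
        "interval=1",
        f"host={host}",
    ]

# Example values only: array "xiv01", host "servera"
cmd = build_statistics_get("xiv01", "monitor", "secret",
                           datetime(2017, 10, 24, 0, 0), "servera")
print(" ".join(cmd))
```

Keeping the arguments as a list rather than one long string means the result could be handed to something like subprocess.run without any shell quoting.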

So once we run this command we get data returned in CSV format which is detailed here: Statistics_get_command

This is described as follows:

“This command lists I/O statistics. The count parameter sets the number of lines in the statistics report. Together, the interval and resolution_unit set the length of time for each statistics line. Either start or end timestamps must be provided. These timestamps set the time for the statistics report. Other parameters restrict statistics to a specific host, host port, volume, interface port and so on. For each line of statistics, 48 numbers are reported, which represent all the combinations of reads/writes, hits/misses and I/O size reporting for each of the 16 options for bandwidth, IOPS and latency.”

So what do we do with the data once we have it? Well, we need to break down each of the 16 options for each type and input this data into our database.

The way I parse the data is to use an array, which allows me to select the appropriate data based on the table below. Note this is just an example of some of the data.

As arrays start at 0 we want to work with what I have called Array Position as opposed to the document's Default Position.

Id                              Name                               Array Position  Default Position
time                            Time                               0               1
failures                        Failures
aborts                          Aborts
read_hit_very_large_iops        Read Hit Very large – IOps         1               2
read_hit_very_large_latency     Read Hit Very large – Latency      2               3
read_hit_very_large_throughput  Read Hit Very large – Throughput   3               4
read_hit_large_iops             Read Hit Large – IOps              4               5
read_hit_large_latency          Read Hit Large – Latency           5               6
read_hit_large_throughput       Read Hit Large – Throughput        6               7
read_hit_medium_iops            Read Hit Medium – IOps             7               8
read_hit_medium_latency         Read Hit Medium – Latency          8               9
read_hit_medium_throughput      Read Hit Medium – Throughput       9               10
read_hit_small_iops             Read Hit Small – IOps              10              11
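
The pattern in the table can also be written down as a small formula. The Python sketch below is an illustration only: it assumes the layout implied by the table and by the index values used in the Perl code in this post (four groups of twelve columns starting at position 1, each ordered very large, large, medium, small, with IOPS, latency and throughput per size). The function name array_position is made up, and the read_mem_hit columns from position 49 onwards are not covered.

```python
OPS = ["read_hit", "read_miss", "write_hit", "write_miss"]
SIZES = ["very_large", "large", "medium", "small"]
METRICS = ["iops", "latency", "throughput"]

def array_position(op, size, metric):
    """Array position of one counter, with time at position 0 (assumed layout)."""
    return 1 + 12 * OPS.index(op) + 3 * SIZES.index(size) + METRICS.index(metric)

print(array_position("read_hit", "very_large", "iops"))   # 1
print(array_position("read_hit", "small", "iops"))        # 10
print(array_position("write_miss", "small", "latency"))   # 47
```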

So as an example, to get read IOPS for a host we need to sum the following values in Perl.

$read_hit_iops=$list[1]+$list[4]+$list[7]+$list[10];
$read_miss_iops=$list[13]+$list[16]+$list[19]+$list[22];
$read_mem_hit_iops=$list[49]+$list[52]+$list[55]+$list[58];

So as you can see, gathering the appropriate data for IOPS is really easy: we just sum up the appropriate array values.

Below we can see how we get the total IOPS data.

$total_iops=($read_hit_iops+$read_miss_iops+$write_hit_iops+$write_miss_iops);

Ok so let's break down each part and then look at what this looks like in our tool.

IOPS

$read_hit_iops=$list[1]+$list[4]+$list[7]+$list[10];
$read_miss_iops=$list[13]+$list[16]+$list[19]+$list[22];
$write_hit_iops=$list[25]+$list[28]+$list[31]+$list[34];
$write_miss_iops=$list[37]+$list[40]+$list[43]+$list[46];
$read_mem_hit_iops=$list[49]+$list[52]+$list[55]+$list[58];
# Get totals 
$total_iops=($read_hit_iops + $read_miss_iops + $write_hit_iops + $write_miss_iops);
$total_riops=($read_hit_iops + $read_miss_iops);
$total_wiops=($write_hit_iops + $write_miss_iops);
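
For anyone who wants to check the arithmetic outside of Perl, here is a minimal Python sketch of the same sums. The helper names and the sample row are made up; the positions are the ones used in the Perl code above.

```python
def sum_positions(row, positions):
    """Add up the numeric values found at the given array positions."""
    return sum(float(row[p]) for p in positions)

def iops_totals(row):
    read_hit = sum_positions(row, (1, 4, 7, 10))
    read_miss = sum_positions(row, (13, 16, 19, 22))
    write_hit = sum_positions(row, (25, 28, 31, 34))
    write_miss = sum_positions(row, (37, 40, 43, 46))
    return {
        "riops": read_hit + read_miss,
        "wiops": write_hit + write_miss,
        "tiops": read_hit + read_miss + write_hit + write_miss,
    }

# Made-up row: 4 read-hit IOPS and 2 write-hit IOPS, both in the small bucket.
row = ["0"] * 62
row[10] = "4"
row[34] = "2"
print(iops_totals(row))
```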

Latency

We work out latency as an IOPS-weighted average: latency * ( latency_iops_type / total_type_iops )

# so for wmss
# i.e. write_miss_small_latency * ( write_miss_small_iops / total_write_miss_iops )
# and repeat for each I/O size to get the miss latency total.

sub get_write_miss_latency {
 my $total_writes = shift;
 # start every size bucket at 0 so a value from an earlier call cannot carry over
 my ( $wmss, $wmms, $wmls, $wmvls ) = ( 0, 0, 0, 0 );
 if ( $total_writes > 0 )
 {
 # latency * ( latency_iops_type / total_type_iops )
 $wmss = $list[47] * ( $list[46] / $total_writes ) if $list[46] ;
 $wmms = $list[44] * ( $list[43] / $total_writes ) if $list[43] ;
 $wmls = $list[41] * ( $list[40] / $total_writes ) if $list[40] ;
 $wmvls = $list[38] * ( $list[37] / $total_writes ) if $list[37] ;
 return ( $wmss + $wmms + $wmls + $wmvls ) / 1000 ;
 }
 else
 { return 0; }
}

sub get_write_hit_latency {
 my $total_writes = shift;
 my ( $whss, $whms, $whls, $whvls ) = ( 0, 0, 0, 0 );
 if ( $total_writes > 0 )
 {
 $whss = $list[35] * ( $list[34] / $total_writes ) if $list[34] ;
 $whms = $list[32] * ( $list[31] / $total_writes ) if $list[31] ;
 $whls = $list[29] * ( $list[28] / $total_writes ) if $list[28] ;
 $whvls = $list[26] * ( $list[25] / $total_writes ) if $list[25] ;
 return ( $whss + $whms + $whls + $whvls ) / 1000 ;
 }
 else
 { return 0; }
}

sub get_read_miss_latency {
 my $total_reads = shift;
 my ( $rmss, $rmms, $rmls, $rmvls ) = ( 0, 0, 0, 0 );
 if ( $total_reads > 0 )
 {
 $rmss = $list[23] * ( $list[22] / $total_reads ) if $list[22] ;
 $rmms = $list[20] * ( $list[19] / $total_reads ) if $list[19] ;
 $rmls = $list[17] * ( $list[16] / $total_reads ) if $list[16] ;
 $rmvls = $list[14] * ( $list[13] / $total_reads ) if $list[13] ;
 return ( $rmss + $rmms + $rmls + $rmvls ) / 1000 ;
 }
 else
 { return 0; }
}

sub get_read_hit_latency {
 my $total_reads = shift;
 my ( $rhss, $rhms, $rhls, $rhvls ) = ( 0, 0, 0, 0 );
 if ( $total_reads > 0 )
 {
 $rhss = $list[11] * ( $list[10] / $total_reads ) if $list[10] ;
 $rhms = $list[8] * ( $list[7] / $total_reads ) if $list[7] ;
 $rhls = $list[5] * ( $list[4] / $total_reads ) if $list[4] ;
 $rhvls = $list[2] * ( $list[1] / $total_reads ) if $list[1] ;
 return ( $rhss + $rhms + $rhls + $rhvls ) / 1000 ;
 }
 else
 { return 0; }
}

# We then can call the functions to get the values required.
$total_writes = $write_miss_iops + $write_hit_iops ;
my $w_hit_lat = get_write_hit_latency($write_hit_iops);
my $w_miss_lat = get_write_miss_latency($write_miss_iops);
my $total_whit_lat = get_write_hit_latency($total_writes);
my $total_wmiss_lat = get_write_miss_latency($total_writes);
$whplusm = ( $total_whit_lat + $total_wmiss_lat ) ;

$total_reads = $read_miss_iops + $read_hit_iops;
my $r_hit_lat = get_read_hit_latency($read_hit_iops);
my $r_miss_lat = get_read_miss_latency($read_miss_iops);
my $total_rhit_lat = get_read_hit_latency($total_reads);
my $total_rmiss_lat = get_read_miss_latency($total_reads);
$rhplusm = ( $total_rhit_lat + $total_rmiss_lat ) ;
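
To make the weighting concrete, here is a small worked example as a Python sketch. The function name and the numbers are made up. The division by 1000 mirrors the Perl code; it looks like a conversion from microseconds to milliseconds, but that is an assumption on my part as the raw unit is not stated above.

```python
def weighted_latency_ms(pairs):
    """pairs: (iops, latency) per I/O size bucket; returns the weighted mean / 1000."""
    total_iops = sum(iops for iops, _ in pairs)
    if total_iops <= 0:
        return 0.0          # no I/O in this interval, so no latency to report
    weighted = sum(lat * (iops / total_iops) for iops, lat in pairs)
    return weighted / 1000

# Made-up numbers: 3 IOPS at 200 and 1 IOPS at 600, the other two buckets idle.
# 200 * (3/4) + 600 * (1/4) = 300, and 300 / 1000 = 0.3
print(weighted_latency_ms([(3, 200), (1, 600), (0, 0), (0, 0)]))
```

This is also why the Perl functions get called twice. Called with the hit (or miss) IOPS as the divisor they return the average latency of just the hits (or misses). Called with the combined total they return each side's share of the overall latency, so adding the two shares gives the overall read or write latency.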

Bandwidth

We work out the throughput using the same summing method as for IOPS, then divide the result by 1024.

$read_hit_mb=($list[3]+$list[6]+$list[9]+$list[12]) /1024;
$read_miss_mb=($list[15]+$list[18]+$list[21]+$list[24]) /1024;

$write_hit_mb=($list[27]+$list[30]+$list[33]+$list[36])/1024;
$write_miss_mb=($list[39]+$list[42]+$list[45]+$list[48])/1024;

$total_read_mb=($read_hit_mb+$read_miss_mb);
$total_write_mb=($write_hit_mb+$write_miss_mb);
$total_mb=$total_read_mb+$total_write_mb;
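
The same idea as a short Python sketch, with the divide by 1024 included. The row is made up: the two input values were picked so that the results line up with the rbandw, wbandw and tbandw values in the sample output further down this post.

```python
def throughput_mb(row, positions):
    """Sum the throughput columns and divide by 1024, as the Perl code does."""
    return sum(float(row[p]) for p in positions) / 1024

row = ["0"] * 62          # made-up row, all counters zero
row[12] = "12"            # read hit, small I/O throughput
row[36] = "7"             # write hit, small I/O throughput

read_mb = throughput_mb(row, (3, 6, 9, 12)) + throughput_mb(row, (15, 18, 21, 24))
write_mb = throughput_mb(row, (27, 30, 33, 36)) + throughput_mb(row, (39, 42, 45, 48))
print(read_mb, write_mb, read_mb + write_mb)
```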

Parsing and Printing the data.

To label the data for whatever we are parsing (host name, port etc.) we need to pull that name out of the initial CSV file name, as we use it in our print statements.

open(XIV,$file) || die("Could not open file $file \n");
my @name = split(/_/,$file);
my $port=lc $name[1];
# strip the leading directory part, leaving just the host/port name
$port =~ s/.*\///g;
# get the timestamp from the data and remove any surrounding ""
$ts=$list[61];
$ts =~ s/"//g;
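
To show what the file name handling is meant to produce, here is a Python sketch of the same steps. The function name and the example path are made up; the path follows the ${xiv}_port/${i}_${month}_${day}_${xiv}.csv pattern from the xcli command earlier.

```python
def host_from_path(path):
    """Split on "_", take field 1, lowercase it, then drop the directory part."""
    field = path.split("_")[1].lower()
    return field.rsplit("/", 1)[-1]

print(host_from_path("XIV01_port/ServerA_10_24_XIV01.csv"))
```

One thing this makes visible: because the split is on underscores, an array name or host name that itself contains an underscore would be cut short, so this approach assumes names without underscores.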

Now we can print the data which we can then input into influxdb

# IOPS
 print "XIVHostIO,PortIO=$port,type=tiops value=$total_iops $ts\n";
 print "XIVHostIO,PortIO=$port,type=riops value=$total_riops $ts\n";
 print "XIVHostIO,PortIO=$port,type=wiops value=$total_wiops $ts\n";
 print "XIVHostIORH,PortIO=$port,type=rmiops value=$read_miss_iops $ts\n";
 print "XIVHostIORH,PortIO=$port,type=wmiops value=$write_miss_iops $ts\n";
 print "XIVHostIORH,PortIO=$port,type=rhiops value=$read_hit_iops $ts\n";
 print "XIVHostIORH,PortIO=$port,type=whiops value=$write_hit_iops $ts\n";

# BW
 print "XIVHostBW,PortBW=$port,type=tbandw value=$total_mb $ts\n";
 print "XIVHostBW,PortBW=$port,type=rbandw value=$total_read_mb $ts\n";
 print "XIVHostBW,PortBW=$port,type=wbandw value=$total_write_mb $ts\n";
 print "XIVHostBWHM,PortBW=$port,type=rmbandw value=$read_miss_mb $ts\n";
 print "XIVHostBWHM,PortBW=$port,type=wmbandw value=$write_miss_mb $ts\n";
 print "XIVHostBWHM,PortBW=$port,type=rhbandw value=$read_hit_mb $ts\n";
 print "XIVHostBWHM,PortBW=$port,type=whbandw value=$write_hit_mb $ts\n";

#LATENCY
 print "XIVHostRT,PortRT=$port,type=wlat value=$whplusm $ts\n";
 print "XIVHostRTHM,PortRT=$port,type=wmiss value=$w_miss_lat $ts\n";
 print "XIVHostRTHM,PortRT=$port,type=whit value=$w_hit_lat $ts\n";
 print "XIVHostRT,PortRT=$port,type=rlat value=$rhplusm $ts\n";
 print "XIVHostRTHM,PortRT=$port,type=rmiss value=$r_miss_lat $ts\n";
 print "XIVHostRTHM,PortRT=$port,type=rhit value=$r_hit_lat $ts\n";

Once we have parsed the data we get something which looks like this.

XIVHostIO,PortIO=servera,type=tiops value=6 1508886060 
XIVHostIO,PortIO=servera,type=riops value=4 1508886060 
XIVHostIO,PortIO=servera,type=wiops value=2 1508886060 
XIVHostIORH,PortIO=servera,type=rmiops value=0 1508886060 
XIVHostIORH,PortIO=servera,type=wmiops value=0 1508886060 
XIVHostIORH,PortIO=servera,type=rhiops value=4 1508886060 
XIVHostIORH,PortIO=servera,type=whiops value=2 1508886060 
XIVHostBW,PortBW=servera,type=tbandw value=0.0185546875 1508886060 
XIVHostBW,PortBW=servera,type=rbandw value=0.01171875 1508886060 
XIVHostBW,PortBW=servera,type=wbandw value=0.0068359375 1508886060 
XIVHostBWHM,PortBW=servera,type=rmbandw value=0 1508886060 
XIVHostBWHM,PortBW=servera,type=wmbandw value=0 1508886060 
XIVHostBWHM,PortBW=servera,type=rhbandw value=0.01171875 1508886060 
XIVHostBWHM,PortBW=servera,type=whbandw value=0.0068359375 1508886060 
XIVHostRT,PortRT=servera,type=wlat value=0.293 1508886060 
XIVHostRTHM,PortRT=servera,type=wmiss value=0 1508886060 
XIVHostRTHM,PortRT=servera,type=whit value=0.293 1508886060 
XIVHostRT,PortRT=servera,type=rlat value=0.061 1508886060 
XIVHostRTHM,PortRT=servera,type=rmiss value=0 1508886060 
XIVHostRTHM,PortRT=servera,type=rhit value=0.061 1508886060
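
Each of these lines follows the InfluxDB line protocol: measurement name, comma separated tags, a space, the field, a space, and the timestamp. As a small sketch of that layout in Python (the function name influx_line is made up for the example):

```python
def influx_line(measurement, tag_key, port, metric_type, value, ts):
    """Format one point as: measurement,tags field timestamp."""
    return f"{measurement},{tag_key}={port},type={metric_type} value={value} {ts}"

line = influx_line("XIVHostIO", "PortIO", "servera", "tiops", 6, 1508886060)
print(line)
```

Note that this sketch does no escaping. Tag values containing spaces, commas or equals signs need a backslash in front of those characters in the line protocol, so host names with such characters would need extra handling.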

Sending the data to influx

To send the data all we need to do is use the curl command to send the write stream over HTTP. This assumes you have written the above output into a file named data.

POST="http://localhost:8086/write?db=XIV&precision=s"

/usr/bin/curl -i -XPOST ${POST} --data-binary @data
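
If you would rather send the data from a script than from curl, the same request can be put together with Python's standard library. This is only a sketch: the function name build_write_request is made up, and it builds the request without sending it so that it can be tried without a running InfluxDB.

```python
import urllib.request

def build_write_request(lines, db="XIV", host="localhost", port=8086):
    """Build (but do not send) the POST for the InfluxDB 1.x /write endpoint."""
    url = f"http://{host}:{port}/write?db={db}&precision=s"
    body = "\n".join(lines).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")

req = build_write_request(["XIVHostIO,PortIO=servera,type=tiops value=6 1508886060"])
print(req.get_method(), req.full_url)

# To actually send it (needs a running InfluxDB):
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)   # InfluxDB 1.x answers 204 on a successful write
```

The precision=s part matters here, as the timestamps we print are epoch seconds rather than InfluxDB's default of nanoseconds.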

Once we have sent the data we can start to create a dashboard using Grafana. First we need to check we got the data. So in influx we can see we have some data.

 > show measurements
name: measurements
name
----
XIVHostBW
XIVHostBWHM
XIVHostIO
XIVHostIORH
XIVHostRT
XIVHostRTHM
> show series
key
---
XIVHostBW,PortBW=servera,type=rbandw
XIVHostBW,PortBW=servera,type=tbandw
XIVHostBW,PortBW=servera,type=wbandw
XIVHostBWHM,PortBW=servera,type=rhbandw
XIVHostBWHM,PortBW=servera,type=rmbandw
XIVHostBWHM,PortBW=servera,type=whbandw
XIVHostBWHM,PortBW=servera,type=wmbandw
XIVHostIO,PortIO=servera,type=riops
XIVHostIO,PortIO=servera,type=tiops
XIVHostIO,PortIO=servera,type=wiops
XIVHostIORH,PortIO=servera,type=rhiops
XIVHostIORH,PortIO=servera,type=rmiops
XIVHostIORH,PortIO=servera,type=whiops
XIVHostIORH,PortIO=servera,type=wmiops
XIVHostRT,PortRT=servera,type=rlat
XIVHostRT,PortRT=servera,type=wlat
XIVHostRTHM,PortRT=servera,type=rhit
XIVHostRTHM,PortRT=servera,type=rmiss
XIVHostRTHM,PortRT=servera,type=whit
XIVHostRTHM,PortRT=servera,type=wmiss
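
The same check can be made over HTTP instead of from the influx shell, since InfluxDB 1.x also has a /query endpoint which takes the database and the query as URL parameters. A minimal Python sketch that only builds the URL (the function name is made up, and nothing is sent):

```python
from urllib.parse import urlencode

def build_query_url(q, db="XIV", host="localhost", port=8086):
    """Return the URL for the InfluxDB 1.x /query endpoint (nothing is sent here)."""
    return f"http://{host}:{port}/query?" + urlencode({"db": db, "q": q})

print(build_query_url("SHOW MEASUREMENTS"))
```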

Creating a first dashboard

In Grafana, once logged in as admin/admin, we need to add our Data Source for the database. As this is local to my VM the host will be localhost, and the DB name should match what you created in influx.

[ Select Grafana Logo -> Data Sources ]

[ Image: XIV_datasource ]

After creating the data source we can then create our host based dashboard by selecting home-> New Dashboard.

[ Image: create_dash ]

We now need to select the graph logo and we get the following.

[ Image: default_graph ]

Now click on the words Panel Title and select Edit. Once in edit mode, select XIV from the data source drop down; we can then build our query.

[ Image: create_first_view ]

We can now edit this to get the data we require. At the right-hand side there is a Toggle Edit Mode option which allows us to enter the query as text, i.e.

SELECT mean("value") FROM "measurement" WHERE $timeFilter GROUP BY time($__interval) fill(null)

What we can do is then fill in the appropriate values we want.

IOPS

SELECT max("value") FROM "XIVHostIO" WHERE "type" = 'riops' AND "PortIO" = 'servera' AND $timeFilter GROUP BY time($__interval) fill(null)

Once we have done this we then move to the Display tab and, under Stacking and Null value, change Null value to connected.

  1. Under the General tab name the chart Read IOPS
  2. Under the Axes tab change the unit from short to data rate->IO/s sec
  3. Finally select Back to dashboard at the top

Once we have a baseline chart we can then duplicate this and change the reads (riops) to writes (wiops) and also total (tiops)

What we then have is as follows.

[ Image: IOS_WRT ]

We can then repeat this for Bandwidth and Latency (remember to change the Axes to the appropriate units, megabytes and milliseconds).

BandWidth ( rbandw/wbandw/tbandw)

SELECT "value" FROM "XIVHostBW" WHERE "type" = 'rbandw' AND "PortBW" =~ /servera/ AND $timeFilter GROUP BY PortBW

Latency ( rlat/wlat)

SELECT "value" FROM "XIVHostRT" WHERE "type" = 'rlat' AND "PortRT" =~ /servera/ AND $timeFilter GROUP BY PortRT

We then get the following.

[ Image: all_data ]

We can then click on the save icon and name our dashboard.

Moving forward

With some more work we can then look at adding in multihost using templating and multi arrays. We can also include volumes/HBA and hit and miss statistics.

The wrapper scripts and the parsing script are available here: XIV Performance

7 thoughts on “Using Grafana & InfluxDB to view XIV Host Performance Metrics – Part 1”

  1. Allan,

    I have dabbled with Grafana, as well as collecting XIV stats, and the xcli tool – just never all at the same time. I do my current polling via SNMP and some perl scripts, but it is fairly rudimentary via MRTG with RRDtool.

    I just discovered your series of posts and am inspired to revisit pulling this data into Influx instead and then presenting via Grafana. Assuming you’re aware of MRTG, you’ll know it defaults to polling data every 5 minutes. I would love more granular than that, and I see the XIV supports interval = 1 for statistics_get, which would be ideal.

    However, your scripts look as if they might be setup to run only once per day (grabbing the 1440 data points from the previous day, per array/host/etc)? I’m curious if you’ve explored polling smaller counts more frequently, for a shorter time window.
    I’m guessing extracting ALL data via those wrapper scripts every minute might be a bit too stressful on the XIV, but perhaps updating every 15 minutes or hourly would be more practical? My desire is to have current data visible, not just previous days… but I also don’t want to crush the XIV management interface!

    Not sure if I should be concerned about that. Similarly, do you run all of your xiv_* wrapper scripts concurrently, or do you spread them out? Any insight you could provide would be much appreciated.

    Also, I’ll assume you’ve updated your wrapper scripts since posting, as the date was hard-coded to 2017 😉

    Thanks for sharing all of your hard work!

    Regards,
    Jason

    1. Hey Jason

      Thanks for the feedback and note on the scripts 🙂

      The xcli allows 1 min as the lowest resolution and if you increase this the data will be averaged over that timeframe.
      The scripts are set to just gather a whole day just to make things easy and the run-time for me has never been an issue as I just run it a few times each day with a catch all for the previous day say just after midnight.

      As you want to poll the data more frequently the process will complete quicker as the perl parser will have fewer lines to parse. I generally just run all the scripts concurrently for a single array via a generic wrapper as the majority of these complete quickly, whereas the hosts, volumes and hba scripts take the longest. One thing to note is that the host port can be 2x more per sample based on servers HBA count. I use this data to make sure we have balance with the I/O policy used on hosts and to diagnose if we have say an imbalance on a zoneset from one fabric to another. This would be evident from the array port data used in the zones not trending closely to each other's IOPS. However if you're not concerned about those then you could drop host_ports/volumes and gather only when required.

      I would say running hourly would be fine with a cut down sample size and I would not expect any load on the array based on a bunch of queries to gather performance data. It really just takes some testing to see how long it takes to complete on your array. You could do an extrapolation based on the xiv_array data as the same amount is returned for all queries based on sample count and should be linear.

      Feel free to ping me here or DM me on twitter @Al_unix if you have any issues/questions.

      Cheers
      Al

  2. A few quick notes:

    – I found that statistics_get only needs start *or end* specified. Rather than need to calculate the start of the window, I can just determine current time, and use that as “end.” Then, if I want to poll every 15 minutes, just specify count=15. This simplifies the wrapper date to a single var:

    date=`date '+%Y-%m-%d.%H:%M:00'`

    /opt/IBM_XIV_Storage_Management_GUI/xcli -m $xiv -s statistics_get end=$date count=15 resolution_unit=minute interval=1 host=$host

    – There seems to be a delay until the previous minute’s data is fully calculated by the XIV’s management console. I have no idea if there is any consistency, and frankly, not sure I’d trust it to ever be “ready” at a consistent interval. So whether I collect data every 15 minutes or hourly, (I haven’t actually implemented anything yet), it might be best wait a couple minutes (i.e. run via cron at 2,17,32,47) and then subtract two minutes from the current time.

    — Related to this, and I’ll research InfluxDB’s doc, if I instead collect 30 minutes of data every 15 minutes, will Influx just ignore any time data that was already collected and stored if it is attempted to be input again? This would address possible lost data during the collection server patch/reboot maintenance window. I could also then not bother with time shifting the collection.

    – After connecting once under an account’s context, xcli doesn’t seem to require the username/password to be specified again… or at least that is what I have found. That makes it nice, avoiding the need to specify a password in plaintext within scripts.

    Thanks again,
    Jason (@kungfoochef)

    1. Hey Jason

      As the data is written into influx with an epoch timestamp included from the statistics_get, any subsequent writes will just overwrite any data that’s already there.

      Al

  3. xiv_port line 34: missing $ before {xiv}
    ^ not really a problem, just a clean up item.

    I’m running into a problem with the “XIV SERVER Host and Array View” dashboard. Specifically the Volume selection.

    When I run xiv_vol, it finds 197 volumes and looks like it has entered those into InfluxDB based on the console output. However the dashboard view only displays 55 volumes. I tried adding a regex (Settings -> Variables -> Edit -> Query Options) that matches on volumes not displayed, but the preview fails to show any matches. I’m wondering if its the “show tag values with key = VolIO” — but I’m not sure what else could be used. I did a ‘show measurements’ and tried VolBW instead, same results.

    A select from influx (e.g. select * from XIVVolIO where time = 1521086400000000000) only returns values for those 55 volumes, so i don’t know if the entries for the missing vols ever made it into the database or if they have some other issue. My console log during xiv_vol shows “working on ” for those that are missing and indicates results were found, and the influx interface responds as if data was acquired.

    Separately, I’m also failing to see any data in the xiv-server-host-and-array-view dashboard. I adjusted the datasource variable, where you had a regex for /XIV|TEST/ or something like that… and it then returns my xiv in the preview. I’d rather focus on the above issue before diving into this one, but if you have any quick thoughts I could check?

    Thanks again for publishing this. The acquisition / formatting scripts made this really easy to spin up quickly!

    Regards,
    J

    1. Hey Jason

      The script for volumes will only enter the data if the script finds a time stamp variable in the csv file so it may be that the missing volumes have no i/o. If you can check the statistics get results for one of the missing volumes it should state if there was any data.

  4. Thanks, I will run a few custom get_statistics when I get a chance. There is normally a ton of IO on the missing volumes.
