Overview
When looking at performance data I like to get an overall view of what all of my hosts are doing on our arrays and historically the XIV GUI only allowed up to 12 hosts to be displayed at any point in time. As such finding busy hosts based on high IOPS or Bandwidth was a cumbersome exercise.
The original versions of these scripts were used to parse all hosts and then create a file which could be imported into excel, this however proved to be an issue with many hosts > 100 and actually finding noisy hosts based based on stacked views was cumbersome which is why this tooling was developed.
The other key thing to take from this is that we now have a tool with which we can look at all hosts and performance in one view, we also have access to influx and can report on hosts who are breaching IOPS limits or getting close to HBA saturation.
In this first post I will cover off the initial install of Grafana and Influxdb and how we can parse host data initially and then in further posts how we can parse overall array data and volume based metrics.
Installing grafana and Influxdb
The install of these tools is very easy for Grafana and Influx we can download the rpms from the websites or we you can install from your rpm source.
wget https://dl.influxdata.com/influxdb/releases/influxdb-1.3.6.x86_64.rpm
sudo yum localinstall influxdb-1.3.6.x86_64.rpm
$ wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-4.5.2-1.x86_64.rpm
$ sudo yum install initscripts fontconfig
$ sudo rpm -Uvh grafana-4.5.2-1.x86_64.rpm
Ok so once we have it installed we need to login to influx and create a database to store our data
root@henky:~# sevice influxdb start root@henky:~# sevice grafana-server start root@henky:~# influx Connected to http://localhost:8086 version 1.3.6 InfluxDB shell version: 1.3.6 > show databases name: databases name ---- _internal > create database XIV
Once we have installed the services and made sure they are started you should be able to go to your server ip address on port 3000 and be greeted with the Grafana login page.
Statistics_get command.
The statistics get command requires the xcli tool to be installed, I wont be covering the install as I assume that anyone wanting to use this for monitoring will already have access to this tool.
When running the statistics_get command we need to pass in the following:
- start time
- count
- resolution_unit
- interval.
- host/local_fc_port/volume
We then add in say a fc_port or a host etc depending on the data we require.
/usr/bin/xcli -m $xiv -s -u $user -p $pass statistics_get start=${year}-${month}-${day}.0:${stime}:00 count=1440 resolution_unit=minute interval=1 host=${i} > ${xiv}_port/${i}_${month}_${day}_${xiv}.csv
So lets parse the above command and the variables.
$xiv = XIV Array $user = Username $pass = Password $year = Year $month = Month $day = Day $stime = start time count = 1440 ( Number of minutes in a day ) resolution_unit=minute ( Sets the unit of measurement for the length of each bin.) interval=1 ( The length of time in each statistic’s time point ) host = $i ( host name )
So once we run this command we get data returned in CSV format which is detailed here: Statistics_get_command
This is described as follows:
” This command lists I/O statistics. The count parameter sets the number of lines in the statistics report. Together, the interval and resolution_unit set the length of time for each statistics line. Either start or end timestamps must be provided. These timestamps set the time for the statistics report. Other parameters restrict statistics to a specific host, host port, volume, interface port and so on. For each line of statistics, 48 numbers are reported, which represent all the combinations of reads/writes, hits/misses and I/O size reporting for each of the 16 options for bandwidth, IOPS and latency.”
So what do we do with the data once we have it.. well we need to break down each of the 16 options for each type and input this data into our database.
The way I parse the data is that I use an array which allows me to select the appropriate data based on the below table. Note this is just an example of some of the data.
As array’s start at 0 we want to start with what I have called Array Position as opposed to the documents Default Position.
Id | Name | Array Position | Default Position |
---|---|---|---|
time | Time | 0 | 1 |
failures | Failures | ||
aborts | Aborts | ||
read_hit_ very_large_ iops |
Read Hit Very large – IOps | 1 | 2 |
read_hit_ very_large_ latency |
Read Hit Very large – Latency | 2 | 3 |
read_hit_ very_large_ throughput |
Read Hit Very large – Throughput | 3 | 4 |
read_hit_ large_iops |
Read Hit Large – IOps | 4 | 5 |
read_hit_ large_latency |
Read Hit Large – Latency | 5 | 6 |
read_hit_ large_ throughput |
Read Hit Large – Throughput | 6 | 7 |
read_hit_ medium_iops |
Read Hit Medium – IOps | 7 | 8 |
read_hit_ medium_latency |
Read Hit Medium – Latency | 8 | 9 |
read_hit_ medium_ throughput |
Read Hit Medium – Throughput | 9 | 10 |
read_hit_ small_iops |
Read Hit Small – IOps | 10 | 11 |
So as an example for say read IOPS for a host we need to take the following values based on the perl language.
$read_hit_iops=$list[1]+$list[4]+$list[7]+$list[10];
$read_miss_iops=$list[13]+$list[16]+$list[19]+$list[22]; $read_mem_hit_iops=$list[49]+$list[52]+$list[55]+$list[58];
So as you can see gathering the appropriate data for IOPS is really easy we just sum up the appropriate array values.
Below we can see how we get the total IOPS data.
$total_iops=($read_hit_iops+$read_miss_iops+$write_hit_iops+$write_miss_iops);
Ok so lets break down each part and then look at what this looks like in our tool.
IOPS
$read_hit_iops=$list[1]+$list[4]+$list[7]+$list[10]; $read_miss_iops=$list[13]+$list[16]+$list[19]+$list[22]; $write_hit_iops=$list[25]+$list[28]+$list[31]+$list[34]; $write_miss_iops=$list[37]+$list[40]+$list[43]+$list[46]; $read_mem_hit_iops=$list[49]+$list[52]+$list[55]+$list[58]; # Get totals $total_iops=($read_hit_iops + $read_miss_iops + $write_hit_iops + $write_miss_iops); $total_riops=($read_hit_iops + $read_miss_iops); $total_wiops=($write_hit_iops + $write_miss_iops);Latency
Latency
We work out latency as follows: latency * ( latency_iops_type / total_type_iops )
# so for wmss # i.e. write_miss_small * ( write_miss_small_latency / total_write_miss_iops) # and repeat for each type to get the miss latency total. sub get_write_miss_latency { my $total_writes = shift; if ( $total_writes > 0 ) { # latency * ( latency_iops_type / total_type_iops ) $wmss = $list[47] * ( $list[46] / $total_writes ) if $list[46] ; $wmms = $list[44] * ( $list[43] / $total_writes ) if $list[43] ; $wmls = $list[41] * ( $list[40] / $total_writes ) if $list[40] ; $wmvls = $list[38] * ( $list[37] / $total_writes ) if $list[37]; $wmisslat = ( $wmss + $wmms + $wmls + $wmvls ) / 1000 ; } else { return 0; } } sub get_write_hit_latency { my $total_writes = shift; if ( $total_writes > 0 ) { $whss = $list[35] * ( $list[34] / $total_writes ) if $list[34] ; $whms = $list[32] * ( $list[31] / $total_writes ) if $list[31] ; $whls = $list[29] * ( $list[28] / $total_writes ) if $list[28] ; $whvls = $list[26] * ( $list[25] / $total_writes ) if $list[25]; $whisslat = ( $whss + $whms + $whls + $whvls ) / 1000 ; } else { return 0; } } sub get_read_miss_latency { my $total_reads = shift; if ( $total_reads > 0 ) { $rmss = $list[23] * ( $list[22] / $total_reads ) if $list[22] ; $rmms = $list[20] * ( $list[19] / $total_reads ) if $list[19] ; $rmls = $list[17] * ( $list[16] / $total_reads ) if $list[16] ; $rmvls = $list[14] * ( $list[13] / $total_reads ) if $list[13]; $rmisslat = ( $rmss + $rmms + $rmls + $rmvls ) / 1000 ; } else { return 0; } } sub get_read_hit_latency { my $total_reads = shift; if ( $total_reads > 0 ) { $rhss = $list[11] * ( $list[10] / $total_reads ) if $list[10] ; $rhms = $list[8] * ( $list[7] / $total_reads ) if $list[7] ; $rhls = $list[5] * ( $list[4] / $total_reads ) if $list[4] ; $rhvls = $list[2] * ( $list[1] / $total_reads ) if $list[1]; $rhisslat = ( $rhss + $rhms + $rhls + $rhvls ) / 1000 ; } else { return 0; } } # We then can call the functions to get the values required. $total_writes = $write_miss_iops + $write_hit_iops ; my $w_hit_lat = get_write_hit_latency($write_hit_iops); my $w_miss_lat = get_write_miss_latency($write_miss_iops); my $total_whit_lat = get_write_hit_latency($total_writes); my $total_wmiss_lat = get_write_miss_latency($total_writes); $whplusm = ( $total_whit_lat + $total_wmiss_lat ) ; $total_reads = $read_miss_iops + $read_hit_iops; my $r_hit_lat = get_read_hit_latency($read_hit_iops); my $r_miss_lat = get_read_miss_latency($read_miss_iops); my $total_rhit_lat = get_read_hit_latency($total_reads); my $total_rmiss_lat = get_read_miss_latency($total_reads); $rhplusm = ( $total_rhit_lat + $total_rmiss_lat ) ;
Bandwidth
We work out the throughput as follows using the same method for I/O and Latency.
$read_hit_mb=($list[3]+$list[6]+$list[9]+$list[12]) /1024; $read_miss_mb=($list[15]+$list[18]+$list[21]+$list[24]) /1024; $write_hit_mb=($list[27]+$list[30]+$list[33]+$list[36])/1024; $write_miss_mb=($list[39]+$list[42]+$list[45]+$list[48])/1024; $total_read_mb=($read_hit_mb+$read_miss_mb); $total_write_mb=($write_hit_mb+$write_miss_mb); $total_mb=$total_read_mb+$total_write_mb;
Parsing and Printing the data.
To parse the data for whatever type we are parsing say Hostname / Port etc we need to pull that out from the initial csv file , we need this to use in our appropriate print statements
open(XIV,$file) || die("Could not open file $file \n"); my @name = split("\_",$file); my $port=lc $name[1]; $port = ~s/.*\///g; # get the timestamp from the data and remove any surrounding "" $ts=$list[61]; $ts = ~s/"//g;
Now we can print the data which we can the input into influxdb
# IOPS print "XIVHostIO,PortIO=$port,type=tiops value=$total_iops $ts\n"; print "XIVHostIO,PortIO=$port,type=riops value=$total_riops $ts\n"; print "XIVHostIO,PortIO=$port,type=wiops value=$total_wiops $ts\n"; print "XIVHostIORH,PortIO=$port,type=rmiops value=$read_miss_iops $ts\n"; print "XIVHostIORH,PortIO=$port,type=wmiops value=$write_miss_iops $ts\n"; print "XIVHostIORH,PortIO=$port,type=rhiops value=$read_hit_iops $ts\n"; print "XIVHostIORH,PortIO=$port,type=whiops value=$write_hit_iops $ts\n"; # BW print "XIVHostBW,PortBW=$port,type=tbandw value=$total_mb $ts\n"; print "XIVHostBW,PortBW=$port,type=rbandw value=$total_read_mb $ts\n"; print "XIVHostBW,PortBW=$port,type=wbandw value=$total_write_mb $ts\n"; print "XIVHostBWHM,PortBW=$port,type=rmbandw value=$read_miss_mb $ts\n"; print "XIVHostBWHM,PortBW=$port,type=wmbandw value=$write_miss_mb $ts\n"; print "XIVHostBWHM,PortBW=$port,type=rhbandw value=$read_hit_mb $ts\n"; print "XIVHostBWHM,PortBW=$port,type=whbandw value=$write_hit_mb $ts\n"; #LATENCY print "XIVHostRT,PortRT=$port,type=wlat value=$whplusm $ts\n"; print "XIVHostRTHM,PortRT=$port,type=wmiss value=$w_miss_lat $ts\n"; print "XIVHostRTHM,PortRT=$port,type=whit value=$w_hit_lat $ts\n"; print "XIVHostRT,PortRT=$port,type=rlat value=$rhplusm $ts\n"; print "XIVHostRTHM,PortRT=$port,type=rmiss value=$r_miss_lat $ts\n"; print "XIVHostRTHM,PortRT=$port,type=rhit value=$r_hit_lat $ts\n";
Once we have parsed the data we get something which looks like this.
XIVHostIO,PortIO=servera,type=tiops value=6 1508886060 XIVHostIO,PortIO=servera,type=riops value=4 1508886060 XIVHostIO,PortIO=servera,type=wiops value=2 1508886060 XIVHostIORH,PortIO=servera,type=rmiops value=0 1508886060 XIVHostIORH,PortIO=servera,type=wmiops value=0 1508886060 XIVHostIORH,PortIO=servera,type=rhiops value=4 1508886060 XIVHostIORH,PortIO=servera,type=whiops value=2 1508886060 XIVHostBW,PortBW=servera,type=tbandw value=0.0185546875 1508886060 XIVHostBW,PortBW=servera,type=rbandw value=0.01171875 1508886060 XIVHostBW,PortBW=servera,type=wbandw value=0.0068359375 1508886060 XIVHostBWHM,PortBW=servera,type=rmbandw value=0 1508886060 XIVHostBWHM,PortBW=servera,type=wmbandw value=0 1508886060 XIVHostBWHM,PortBW=servera,type=rhbandw value=0.01171875 1508886060 XIVHostBWHM,PortBW=servera,type=whbandw value=0.0068359375 1508886060 XIVHostRT,PortRT=servera,type=wlat value=0.293 1508886060 XIVHostRTHM,PortRT=servera,type=wmiss value=0 1508886060 XIVHostRTHM,PortRT=servera,type=whit value=0.293 1508886060 XIVHostRT,PortRT=servera,type=rlat value=0.061 1508886060 XIVHostRTHM,PortRT=servera,type=rmiss value=0 1508886060 XIVHostRTHM,PortRT=servera,type=rhit value=0.061 1508886060
Sending the data to influx
To send the data all we need to do is use the curl command to send the write steam over http. This assumes you have written the above output into a file named data.
POST="http://localhost:8086/write?db=XIV&precision=s"
/usr/bin/curl -i -XPOST ${POST} --data-binary @data
Once we have sent the data we can start to create a dashboard using Grafana. First we need to check we got the data. So in influx we can see we have some data.
> show measurements name: measurements name ---- XIVHostBW XIVHostBWHM XIVHostIO XIVHostIORH XIVHostRT XIVHostRTHM > show series key --- XIVHostBW,PortBW=servera,type=rbandw XIVHostBW,PortBW=servera,type=tbandw XIVHostBW,PortBW=servera,type=wbandw XIVHostBWHM,PortBW=servera,type=rhbandw XIVHostBWHM,PortBW=servera,type=rmbandw XIVHostBWHM,PortBW=servera,type=whbandw XIVHostBWHM,PortBW=servera,type=wmbandw XIVHostIO,PortIO=servera,type=riops XIVHostIO,PortIO=servera,type=tiops XIVHostIO,PortIO=servera,type=wiops XIVHostIORH,PortIO=servera,type=rhiops XIVHostIORH,PortIO=servera,type=rmiops XIVHostIORH,PortIO=servera,type=whiops XIVHostIORH,PortIO=servera,type=wmiops XIVHostRT,PortRT=servera,type=rlat XIVHostRT,PortRT=servera,type=wlat XIVHostRTHM,PortRT=servera,type=rhit XIVHostRTHM,PortRT=servera,type=rmiss XIVHostRTHM,PortRT=servera,type=whit XIVHostRTHM,PortRT=servera,type=wmiss
Creating a first dashboard
In Grafana once logged in as admin/admin we need to add our Data Source for the database. As this is local to my VM it will be localhost where the DB name matches what you created in influx.
[ Select Grafana Logo -> Data Sources ]
After creating the data source we can then create our host based dashboard by selecting home-> New Dashboard.
We now need to select the graph logo and we get the following.
Now click on the word Panel Title and select edit, once in edit mode from the datasource drop down select XIV, we can then build our Query.
We can now edit this to get the data we require at the right had side there is a Toggle Edit Mode which allows us to use text as the SQL i.e.
SELECT mean("value") FROM "measurement" WHERE $timeFilter GROUP BY time($__interval) fill(null)
What we can do is then fill in the appropriate values we want.
IOPS
SELECT max("value") FROM "XIVHostIO" WHERE "type" = 'riops' AND "PortIO" = 'servera' AND $timeFilter GROUP BY time($__interval) fill(null)
Once we have done this we then move to the Display tab and under Stacking and Null value change this to connected.
- Under the General Tab name the Chart Read IOPS
- Under the Axes tab change the unit from short to data rate->IO/s sec
- Finally select at the top back to Dashboard
Once we have a baseline chart we can then duplicate this and change the reads (riops) to writes (wiops) and also total (tiops)
What we then have is as follows.
We can the repeat this for Bandwidth and Latency ( remember to change the Axes to the appropriate metrics megabytes and milliseconds )
BandWidth ( rbandw/wbandw/tbandw)
SELECT "value" FROM "XIVHostBW" WHERE "type" = 'rbandw' AND "PortBW" =~ /servera/ AND $timeFilter GROUP BY PortBW
Latency ( rlat/wlat)
SELECT "value" FROM "XIVHostRT" WHERE "type" = 'rlat' AND "PortRT" =~ /severa/ AND $timeFilter GROUP BY PortRT
What then get the following.
We can the click on the save Icon and name our Dashboard.
Moving forward
With some more work we can then look at adding in multihost using templating and multi arrays. We can also include volumes/HBA and hit and miss statistics.
The wrapper scripts and the parsing script are available here: XIV Performance
Allan,
I have dabbled with Grafana, as well as collecting XIV stats, and the xcli tool – just never all at the same time. I do my current polling via SNMP and some perl scripts, but it is fairly rudimentary via MRTG with RRDtool.
I just discovered your series of posts and am inspired to revisit pulling this data into Influx instead and then presenting via Grafana. Assuming you’re aware of MRTG, you’ll know it defaults to polling data every 5 minutes. I would love more granular than that, and I see the XIV supports interval = 1 for statistics_get, which would be ideal.
However, your scripts look as if they might be setup to run only once per day (grabbing the 1440 data points from the previous day, per array/host/etc)? I’m curious if you’ve explored polling smaller counts more frequently, for a shorter time window.
I’m guessing extracting ALL data via those wrapper scripts every minute might be a bit too stressful on the XIV, but perhaps updating every 15 minutes or hourly would be more practical? My desire is to have current data visible, not just previous days… but I also don’t want to crush the XIV management interface!
Not sure if I should be concerned about that. Similarly, do you run all of your xiv_* wrapper scripts concurrently, or do you spread them out? Any insight you could provide would be much appreciated.
Also, I’ll assume you’ve updated your wrapper scripts since posting, as the date was hard-coded to 2017 😉
Thanks for sharing all of your hard work!
Regards,
Jason
LikeLike
Hey Jason
Thanks for the feedback and note on the scripts 🙂
The xcli allows 1 min as the lowest resolution and if you increase this the data will be averaged over that timeframe.
The scripts are set to just gather a whole day just to make things easy and the run-time for me has never been an issue as I just run it a few times each day with a catch all for the previous day say just after middnight.
As you want to poll the data more frequently the process will complete quicker as the perl parser will have less lines to parse. I generally just run all the scripts concurrently for a single array via a generic wrapper as the majority of these complete quickly where as the hosts,volumes and hba scripts take the longest. One thing to note is that the host port can be 2x more per sample based on servers HBA count. I use this data to make sure we have balance with the I/O policy used on hosts and to diagnose if we have say a imbalance on a zoneset from one fabric to another. This would be evident from the array port data used in the zones not trending closely to each others IOPS. However if your not concerned about those then you could drop host_ports/volumes and gather only when required.
I would say running hourly would be fine with a cut down sample size and I would not expect any load on the array based on a bunch of queries to gather performance data. It really just takes some testing to see how long it takes to complete on your array. You could do an extrapolation based on the xiv_array data as the same amount is returned for all queries based on sample count and should be linear.
Feel free to ping me here or DM me on twitter @Al_unix if you have any issues/questions.
Cheers
Al
LikeLike
A few quick notes:
– I found that statistics_get only needs start *or end* specified. Rather than need to calculate the start of the window, I can just determine current time, and use that as “end.” Then, if I want to poll every 15 minutes, just specify count=15. This simplifies the wrapper date to a single var:
date=`date ‘+%Y-%m-%d.%H:%M:00’`
/opt/IBM_XIV_Storage_Management_GUI/xcli -m $xiv -s statistics_get end=$date count=15 resolution_unit=minute interval=1 host=$host
– There seems to be a delay until the previous minute’s data is fully calculated by the XIV’s management console. I have no idea if there is any consistency, and frankly, not sure I’d trust it to ever be “ready” at a consistent interval. So whether I collect data every 15 minutes or hourly, (I haven’t actually implemented anything yet), it might be best wait a couple minutes (i.e. run via cron at 2,17,32,47) and then subtract two minutes from the current time.
— Related to this, and I’ll research InfluxDB’s doc, if I instead collect 30 minutes of data every 15 minutes, will Influx just ignore any time data that was already collected and stored if it is attempted to be input again? This would address possible lost data during the collection server patch/reboot maintenance window. I could also then not bother with time shifting the collection.
– After connecting once under an account’s context, xcli doesn’t seem to require the username/password to be specified again… or at least that is what I have found. That makes it nice, avoiding the need to specify a password in plaintext within scripts.
Thanks again,
Jason (@kungfoochef)
LikeLike
Hey Jason
As the data is written into influx with a epoch timestamp included from the statistics_get any subsequent writes will just overwrite any data that’s already there.
Al
LikeLike
xiv_port line 34: missing $ before {xiv}
^ not really a problem, just a clean up item.
I’m running into a problem with the “XIV SERVER Host and Array View” dashboard. Specifically the Volume selection.
When I run xiv_vol, it finds 197 volumes and looks like it has entered those into InfluxDB based on the console output. However the dashboard view only displays 55 volumes. I tried adding a regex (Settings -> Variables -> Edit -> Query Options) that matches on volumes not displayed, but the preview fails to show any matches. I’m wondering if its the “show tag values with key = VolIO” — but I’m not sure what else could be used. I did a ‘show measurements’ and tried VolBW instead, same results.
A select from influx (e.g. select * from XIVVolIO where time = 1521086400000000000) only returns values for those 55 volumes, so i don’t know if the entries for the missing vols ever made it into the database or if they have some other issue. My console log during xiv_vol shows “working on ” for those that are missing and indicates results were found, and the influx interface responds as if data was acquired.
Separately, I’m also failing to see any data in the xiv-server-host-and-array-view dashboard. I adjusted the datasource variable, where you had a regex for /XIV|TEST/ or something like that… and it then returns my xiv in the preview. I’d rather focus on the above issue before diving into this one, but if you have any quick thoughts I could check?
Thanks again for publishing this. The acquisition / formatting scripts made this really easy to spin up quickly!
Regards,
J
LikeLike
Hey Jason
The script for volumes will only enter the data if the script finds a time stamp variable in the csv file so it may be that the missing volumes have no i/o. If you can check the statistics get results for one of the missing volumes it should state if there was any data.
LikeLike
Thanks, I will run a few custom get_statistics when I get a chance. There is normally a ton of IO on the missing volumes.
LikeLike