Using Grafana & InfluxDB to view XIV Host Performance Metrics – Part 2 Bringing it all together.

Overview

In the first blog in this series we added in IOPS and explained how we could look to add in Bandwidth and Latency. In this post we will cover creating the whole view, and in the next post we will cover adding in volumes and server host bus adapter details.

Bringing it all together

There are a few components to the view; I will detail these below.

IOPS

Basically, this is how many input/output operations we are doing per second. This is broken down into three categories: reads, writes and total.

The following InfluxQL (the SQL-like query language used by InfluxDB) is used throughout the dashboard.

SELECT "value" FROM "XIVHostIO" WHERE "type" = 'riops' AND "HostIO" =~ /$HOST/ AND $timeFilter GROUP BY HostIO
SELECT "value" FROM "XIVHostIO" WHERE "type" = 'wiops' AND "HostIO" =~ /$HOST/ AND $timeFilter GROUP BY HostIO
SELECT "value" FROM "XIVHostIO" WHERE "type" = 'tiops' AND "HostIO" =~ /$HOST/ AND $timeFilter GROUP BY HostIO

Breaking this down, we select the value, which is the field we used when writing the data, i.e.

XIVHostIO,HostIO=servera,type=tiops value=6 1508886060

We also specify the type, which is riops (read IOPS), wiops (write IOPS) or tiops (total IOPS). We tag the server as HostIO, which also allows us to use the same query syntax when looking at the whole array.
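
To make the write format concrete, here is a minimal Python sketch that assembles a point in the same shape as the example line above. The helper name to_line_protocol is my own, used purely for illustration, and it does no escaping of spaces or commas, so treat it as a sketch rather than a description of what the collection script actually does.

```python
def to_line_protocol(measurement, tags, value, timestamp):
    """Format one point as an InfluxDB line protocol string (no escaping)."""
    # Tags follow the measurement name as comma-separated key=value pairs.
    tag_part = ",".join(f"{key}={val}" for key, val in tags.items())
    # A single field called "value" carries the reading, then the timestamp.
    return f"{measurement},{tag_part} value={value} {timestamp}"


# Hypothetical sample: total IOPS of 6 for host "servera".
point = to_line_protocol(
    "XIVHostIO", {"HostIO": "servera", "type": "tiops"}, 6, 1508886060
)
print(point)
```

The measurement, the two tags (HostIO and type) and the value field line up one-to-one with the FROM, WHERE and SELECT parts of the queries.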

We have $HOST, which is passed in as a template variable populated by a query against Influx. This allows us to select multiple hosts and group by their name. I will explain how to create the lookup in the Templating section below.
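
As far as I understand it, when more than one host is selected Grafana renders a multi-value variable for InfluxDB as a regex group such as (servera|serverb), which is why the queries match with =~ rather than =. The Python sketch below only imitates that substitution so you can see what the query might look like once expanded; expand_host_variable and the host names are made up for the example.

```python
import re


def expand_host_variable(selected_hosts):
    """Imitate rendering a multi-value variable as a regex alternation."""
    return "(" + "|".join(re.escape(host) for host in selected_hosts) + ")"


pattern = expand_host_variable(["servera", "serverb"])

# The same read IOPS query with the assumed expansion in place of $HOST.
query = (
    'SELECT "value" FROM "XIVHostIO" WHERE "type" = \'riops\' '
    f'AND "HostIO" =~ /{pattern}/ AND $timeFilter GROUP BY HostIO'
)
print(query)

# An unanchored pattern also matches longer names that contain a selected name.
print(bool(re.search(pattern, "servera1")))
```

Because the match is unanchored, a host called servera1 would also be picked up when servera is selected, which is worth keeping in mind if your host names share a prefix.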

BW

Bandwidth is basically the amount of data we are reading from or writing to the array. This again is broken down into reads, writes and totals. This value is measured in megabytes/sec.

You will notice the query is similar to the IOPS one, except we select the type of rbandw etc.

SELECT "value" FROM "XIVHostBW" WHERE "type" = 'rbandw' AND "HostBW" =~ /$HOST/ AND $timeFilter GROUP BY HostBW
SELECT "value" FROM "XIVHostBW" WHERE "type" = 'wbandw' AND "HostBW" =~ /$HOST/ AND $timeFilter GROUP BY HostBW
SELECT "value" FROM "XIVHostBW" WHERE "type" = 'tbandw' AND "HostBW" =~ /$HOST/ AND $timeFilter GROUP BY HostBW

Latency

Latency is measured in milliseconds, and we break this down into read (rlat) and write (wlat) latency.

SELECT "value" FROM "XIVHostRT" WHERE "type" = 'rlat' AND "HostRT" =~ /$HOST/ AND $timeFilter GROUP BY HostRT
SELECT "value" FROM "XIVHostRT" WHERE "type" = 'wlat' AND "HostRT" =~ /$HOST/ AND $timeFilter GROUP BY HostRT

Hit & Miss Statistics

What are hit and miss statistics? Well, this is very easy to understand. When we write to an array we always write to memory called cache. This allows us to speed up the writes to the array without having to worry about the physical destage (write) to disk.

We also get the benefit that if we are reading data a lot it can be held in the array cache, giving a hit. This also improves performance, as we do not have to wait on the read from physical disk and the delays involved with that operation (a miss).

Hit = We get a read or write from the array memory.

Miss = We have to read or write to physical disk.

So why do we care about showing these versus general reads and writes? Well, we care because a miss is significantly slower than a hit. If you think that a hit has a cost of 1ms, a miss can be upwards of 8-20ms on random I/O, and if we are reading a lot of data you can see why this would impact the performance of the application requesting data.

An example could be a database used by a load test suite, which may not be used very often, where we need to read data and measure the performance of the application based on the time for queries to complete.

On the first run, or subsequent initial runs, we assume that no data is in cache, so we would need to read it from physical disk, which would show as slow during the test. When we re-test after all the data has been read into cache we get, say, 1ms response times, which then shows as an increase in performance. This is very relevant if you run with storage replication and fail over a service to a different data centre: you may experience poor performance until the cache warms up, and these stats allow us to see the impact.
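
To put rough numbers on that, here is a small Python sketch of the arithmetic, using the 1ms hit and 8-20ms miss figures above as illustrative inputs. The function name and the 90% hit ratio are assumptions for the example, not measurements from an array.

```python
def average_read_latency_ms(hit_ratio, hit_ms=1.0, miss_ms=8.0):
    """Weighted average latency for a given fraction of reads served from cache."""
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms


# Warm cache: 90% of reads are hits, misses cost 8ms (roughly 1.7ms on average).
warm = average_read_latency_ms(0.9)

# Same hit ratio, but misses at the slow end of the range (roughly 2.9ms).
warm_slow_disk = average_read_latency_ms(0.9, miss_ms=20.0)

# Cold cache: every read is a miss, so we pay the full 8ms each time.
cold = average_read_latency_ms(0.0)

print(round(warm, 2), round(warm_slow_disk, 2), round(cold, 2))
```

Even a 10% miss rate pushes the average well above the 1ms hit cost, which is why the miss counters get their own panels.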

Read/Write Hit and Miss IOPS

SELECT "value" FROM "XIVHostIORH" WHERE "type" = 'rhiops' AND "HostIO" =~ /$HOST/ AND $timeFilter GROUP BY HostIO
SELECT "value" FROM "XIVHostIORH" WHERE "type" = 'rmiops' AND "HostIO" =~ /$HOST/ AND $timeFilter GROUP BY HostIO
SELECT "value" FROM "XIVHostIORH" WHERE "type" = 'whiops' AND "HostIO" =~ /$HOST$/ AND $timeFilter GROUP BY HostIO
SELECT "value" FROM "XIVHostIORH" WHERE "type" = 'wmiops' AND "HostIO" =~ /$HOST$/ AND $timeFilter GROUP BY HostIO

So we select data from the XIVHostIORH table where the type is rhiops (read hit IOPS), rmiops (read miss IOPS), whiops (write hit IOPS) and wmiops (write miss IOPS).

One thing to point out is that a write hit is an overwrite of data already in cache, and a write miss is a new slot being written to.

** Key metric here is rmiops, which is the number of reads we need to get from physical disk.
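
If you want a single number to watch alongside rmiops, the read hit ratio can be derived from the two read counters. This is a hedged Python sketch with made-up sample values; read_hit_ratio is a hypothetical helper and is not part of the dashboard.

```python
def read_hit_ratio(rhiops, rmiops):
    """Fraction of read IOPS served from cache, or None when there are no reads."""
    total = rhiops + rmiops
    if total == 0:
        return None  # Avoid dividing by zero on an idle host.
    return rhiops / total


# Hypothetical sample: 900 read hits and 100 read misses per second.
ratio = read_hit_ratio(900, 100)
print(ratio)

# An idle host has no reads at all, so there is no ratio to report.
print(read_hit_ratio(0, 0))
```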

Read/Write Hit and Miss BW

As with IOPS we do the same for Bandwidth.

SELECT "value" FROM "XIVHostBWHM" WHERE "type" = 'rmbandw' AND "HostBW" =~ /$HOST/ AND $timeFilter GROUP BY HostBW
SELECT "value" FROM "XIVHostBWHM" WHERE "type" = 'rhbandw' AND "HostBW" =~ /$HOST/ AND $timeFilter GROUP BY HostBW
SELECT "value" FROM "XIVHostBWHM" WHERE "type" = 'wmbandw' AND "HostBW" =~ /$HOST/ AND $timeFilter GROUP BY HostBW
SELECT "value" FROM "XIVHostBWHM" WHERE "type" = 'whbandw' AND "HostBW" =~ /$HOST/ AND $timeFilter GROUP BY HostBW

So we select data from the XIVHostBWHM table where the type is rmbandw (read miss MB), rhbandw (read hit MB), wmbandw (write miss MB) and whbandw (write hit MB).

** Key metric here is the read miss BW (rmbandw), as this shows the amount of data we need to read from physical disk.

Read/Write Hit and Miss Latency

Keeping in line with the other queries (let’s be honest, they are all the same except for where we read the data from and the tagging):

SELECT "value" FROM "XIVHostRTHM" WHERE "type" = 'rhit' AND "HostRT" =~ /$HOST/ AND $timeFilter GROUP BY HostRT
SELECT "value" FROM "XIVHostRTHM" WHERE "type" = 'rmiss' AND "HostRT" =~ /$HOST/ AND $timeFilter GROUP BY HostRT
SELECT "value" FROM "XIVHostRTHM" WHERE "type" = 'whit' AND "HostRT" =~ /$HOST/ AND $timeFilter GROUP BY HostRT
SELECT "value" FROM "XIVHostRTHM" WHERE "type" = 'wmiss' AND "HostRT" =~ /$HOST/ AND $timeFilter GROUP BY HostRT

Again we are reading from XIVHostRTHM, where we have types of rhit (read hit), rmiss (read miss), whit (write hit) and wmiss (write miss) latencies.

** Key metric here is rmiss latency; this is where you will see the pain if you are experiencing poor performance.
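
Since every query has the same shape, one way to avoid copy and paste slips is to generate them from the measurement, tag and type names. The Python sketch below is only an illustration of that idea; build_query and QUERY_TEMPLATE are hypothetical names and this is not how the dashboard JSON was produced.

```python
# $HOST and $timeFilter are left as-is for Grafana to substitute.
QUERY_TEMPLATE = (
    'SELECT "value" FROM "{measurement}" WHERE "type" = \'{stat}\' '
    'AND "{tag}" =~ /$HOST/ AND $timeFilter GROUP BY {tag}'
)


def build_query(measurement, tag, stat):
    """Fill in the three parts that differ between panels."""
    return QUERY_TEMPLATE.format(measurement=measurement, tag=tag, stat=stat)


# The four hit and miss latency queries, generated from their type names.
latency_queries = [
    build_query("XIVHostRTHM", "HostRT", stat)
    for stat in ("rhit", "rmiss", "whit", "wmiss")
]
for line in latency_queries:
    print(line)
```

The same function covers the IOPS and bandwidth panels by swapping in XIVHostIO or XIVHostBW and their tag names.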

Templating

When we want to look at multi-host details or multi-volume details we need to create a template within Grafana which gives us the drop down menus.

Servers

Let's run through this setup.

We give it a Name (this is HOST, which we reference as $HOST in our queries) and we need the query.

show tag values with key = HostIO

Once we have that the rest is really pretty self-explanatory.

Array

You can see that I have a template for the Array which is set as follows.

Volumes

For Volumes we have the following template.

The Completed View

Scroll down and we get the rest of the view.

The JSON template for the Grafana XIV Host Dashboard is located in my GitHub repository.

** UPDATE 06/11/2017

The main perl script (graph_xiv_data.pl) and the wrapper scripts in the github repository have been updated to align each type with the type of stat we are gathering.

In the wrapper scripts we now pass in Port, HPort, Host and Port as flags to the perl script and the SQL will now reflect this.
