3PAR Performance Monitoring – Cache & DelAck

Overview

This series of posts will look at HPE 3PAR performance monitoring using the main stat commands for real-time analysis, and will culminate in a Grafana/InfluxDB solution for looking at historical statistics.

I will be covering the following key areas:

  • Cache
    • statcmp
  • CPU
    • statcpu
  • Hosts
    • statvlun / statvlun -hostsum
  • Ports
    • statport -host
  • Volumes
    • statvlun -vvsum

I will show an overall array statistics view, similar to the XIV array overview, listing all attached hosts, with a view to analysing possible noisy-neighbour workloads and monitoring performance.

Getting started

How do I run the statcmp command? Below shows how to set up a password file so we don't need to input a password for our commands.


# PATH=$PATH:/opt/hp_3par_cli/bin/
# setpassword -saveonly -file mypass.your3par 
system: your3par 
user:  
password:

Cache

So let’s get started: what is cache? It is the memory region within the storage array that writes are staged in before being de-staged to disk. It is also a dual-purpose cache, since reads can be serviced from it for faster access.

Let’s look at the columns made available to us via the statcmp command.

col,statcmp – Explanation for the column headers for statcmp

  • Node
    • Node Id on the storage system
  • Type
    • Read or Write
  • Accesses
    • Number of Current and Total Read/Write I/Os
  • Hits
    • Number of Read/Write I/Os in which data was already in cache
  • Hit%
    • Hits divided by accesses, displayed as a percentage
  • Free
    • Number of cache pages without valid data on them
  • Clean
    • Number of clean cache pages (valid data on page)
    • A page is clean when data in cache matches data on disk
  • Write1
    • Number of dirty pages that have been modified exactly 1 time
    • A page is dirty when it has been modified in cache but not written to disk
  • WriteN
    • Number of dirty pages that have been modified more than 1 time
  • WrtSched
    • Number of pages scheduled to be written to disk
  • Writing
    • Number of pages being currently written by the flusher to disk
  • DcowPend
    • Number of pages waiting for delayed copy on write resolution
  • DcowProc
    • Number of pages currently being processed for delayed copy on write resolution
  • LockBlk
    • Number of pages being modified by host I/O that are being written to disk by the flusher
  • CfcDirty
    • Current number of dirty cluster memory pages in the system for the specified device type class
  • CfcMax
    • Maximum allowed number of dirty cluster memory pages in the system for the specified device type class
  • DelAck
    • Number of delayed acknowledgements to the host in order to throttle the host’s IO writes due to cache resource constraints for the specified device type class
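
As a quick illustration of the Hit% column above, it is simply hits divided by accesses expressed as a percentage. A minimal awk sketch, using made-up sample numbers:

```shell
# Hit% = (Hits / Accesses) * 100 -- the values here are hypothetical
awk 'BEGIN { hits = 120; accesses = 400; printf "Hit%%: %.1f%%\n", (hits / accesses) * 100 }'
# prints: Hit%: 30.0%
```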

Looking at the above, one metric stands out to me as a possible indication of a performance problem: that metric is DelAck.

DelAck

As noted in its description, the delayed acknowledgement tells the hosts that the cache is getting full and we need to de-stage to make cache slots available. Thinking about this a bit more, a high write workload could saturate the cache. This is a key metric that we actively trend on: historically, when we have seen an increase in DelAcks, we have in turn seen high response times on all hosts connected to the array node.

I like to think of this in a similar vein to the Linux page cache, which has two thresholds: dirty_background_ratio and dirty_ratio. When dirty_background_ratio is reached as a percentage of memory, we start to de-stage data from cache to disk more actively than the expiry and writeback timers would. When we reach dirty_ratio itself, data is forcibly flushed from cache to disk, which can be a cause of poor performance.
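
The same two knobs can be inspected on any Linux host; the values you see are whatever your distribution defaults to:

```shell
# Percentage of memory at which background writeback starts
cat /proc/sys/vm/dirty_background_ratio
# Percentage of memory at which writes block until pages are flushed (forced de-stage)
cat /proc/sys/vm/dirty_ratio
```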

Delayed Ack is also tracked per node, depending on your system; in my example we have an all-flash array with four nodes.

Statcmp command

We have below an example from the cli help.

EXAMPLES
  The following example displays one iteration of CMP statistics for
  all nodes:

  cli% statcmp -iter 1

  16:13:49 03/18/11 ---- Current ----- ---------- Total -----------
  Node    Type   Accesses Hits Hit% Accesses Hits Hit% LockBlock
  0       Read        0      0    0       0   0   0           21
  0       Write       0      0    0       0   0   0           22
  1       Read        0      0    0       0   0   0           23
  1       Write       0      0    0       0   0   0           24
  Queue Statistics
  Node    Free    Clean Write1 WriteN WrtSched Writing DcowPend DcowProc
  0       473405 11790       0      0        0       0        0       0
  1       483013 2158        0      0        0       0        0       0
  Temporary and Page Credits
  Node Node0 Node1 Node2 Node3 Node4 Node5 Node6 Node7
  0    0  33411    --- ---    ---   ---    ---   ---
  1 33482 0        --- ---    ---   ---    ---   ---
  Page Statistics
  ---CfcDirty--- ------CfcMax------- ----DelAck----
  Node FC FC15 NL SSD     FC FC15     NL SSD FC FC15 NL SSD
  0    15 0    17 0   19200   0     9600 0   11   0  13 0
  1    16 0    18 0   19200   0     9600 0   12   0  14 0
       Press the enter key to stop...

Looking at the example from the help within the CLI we really want to grab the lines under Page Statistics.

Parsing the output

So we have a small awk script that can be used to parse this data in realtime.

$ statcmp -pwf ./mypass -sys yourpar -iter 10 -d 1 |awk -f ./awkcache
    Time Node     %WP_USED       WP_MAX      DEL_ACK
15:31:30    0         0.21        38400        37937
15:31:30    1         0.13        38400         8760
15:31:30    2         0.16        38400        62789
15:31:30    3         0.30        38400        22226
15:31:31    0         0.31        38400        37937
15:31:31    1         0.07        38400         8760
15:31:31    2         0.25        38400        62789
15:31:31    3         0.24        38400        22226
15:31:32    0         0.18        38400        37937
15:31:32    1         0.05        38400         8760
15:31:32    2         0.08        38400        62789
15:31:32    3         0.26        38400        22226
15:31:34    0         0.22        38400        37937
15:31:34    1         0.02        38400         8760
15:31:34    2         0.03        38400        62789
15:31:34    3         0.27        38400        22226
15:31:35    0         0.10        38400        37937
15:31:35    1         0.02        38400         8760
15:31:35    2         0.02        38400        62789
15:31:35    3         0.04        38400        22226

Great, so let’s explain the output.

  • %WP_USED
    • Number of dirty pages as a percentage of CfcMax
  • WP_MAX
    • Maximum allowed number of dirty cache pages (CfcMax)
  • DEL_ACK
    • Number of delayed acknowledgements

**What to track: DEL_ACK (historically) and %WP_USED (real-time, above 80%), which together give us an indication of the amount of cache used.**
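
To act on that threshold, a one-liner can flag samples where %WP_USED crosses the 80% mark. Here I pipe in two made-up sample lines in the same column order as the parsed output above:

```shell
# Flag any sample where %WP_USED (column 3) exceeds 80%
printf '15:31:30    0        85.21        38400        37937\n15:31:30    1         0.13        38400         8760\n' |
awk '$3 > 80 { printf "ALERT node %s at %s: %%WP_USED=%.2f\n", $2, $1, $3 }'
# prints: ALERT node 0 at 15:31:30: %WP_USED=85.21
```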

The Parsing Script

# Print the header row once
BEGIN { printf("%8s %4s %12s %12s %12s\n", "Time", "Node", "%WP_USED", "WP_MAX", "DEL_ACK") }

# Any line containing ":" carries the sample timestamp in its first field
$0 ~ /:/ { time = $1 }

# A line matching "RPM" is the Page Statistics header on this array;
# reset the node counter so we print the next block of node rows
$0 ~ /RPM/ { t = 0 }

# Node rows start with a node number 0-3; emit our derived columns:
#   %WP_USED = dirty pages as a percentage of the maximum ($5/$9)
#   WP_MAX   = maximum dirty pages ($9)
#   DEL_ACK  = last field on the row
$0 ~ /^ [0-3]/ && t <= 3 {
    printf("%6s %4d %12.2f %12d %12d\n", time, $1, ($5 / $9) * 100, $9, $NF)
    t++
}

# A blank line ends the Page Statistics block; stop printing until the next one
/^$/ { t = 5 }

We can also use this script with a file by doing the following.

# awk -f ./awkcache statcmp.out |more
    Time Node     %WP_USED       WP_MAX      DEL_ACK
16:00:21    0         0.01        38400        17089
16:00:21    1         0.00        38400        15746
16:00:21    2         0.03        38400         4949
16:00:21    3         0.00        38400         3724
16:00:36    0         0.00        38400        17089
16:00:36    1         0.00        38400        15746
16:00:36    2         0.03        38400         4949
16:00:36    3         0.00        38400         3724
16:00:51    0         0.01        38400        17089
16:00:51    1         0.00        38400        15746
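
Note that DEL_ACK is a cumulative counter, so what matters is its growth between samples rather than its absolute value. A small awk sketch can compute per-node deltas from the parsed output (the sample lines below are made up; in practice you would pipe in the awkcache output):

```shell
# Compute the per-interval DelAck delta for each node (column 2 = node,
# column 5 = cumulative DEL_ACK)
printf '%s\n' \
  '16:00:21    0         0.01        38400        17089' \
  '16:00:36    0         0.00        38400        17189' \
  '16:00:51    0         0.01        38400        17489' |
awk 'prev[$2] != "" { printf "%s node %s DelAck delta: %d\n", $1, $2, $5 - prev[$2] }
     { prev[$2] = $5 }'
# prints:
# 16:00:36 node 0 DelAck delta: 100
# 16:00:51 node 0 DelAck delta: 300
```

A steadily climbing delta on one node is the pattern to watch for.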

Cause of a DelAck

Plain and simple: large combined write workloads saturating the cache are what cause DelAcks.

Impact of a DelAck

So what does the impact of a DelAck look like to your hosts? In general, all hosts connected to the node will see an increase in latency. You will also see a drop in all I/O to and from the array ports and an increase in array port queuing.

DelAck Example

Below we can see an increase in the DelAck counters across all nodes.

Let’s zoom in and take a look at the bandwidth being issued to the array.

Highlighted in red above, we have a spike from 00:00 to 00:04; let’s break this down into write bandwidth for all hosts and see what that looks like.

We can see many hosts pushing about 200MB/sec in writes, which may not look out of the ordinary, so let’s stack these hosts and see what that looks like.

Now we can see that the hosts combined are pushing up to 3.4GB/sec of write bandwidth into the array, which is the cause of the saturation. Once identified, we would need to look at which hosts were issuing the writes and perform workload analysis.
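
The stacked view is just a sum over hosts; the same aggregation can be sketched with awk over per-host write bandwidth samples (the hostnames and figures below are invented for illustration):

```shell
# Sum per-host write bandwidth (MB/s) to get the combined load on the array
printf '%s\n' 'hostA 210' 'hostB 195' 'hostC 205' |
awk '{ total += $2 } END { printf "combined write BW: %d MB/s\n", total }'
# prints: combined write BW: 610 MB/s
```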

Closing remarks

Thankfully, cache saturation and the subsequent DelAcks are rare, but given the impact, this is one of the first things on my checklist when looking at any performance issue on the 3PAR arrays.

I hope this helps you understand the type of metrics you can gather, and gives an insight into the future posts showing how we can build array views to help identify these issues.
