This post covers some of the slides from my presentation of the same name, which I have given at a few conferences. I have omitted the tooling setup and the graphs of the issues, as those only show the impact rather than the reasoning behind it.
In the above slide we can see that the time to read data from rotational media is roughly 8ms for 7,200 RPM SATA drives. However, when the drives on the array are busy (shown as red/orange), a random read can take considerably longer, in the region of 20-80ms.
Bottleneck – The SAN
Here we have an example of a host connected to a switch at 2Gb, with 2 ISLs at 4Gb, and the array connected at 8Gb. Above the images is an example of the egress queues for the switch ports. The ones shown smaller are there to highlight what happens when these queues fill, as you will see in the next few slides.
Back-Pressure / Slow Drain
So at this point, when the queues at the host are full, we cannot receive any more data: its HBAs are maxed out (2x2Gb = 400MB/sec). But because we are reading data, and reads, unlike writes, do not block waiting for ACKs to come back, the array will keep sending data.
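A rough way to see why the queues fill is a tick-by-tick sketch. The rates below are nominal payload rates (roughly 100MB/s per Gb of link speed, matching the 2x2Gb = 400MB/sec figure above), and the buffer size is a made-up illustrative value, not a real switch specification:

```python
# Hypothetical per-millisecond simulation of read back-pressure.
ARRAY_TX_MB_PER_MS = 0.8   # array pushes ~800 MB/s of read data (8Gb port)
HOST_RX_MB_PER_MS = 0.4    # host drains ~400 MB/s (2 x 2Gb HBAs)
SWITCH_BUFFER_MB = 2.0     # assumed egress buffer on the host-facing port

fill_ms = None
queue_mb = 0.0
for ms in range(1, 11):
    queue_mb += ARRAY_TX_MB_PER_MS                 # data arriving from the array
    queue_mb -= min(queue_mb, HOST_RX_MB_PER_MS)   # host drains what it can
    if queue_mb >= SWITCH_BUFFER_MB:
        fill_ms = ms
        break

print(f"egress buffer full after {fill_ms}ms; "
      "back-pressure now propagates towards the array")
```

With these assumed numbers the queue grows by 0.4MB every millisecond, so even a generous buffer is exhausted almost immediately once the host stops keeping up.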
High Write Latency
The queues above the switch and host images are the transmit (TX) queues from the array and subsequently the receive (RX) queues for the host. As Fibre Channel is full duplex, the host can send the same amount of data via its own TX path, moving left to right, with its own ingress and egress queues.
So how do we get high write latency? For every write we need to wait for an acknowledgement from the array, and that ACK needs to be transmitted back to the host.
Now, if we are queuing on reads, that ACK travels back along the array's TX path towards the host's RX and, as you can imagine, it joins the queues all the way through the SAN.
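The extra write latency is simply the time the ACK spends waiting behind the queued read data. A small back-of-the-envelope sketch, with assumed values for the queue depth and the uncongested write time:

```python
# Delay an ACK sees crossing a congested RX path.
# queued_mb is read data already sitting in the queues ahead of the ACK.
def ack_delay_ms(queued_mb: float, drain_mb_per_s: float) -> float:
    """Time (ms) for the queue ahead of the ACK to drain."""
    return queued_mb / drain_mb_per_s * 1000.0

base_write_ms = 0.5   # assumed uncongested write round trip
extra_ms = ack_delay_ms(queued_mb=2.0, drain_mb_per_s=400.0)  # 2x2Gb host path
print(f"write latency ~ {base_write_ms + extra_ms:.1f} ms")
```

Just 2MB of queued read data on a 400MB/s path adds 5ms to every write, which is why hosts doing mostly writes still suffer when someone else's reads congest the fabric.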
Bottleneck – Speed mismatch
Speed mismatch is a real problem: a port at 8Gb can send data twice as fast as a port at 4Gb and four times as fast as a 2Gb port.
Now, if you are reading data transmitted at 8Gb/sec, then slowing down to 4Gb/sec and again to 2Gb/sec, you can see how this causes queuing. It becomes a major issue once you no longer have to wait several milliseconds for data to come from disk: when the array returns data in under 1ms, the links, not the media, become the bottleneck.
Above we can see hosts connected to the SAN at different speeds.
The key point in the above slide is that many storage admins create ISLs with multiple links to give aggregated bandwidth; however, the aggregate is not what matters. Below is an excerpt from the following Cisco document.
“SAN administrators should carefully analyze bandwidth of all individual links, even though multiple links grouped together (as a single port channel) can provide an acceptable oversubscription ratio. By default, the load-balancing scheme of Cisco MDS 9000 Family switches is based on source FCID (SID), destination FCID (DID) and exchange ID (OXID). All the frames of an exchange from a target to a host traverse through the same physical link of a port channel. In production networks, large number of end devices and exchanges provide uniformly distributed traffic pattern.
However, in some corner cases, large-size exchanges can congest a particular link of a port channel if the links connected to end devices are of higher bandwidth than the individual bandwidth of any of the members of the port channel. To alleviate such problems, Cisco recommends that ISLs should always be of higher or similar bandwidth than that of the links connected to the end devices.
© 2016 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public. Page 7 of 61″
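The port-channel behaviour described in the excerpt can be sketched as follows. This is an illustrative hash, not Cisco's actual implementation, but it shows the consequence: because SID, DID and OXID are fixed for the life of an exchange, every frame of that exchange lands on the same member link.

```python
# Illustrative exchange-based load balancing (not Cisco's real hash).
from zlib import crc32

def member_link(sid: int, did: int, oxid: int, num_links: int) -> int:
    """Pick a port-channel member from the exchange's fixed identifiers."""
    key = f"{sid:06x}{did:06x}{oxid:04x}".encode()
    return crc32(key) % num_links

# Every frame of one exchange between this host and target picks one link:
links = {member_link(0x010203, 0x040506, 0x1234, 4) for _ in range(100)}
print(len(links))  # a single large flow cannot spread across the 4 members
```

So a host reading at 8Gb through a 4x4Gb port channel is still, per exchange, limited to one 4Gb member, exactly the corner case the document warns about.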
So when your SAN team say they have an aggregate of 16Gb/s based on 4x4Gb ISLs and you are connected at 8Gb, you are going to have major problems.
This is a flow chart which we use when looking at host latency issues. If you gather stats for your hosts, you can use a similar approach; I have covered how to create these views in previous posts for 3PAR and IBM arrays.
Above shows a VM maxing out an ESX node and the subsequent OS-level QoS being put in place.