Recently I was asked to look at performance issues on a number of VMs which were showing increased write latency. But how do you know where a VM is sending its I/O in such a distributed system? With VSAN the compute resource for a VM can be on one server while all of its I/O is serviced by, say, 2-3 other servers in the same cluster. This makes troubleshooting difficult.
So I started work to see if I could make sense of this and map the VSAN environment, since mapping your environment is one of the first things we do when looking at performance issues.
Datastores and disk groups.
A caveat first: I am not a VMware admin, so this is my understanding of how VSAN is set up and it may not be 100% accurate.
As an example, we have x nodes in a cluster, each with local drives. In this example each node has 1x SSD (flash) and 3x HDD (SAS) drives, which make up a disk group (hybrid).
We create a disk group from the disks and add these into the VSAN cluster, which creates a distributed datastore over all the nodes and disk groups; this is essentially VSAN.
When we create a VM, its disk(s) are chunked up and distributed as evenly as possible over the disk groups in the cluster to maintain a specific RAID/fault policy, depending on the size of the VM.
Components
Components are basically chunks of an object (VMDK) which have a storage policy applied. This policy dictates the number of copies of the data. For a more complete reference I would point people at Cormac's blog here.
Given the distributed nature of VSAN, where a VM can be running on one node in the cluster but its I/O (i.e. where its writes land on disk) can be serviced by other nodes, how do we know where the VM is writing to?
In Cormac's blog above there is an image which shows the VM mapping of components within the vSphere Client. These can be exported to CSV from the client or gathered from RVC.
A text example looks like this.
Type,Component State,Host,Fault Domain,Cache Disk Name,Cache Disk Uuid,Capacity Disk Name,Capacity Disk Uuid
Witness,Active,esx00102.example.com,,Local ATA Disk (naa.55cd2e404b78c32c),524aaffe-0921-e52c-3df6-250df7b47977,SEAGATE Serial Attached SCSI Disk (naa.5000c500634b1947),5293836e-9d8b-36f9-6169-73dbe6ea9675
RAID 1,,,,,,,
Component,Active,esx00101.example.com,,Local ATA Disk (naa.55cd2e404b78c31d),5265ad81-7d7c-1dbb-48c6-01cb1ac25644,SEAGATE Serial Attached SCSI Disk (naa.5000c500634935f7),528672b1-74dd-a4b4-9e92-3267fde4d526
Component,Active,esx00100.example.com,,Local ATA Disk (naa.55cd2e404b78c3b5),52be7206-e309-5ff0-00ff-7d7d8bfeac9f,SEAGATE Serial Attached SCSI Disk (naa.5000c500634cf6d3),523c39b8-4f4a-5295-4f9b-84124b570fe1
Witness,Active,esx00100.example.com,,Local ATA Disk (naa.55cd2e404b78c3b5),52be7206-e309-5ff0-00ff-7d7d8bfeac9f,SEAGATE Serial Attached SCSI Disk (naa.5000c500634a7693),524285f4-e601-8b50-42c1-bf104fec49e9
RAID 1,,,,,,,
Component,Active,esx00102.example.com,,Local ATA Disk (naa.55cd2e404b78c32c),524aaffe-0921-e52c-3df6-250df7b47977,SEAGATE Serial Attached SCSI Disk (naa.5000c500634b50db),52d9940c-a76a-05ee-77a9-89ed00f04377
Component,Active,esx00101.example.com,,Local ATA Disk (naa.55cd2e404b78c31d),5265ad81-7d7c-1dbb-48c6-01cb1ac25644,SEAGATE Serial Attached SCSI Disk (naa.5000c500634935f7),528672b1-74dd-a4b4-9e92-3267fde4d526
So this is great: I can see which ESX node the component is on and which SSD and HDD will take the I/O for that component.
Some quick parsing later gives us a mapping i.e.
esx00100.example.com ssd_55cd2e404b78c3b5 hdd_5000c500634cf6d3 TESTVM05
esx00101.example.com ssd_55cd2e404b78c31d hdd_5000c500634935f7 TESTVM05
esx00101.example.com ssd_55cd2e404b78c31d hdd_5000c500634935f7 TESTVM05
esx00102.example.com ssd_55cd2e404b78c32c hdd_5000c500634b50db TESTVM05
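The parsing itself isn't shown here; a minimal Python sketch along these lines could produce that mapping from the exported CSV, assuming one export file per VM (the script and file names below are illustrative):

import csv
import re
import sys

NAA = re.compile(r"naa\.([0-9a-f]+)", re.IGNORECASE)

def parse_export(csv_path, vm_name):
    """Yield 'host ssd_<naa> hdd_<naa> vm' lines from a per-VM component CSV export."""
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Skip the RAID placement rows and (for this sketch) witness components.
            if row.get("Type") != "Component" or not row.get("Host"):
                continue
            cache = NAA.search(row.get("Cache Disk Name", ""))
            capacity = NAA.search(row.get("Capacity Disk Name", ""))
            if cache and capacity:
                yield f"{row['Host']} ssd_{cache.group(1)} hdd_{capacity.group(1)} {vm_name}"

if __name__ == "__main__":
    # e.g. python vsan_map.py testvm05.csv TESTVM05 >> mapping
    for line in parse_export(sys.argv[1], sys.argv[2]):
        print(line)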
So now that we have a mapping of the VM to SSDs and HDDs, we can start to build up a picture as follows.
[CLUSTER] -> [ESX NODE] -> [ SSD ] -> [ HDD ] -> [ VM...COMPONENT ]
                                   -> [ HDD ] -> [ VM... ]
                                   -> [ HDD ] -> [ VM... ]
Below shows each HDD and the number of components it contains; this can be built with a simple grep command combined with sort and uniq.
['esx00100.example.com','hdd_5000c500634a7693',36],
['esx00100.example.com','hdd_5000c500634b54d7',34],
['esx00100.example.com','hdd_5000c500634cf6d3',30],
['esx00100.example.com','hdd_5000c500634d2573',38],
['esx00101.example.com','hdd_5000c500634935f7',54],
['esx00101.example.com','hdd_5000c50063493ed7',56],
['esx00101.example.com','hdd_5000c500634b4c53',56],
['esx00101.example.com','hdd_5000c500634d0613',59],
['esx00102.example.com','hdd_5000c500634af28f',54],
['esx00102.example.com','hdd_5000c500634b1947',43],
['esx00102.example.com','hdd_5000c500634b50db',57],
['esx00102.example.com','hdd_5000c500634d0057',43],
With this data we can also do the following mappings
- CLUSTER to ESX
- ESX to HDD
- SSD to HDD ( disk group )
- HDD to VM
- SSD to VM
We can see above that each HDD has a number; this is the number of components that sit against each HDD in the mapping. We do the same for the SSDs.
grep ssd mapping |awk '{print $2,$3}' |sort |uniq -c |awk '{printf ("[\x27%s\x27,\x27%s\x27,%d],\n",$3,$2,$1)}'

['hdd_5000c500634935f7','ssd_55cd2e404b78c31d',54],
['hdd_5000c50063493ed7','ssd_55cd2e404b78c31d',56],
['hdd_5000c500634b4c53','ssd_55cd2e404b78c31d',56],
['hdd_5000c500634d0613','ssd_55cd2e404b78c31d',59],
['hdd_5000c500634af28f','ssd_55cd2e404b78c32c',54],
['hdd_5000c500634b1947','ssd_55cd2e404b78c32c',43],
['hdd_5000c500634b50db','ssd_55cd2e404b78c32c',57],
['hdd_5000c500634d0057','ssd_55cd2e404b78c32c',43],
['hdd_5000c500634a7693','ssd_55cd2e404b78c3b5',36],
['hdd_5000c500634b54d7','ssd_55cd2e404b78c3b5',34],
['hdd_5000c500634cf6d3','ssd_55cd2e404b78c3b5',30],
['hdd_5000c500634d2573','ssd_55cd2e404b78c3b5',38],
You can see we just do a sort | uniq -c to count how many times a given SSD appears in the list with the same HDD for a component.
What's great about this mapping is that we can quickly identify other VMs which could be experiencing issues. Say we get a call saying vm001 is showing high write latency: we can quickly tell from the map which other components (VMs) are on the same HDDs/SSDs as vm001, and this allows us to check those VMs to see if they are also experiencing latency issues or if the issue is possibly contained to just a single VM.
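As an illustration, that lookup can be done with a short sketch against the mapping file built earlier (the function name and file name are my own, not part of the original tooling):

def neighbours(mapping_path, vm):
    """Return other VMs sharing an SSD or HDD with the given VM, based on the
    'host ssd_<naa> hdd_<naa> vm' mapping file built earlier."""
    rows = []
    with open(mapping_path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 4:
                rows.append(fields[:4])
    # Cache and capacity disks used by the VM of interest.
    disks = {d for _, ssd, hdd, name in rows if name == vm for d in (ssd, hdd)}
    return sorted({name for _, ssd, hdd, name in rows
                   if name != vm and (ssd in disks or hdd in disks)})

# e.g. neighbours("mapping", "vm001") lists the VMs worth checking for the
# same latency symptoms.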
So what does this look like when we put it all together in a Sankey diagram? The wider the flow, the more components are being serviced by the ESX node/SSD/HDD.
Below is the basic Google Charts Sankey diagram with the above mapping data format.
We can also drill down on an SSD and see which VMs have components on it.
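The link data the chart consumes can be assembled from the mapping file with a sketch like this (the file name, cluster label and function name are illustrative):

from collections import Counter

def sankey_rows(mapping_path="mapping", cluster="vsan-cluster01"):
    """Build ['from','to',weight] link rows for a Google Charts Sankey from the
    'host ssd_<naa> hdd_<naa> vm' mapping file built earlier."""
    links = Counter()
    with open(mapping_path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 4:
                continue
            host, ssd, hdd, vm = fields[:4]
            # Each component adds one unit of weight to every hop it touches.
            links[(cluster, host)] += 1
            links[(host, ssd)] += 1
            links[(ssd, hdd)] += 1
            links[(hdd, vm)] += 1
    return [[src, dst, count] for (src, dst), count in sorted(links.items())]

if __name__ == "__main__":
    for src, dst, count in sankey_rows():
        print(f"['{src}','{dst}',{count}],")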
We can view a single HDD to VM mapping.
And from a VM back to HDD/SSD.
As nice as this looks it doesn't scale for larger VSAN clusters. We can still use this view for more detailed mapping of single-component data, where we can replace the number of components with the 95th percentile for BW or IOPS/latency so we can see what's driving the workload.
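As a small sketch of that substitution, the weight on each link could become the 95th percentile of per-component samples rather than a count (the sample values below are made up):

import numpy as np

def p95(samples):
    """95th percentile of a set of per-component samples (IOPS, BW or latency)."""
    return float(np.percentile(samples, 95))

# Replace the component count with p95(samples_for_component) when emitting
# the ['from','to',weight] rows.
print(p95([120, 135, 90, 400, 110, 95, 130]))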
What Next? SunBurst Chart
A zoomable sunburst chart allows us to drill down into components, nodes, SSDs and HDDs, and it scales well for a large cluster.

Looking at the image above, this is built using the mapping where we have [ESX Node] -> [SSD] -> [HDD] -> [VM] -> [VM Size] for every component. We also break the VM components down into individual disks, which allows us to gather performance metrics and map them back to the appropriate component name, e.g. vm11000_0 and vm11000_1, which shows vm11000 has two disks; these would be seen as scsi:0:0 and scsi:0:1.
So what does the chart really show us? On the outer ring we have the VM components; moving inward from the edge we have the HDDs that the components use, then the SSDs which make up the disk groups, then the ESX node.
Again, this type of chart can be changed to show IOPS/BW/latency for the VMs instead of size.
The size of each ESX node/SSD/HDD/VM segment depends on the size value, so the larger the colour block, the more work (IOPS etc.) it represents.

In the above image for write IOPS we can see that one VM, udgra010_1, is issuing the majority of the write IOPS; this is also the second disk on the VM. We can also see where the components reside based on the storage policy; here we have three copies based on FTT=2.
Clicking on ESX node 00011 in orange, the chart zooms in and we have the following. We could also then zoom into the SSD/HDD as well, breaking this down further if needed.

Here we can see the split of the two disk groups, where one SSD (cache) is doing more work than the other disk group. We can also see the two VMs contributing to the workload. What is also good with this view is that we can easily identify the other components that could be affected if we had a performance issue with any of the drives.
The chart above is based on the following gist.
The mapping, as before, is in the following format: [ESX Node],[SSD],[HDD],[VM],[Metric].
esx00008,ssd_50000396dc8906b1,hdd_5000039628d0c735,un000903_0,780.509526
esx00011,ssd_500003973c8ab3d9,hdd_5000039628d0ea6d,un000903_0,780.509526
esx00003,ssd_50000396dc8a35a9,hdd_5000c50084cab73b,un000903_0,780.509526
This mapping converts to the following flare.json file, which can be used with the gist to create the view.
{
  "name": "flare",
  "children": [
    {
      "name": "esx00008",
      "children": [
        {
          "name": "ssd_50000396dc8906b1",
          "children": [
            {
              "name": "hdd_5000039628d0c735",
              "children": [
                { "name": "un000903_0", "size": 780.509526 }
              ]
            }
          ]
        }
      ]
    },
    {
      "name": "esx00011",
      "children": [
        {
          "name": "ssd_500003973c8ab3d9",
          "children": [
            {
              "name": "hdd_5000039628d0ea6d",
              "children": [
                { "name": "un000903_0", "size": 780.509526 }
              ]
            }
          ]
        }
      ]
    },
    {
      "name": "esx00003",
      "children": [
        {
          "name": "ssd_50000396dc8a35a9",
          "children": [
            {
              "name": "hdd_5000c50084cab73b",
              "children": [
                { "name": "un000903_0", "size": 780.509526 }
              ]
            }
          ]
        }
      ]
    }
  ]
}
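The conversion from the flat mapping to that nested structure isn't shown in the post; a sketch along these lines would do it (the script and input file names are illustrative):

import csv
import json
import sys

def mapping_to_flare(csv_path):
    """Convert '[ESX Node],[SSD],[HDD],[VM],[Metric]' rows into the nested
    flare.json structure used by the sunburst chart."""
    root = {"name": "flare", "children": []}
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            if len(row) < 5:
                continue
            node = root
            # Walk (or create) the ESX node -> SSD -> HDD levels.
            for name in row[:3]:
                child = next((c for c in node["children"] if c["name"] == name), None)
                if child is None:
                    child = {"name": name, "children": []}
                    node["children"].append(child)
                node = child
            # Leaf: the VM component with its metric as the size.
            node["children"].append({"name": row[3], "size": float(row[4])})
    return root

if __name__ == "__main__":
    # e.g. python mapping_to_flare.py mapping.csv > flare.json
    print(json.dumps(mapping_to_flare(sys.argv[1]), indent=2))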
I will be exploring this further in the next few months and also looking at automating the gathering of performance metrics etc to build these charts.