Measuring network throughput through a virtual machine instance in the public cloud is not an exact science, simply because of the unknowns involved. Contrast this with the private cloud, where the devices and the network are usually under the purview of the person conducting the measurement. A few differences stand out.

In the public cloud:
- Deployment specifics, such as the NUMA socket and NIC mapping of the host an instance lands on, are abstracted away
- BIOS settings are non-negotiable, so optimal settings for throughput on the virtual instance are hard to achieve
- Connectivity between two VM instances is opaque: are they hosted on the same server? On different physical hosts? How many switches are traversed in the path? All of these are either unknown or difficult to control
- The traffic generator tools themselves run as virtual instances, and so are subject to the same virtualization settings as the device being measured

In the private cloud:
- The physical servers and the deployment specifics for the guest OS can be controlled
- Since the servers are in-house, all settings for optimal throughput, such as Hyper-Threading and HugePage sizes, can be set and controlled
- Every detail of the measurement topology is under the user's control, so it is easy to manage or change to achieve high throughput
- The traffic generator can be a physical device, guaranteeing that packets are sent at a prescribed rate
Each of the major public cloud providers – AWS, Azure, and GCP – does publish the maximum throughput achievable with each instance type. But these numbers are not guaranteed: they are typically stated as "up to 10/25/50 Gbps" and are measured under ideal test conditions. The number of deployments in each region can be huge, and as the number of instances on a single host increases, they compete for the bandwidth of the shared physical NIC. The expected and the actually measured throughput can therefore be quite different.
With all of these gotchas, there is no de facto standard method for measuring throughput through an instance deployed in the cloud. With that said, here are my top 6 recommendations for measuring throughput:
- Identify the scenario that will be deployed. Will traffic reach the instance under test from the Internet? Or will it receive traffic from another VM instance – in the same region or a different one, in the same or a different Availability Zone? The measurement needs to be tailored to the scenario the deployment will actually be in.
- Create a dummy test environment. Most security instances (such as a next-gen firewall) are deployed as the termination point for a VPN connection from on-prem devices, and the vagaries of Internet connectivity add another dimension to the throughput available. The only way to get as close as possible to the expected network throughput is to deploy a test network that mimics the final deployment and run the measurement tests there.
- Rely on open-source tools, such as iperf3 or netperf. These may not extract the best throughput the instance's internal architecture allows, due to their inherent limitations and their dependency on the underlying Linux kernel. Even so, open-source tools are the best option available until one of the IaaS providers introduces a tool for effective, reliable, and repeatable measurements.
- Run the measurements a few times to account for variation. From experience, even with the best-laid plans, throughput measurements vary between executions. Use your preferred statistical approach to decide on the throughput figure that will eventually be reported.
- Repeat the measurements regularly to account for any increased usage in a specific region over time.
- Decide on your use case and costs. AWS and Azure provide SR-IOV-enabled instances, which are more expensive than standard instances. If the use case justifies the higher cost, ensure that the guest OS is capable of leveraging the technology, and use it when measuring network throughput.
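To make the middle recommendations concrete, here is a minimal sketch that runs a traffic tool several times against a peer instance and reduces the samples to a reportable number. It assumes iperf3 is installed and an iperf3 server (`iperf3 -s`) is listening on the target; the peer IP and run count are placeholders, and the median is just one reasonable choice of statistic.

```python
import json
import statistics
import subprocess


def run_iperf3(server: str, seconds: int = 30) -> float:
    """Run one iperf3 client test and return receive throughput in Gbps.

    Assumes iperf3 is installed locally and "iperf3 -s" is running
    on the target instance (a placeholder address below).
    """
    out = subprocess.run(
        ["iperf3", "-c", server, "-t", str(seconds), "-J"],
        capture_output=True, text=True, check=True,
    ).stdout
    report = json.loads(out)
    return report["end"]["sum_received"]["bits_per_second"] / 1e9


def summarize(gbps_samples: list) -> dict:
    """Reduce repeated measurements to the numbers worth reporting."""
    return {
        "median_gbps": statistics.median(gbps_samples),
        "stdev_gbps": statistics.stdev(gbps_samples),
        "min_gbps": min(gbps_samples),
        "max_gbps": max(gbps_samples),
    }


if __name__ == "__main__":
    # Hypothetical peer address; repeat to smooth run-to-run variation.
    samples = [run_iperf3("10.0.1.25") for _ in range(5)]
    print(summarize(samples))
```

Reporting the median alongside the spread (stdev, min/max) makes it obvious when a region or instance type is simply noisy rather than slow.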
Debugging your way to the maximum available throughput can be a nightmare given the number of variables involved: the host machine, the OS, the region, and the ambient network traffic. As with any debugging, peel the onion and rule out one variable at a time. That could mean trying different regions and different instance types, or stripping the OS down to a bare minimum and adding layers back on top, where possible. It is time-consuming work, and automation will be your best friend along this path of network throughput measurement.
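Automating that elimination process can be as simple as a driver that repeats the same measurement across each variable you want to rule out and logs everything for later comparison. The sketch below is illustrative only: the region and instance-type lists are assumptions, and `measure_gbps` is a placeholder for whatever provisioning scripts and traffic tool (iperf3, netperf, etc.) you actually use.

```python
import csv
import itertools
from datetime import datetime, timezone

# Hypothetical variables to sweep while debugging throughput.
REGIONS = ["us-east-1", "eu-west-1"]
INSTANCE_TYPES = ["c5.large", "c5n.xlarge"]


def measure_gbps(region: str, instance_type: str) -> float:
    """Placeholder: deploy an instance pair and run the traffic tool.

    A real harness would call your provisioning scripts, run an
    open-source generator such as iperf3, and parse the result.
    """
    raise NotImplementedError


def sweep(writer, measure=measure_gbps):
    """Log one measurement per (region, instance type) combination."""
    for region, itype in itertools.product(REGIONS, INSTANCE_TYPES):
        gbps = measure(region, itype)
        writer.writerow(
            [datetime.now(timezone.utc).isoformat(), region, itype, gbps]
        )


if __name__ == "__main__":
    # Append to a running log so results can be compared over time.
    with open("throughput_log.csv", "a", newline="") as f:
        sweep(csv.writer(f))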
I found that Google has helpfully published a method for calculating network throughput. I couldn’t find similar articles from AWS or Azure, but I’d expect the approach towards network bandwidth measurement would be much the same.
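Guidance of this kind generally comes down to standard TCP theory: a single flow cannot exceed its window size divided by the round-trip time (the bandwidth-delay product). A quick calculation, with illustrative numbers, shows why a single stream often falls well short of an instance's advertised line rate:

```python
def max_tcp_throughput_gbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on single-flow TCP throughput: window size / RTT."""
    return (window_bytes * 8) / (rtt_ms / 1000) / 1e9


# Illustrative numbers: a 1 MiB TCP window over a 10 ms round trip
# caps a single flow at roughly 0.84 Gbps, far below a 10 Gbps NIC.
print(max_tcp_throughput_gbps(1 * 1024 * 1024, 10.0))
```

This is also why measurement tools are usually run with multiple parallel streams (e.g. iperf3's `-P` option): aggregate throughput across flows can approach the instance limit even when each flow is window-bound.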
What are your thoughts on network throughput measurement techniques? Any better methods that exist today? Join in the conversation by leaving your comments below.