Network performance expectations and troubleshooting
Overview
Once you have run your network performance tests to each source and target environment, you should confirm that they meet expectations. Corporate networks commonly leverage 10Gbps "line speeds" between servers. Although networks are shared environments, switching infrastructure is typically used to isolate traffic between different hosts, with the goal of allowing each host to reach its potential. While many environments perform within the 70-90% range of line speed, in well-maintained environments we can see 90%+ line speed.
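As a rough illustration, 10Gbps corresponds to about 1250 MB/s of raw capacity, so 70-90% of line speed is roughly 7-9 Gbps, or approximately 875-1125 MB/s; actual payload throughput will also depend on protocol overhead.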
In some circumstances, you can exceed typical expectations by implementing additional best practices: for example, by placing two VMs on the same hypervisor (e.g. ESX) host, in the same chassis or blade enclosure, or by using teamed (bonded) NICs for extra bandwidth. Furthermore, when Jumbo Frames are implemented correctly, they provide a substantial improvement.
If your network is not meeting expectations, you will almost always need help from someone whose role is focused on networking to arrive at a root cause. You may also need to obtain temporary access to other VMs and/or physical hosts to isolate the issue(s). Additionally, you may need the help of Delphix Support or Professional Services to perform some tests from the Delphix Engine to systems that are not already connected environments within the Delphix product. (The CLI test will only work for environments connected to the Delphix Engine.)
With or without assistance, you can do a number of things to narrow down the potential causes of poor performance between two systems by a process of elimination. Below are some high-level steps to consider. Keep in mind that network throughput is limited by the least performant component in the path, so many of the steps below are intended to help isolate which component may be performing poorly.
Troubleshooting and information-gathering questions, in rough order of priority - record and share your answers
Have all source/target tuning settings been applied?
Is AIX in scope? LSO / LRO can have a significant impact, as can Jumbo Frames (see below)
What is the link speed on the hosts in question? Is NIC teaming / bonding / LACP in use?
Linux: ethtool <device>
Solaris: dladm show-phys
Windows: wmic NIC where "NetEnabled='true'" get "Name,Speed"
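If NIC teaming / bonding is in use, you can confirm its status on the hosts themselves. As an example (assuming a Linux bond interface named bond0 and, on Windows, native LBFO teaming; adjust names for your environment):
Linux: cat /proc/net/bonding/bond0
Windows (PowerShell): Get-NetLbfoTeam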
What are the test results with greater or fewer connections in parallel?
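If a general-purpose tool such as iperf3 is available on both hosts, you can compare single-stream and multi-stream results. This is an illustrative sketch, not part of the Delphix CLI test:
Server: iperf3 -s
Client: iperf3 -c [server_IP] -P 1 (single connection)
Client: iperf3 -c [server_IP] -P 8 (eight connections in parallel)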
Can we test throughput to alternate servers? (See below)
What is the overall latency? What is the latency to each (OSI Layer 3) hop? Is there one hop that consistently has a higher cost? (Check the latency with ping and the hops with traceroute.)
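For example (substituting the Delphix Engine IP or another target host):
Linux: ping -c 20 [Delphix_Engine_IP]
Linux: traceroute [Delphix_Engine_IP]
Windows: ping -n 20 [Delphix_Engine_IP]
Windows: tracert [Delphix_Engine_IP]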
How many devices (OSI layer 2) are in the path? Your network team will need to help you identify these devices.
Note: Only Layer 3 devices will show up in a traceroute; however, each Layer 2 device can impact traffic, and each needs to be configured correctly when implementing Jumbo Frames
Example devices in path to physical server: 1. virtual NIC -> 2. Virtual switch (ESX) -> 3. Chassis NIC -> 4. Rack switch -> 5. Core Switch -> 6. Rack Switch -> 7. Physical NIC
Example devices in path to virtual server: 1. virtual NIC -> 2. Virtual switch (ESX) -> 3. Chassis NIC -> 4. Rack switch -> 5. Chassis NIC -> 6. Virtual Switch (ESX) -> 7. Virtual NIC
What is the average network utilization on each hop? Is there congestion on any hop? (Network team will need to review)
Is QoS / VirtualConnect / 802.1p enabled? At what threshold will it engage? (Network team will need to review)
Is there a firewall or any deep packet inspection in the route? (Network team will need to review)
Are Jumbo Frames enabled on any or all hops? E.g. Delphix Engine, Virtual NIC, Virtual switch, and all hops down to the destination. (Network team will need to review)
We have often seen Delphix installations benefit by 10-20% from Jumbo Frames, but certain platforms (such as AIX) can benefit much more dramatically
Note that enabling Jumbo Frames on two hosts without confirming that every network device in the path is properly configured will result in VERY poor performance
Test Jumbo Frames with the "Do Not Fragment" flag set, from the remote host to the Delphix Engine.
Note that typical MTU Jumbo Frame setting is 9000 bytes, although some vendors recommend a little above or below this.
The test below is at 8000, but you can test larger from there. Our goal is primarily to ensure that a number substantially larger than 1500 and somewhat close to the 9000 "de facto" standard is working.
Whenever two hosts connect, they perform Path MTU Discovery, through which they agree on the largest MTU that the path between them supports. This is how impact is avoided when communicating with hosts that have differing MTUs.
Linux$ ping -M do -s 8000 [Delphix_Engine_IP]
Windows> ping -f -l 8000 [Delphix_Engine_IP]
Solaris v10-# traceroute -F [Delphix_Engine_IP] 8000 ("Do Not Fragment" is not supported by ping on Solaris until v11)
Solaris v11+# ping -s -D [Delphix_Engine_IP] 8972
Depending on the results above, a dedicated network or VLAN may help. Consider if that is an option for you. (Your network team will need to review)
Testing throughput to alternate servers
This will help isolate where a problem may be.
Delphix to Server A – already known
Delphix to Server B – physical; helps us see if there is a problem with the original server NIC or physical network settings
Server B to Server C – physical; helps us see if there is a problem with the Delphix server NIC or physical network settings
Delphix to Server D – virtual; helps us see if there is a problem with the virtual network or Delphix settings
Server D to Server E – virtual; same host; helps us see if there is a problem with the virtual network on the host
Server D to Server F – virtual; different host; helps us see if there is a problem with the virtual network
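As an illustrative sketch, any of these point-to-point tests could be run with a tool such as iperf3, if it is installed on both servers (not a Delphix-specific utility; adjust addresses and options for your environment):
On the receiving server: iperf3 -s
On the sending server: iperf3 -c [receiver_IP] -t 30 -P 4 (30-second test, four connections in parallel)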
Conclusion
If you need further help, please contact Delphix Support or Professional Services to assist in getting the best performance possible from your environment.