
Disabling Failback on VMware dvSwitch: Why You Should Consider It


During some recent testing of our 1.0 cloud configurations in our lab, we noticed some pretty surprising behavior in failure scenarios.

The scenario is this.  You have a VMware-based cloud pod running in an Active/Passive configuration (some of our original cloud pods were implemented before vPC was supported on the Cisco Nexus platform).  One Nexus switch reboots.  Everything fails over to the other Nexus switch and works perfectly.  You lost at most one packet.  You’re happy and everyone breathes a sigh of relief.  Then the failed Nexus switch comes back up, and your monitoring platform starts going crazy with alerts for unreachable VMs.  After a few minutes, everything seems to have settled down and it’s all working again.  Then everyone asks, “What the heck just happened?”

What just happened is that the Nexus switch that failed was the primary path for some of your ESX hosts.  When that primary path comes back, if Failback is set to Yes, the dvSwitch will fail traffic back over to it.  Here is the kicker: VMware considers that path active again as soon as it sees the link up, regardless of whether the port is actually in a forwarding state.  So when that Nexus switch comes back up, there is a brief window where the ESX host is sending traffic that isn’t going anywhere.  This can lead the host to consider itself isolated, which of course triggers whatever isolation response you’ve configured (preferably you have chosen “Leave Powered On”).

This, by the way, is one reason portfast is recommended in ESX environments.  It’s also why portfast is enabled by default when you set the spanning-tree port type to edge on a Nexus for your ESX host ports.
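For reference, a minimal sketch of what an ESX host-facing edge port might look like on a Nexus (the interface number, description, and VLAN here are hypothetical placeholders):

    interface Ethernet1/10
      description esx-host-01 vmnic0
      switchport mode access
      switchport access vlan 100
      spanning-tree port type edge
      no shutdown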

We noticed this behavior before we actually figured all this out in the lab, so one thought was that we would manually shut down each ESX host port on the Nexus, one at a time.  We would then reboot the Nexus, and since the ports were shut down anyway, the traffic would not fail back over.  Again, this was before we learned that VMware considers the path active just because it sees voltage going across the wire (which I think is a very bad implementation; I’d like to read more on why they did it this way).  So we did this.  We administratively brought down every ESX host port.  Everything again worked perfectly.  We saw one host at a time fail over to the secondary path.  Then we rebooted.  The Nexus appeared to be coming back up and everything was still responding well.  Then BOOM!  The exact same behavior happened again.

I had one of my engineers capture the interface logs for some of the ports the ESX hosts were attached to.  It turns out the Nexus brings all of the ports back up for two minutes during boot before applying the running configuration to them.  So even though the ports were administratively shut down, the switch brought them up for two minutes, the ESX hosts saw active links and failed their paths back over, traffic started moving, then the Nexus applied the running configuration, which brought the ports back down and triggered failover all over again.  Then of course when we brought each interface back up, yet another failback event had to occur.  Yikes!

I personally think this is a very poor implementation of networking fundamentals by VMware, but I’m sure there are good reasons.  Perhaps link status is the only signal they can use to trigger failover.  Again, I’d like to understand this better and research it some more.

So here is what we decided to do as a result.  Our main goal was to prevent a host from declaring itself isolated and triggering the isolation response:

  1. Disable Failback (Failback: No).  This one step pretty much eliminates the problem from our point of view, and our lab testing confirms it.  Since failover works great, everything follows the secondary path during a failure and stays on that path, even when the primary path for a particular host comes back up.  Because our entire environment is redundant and one side of the infrastructure is fully capable of handling the full load, we really don’t care whether the traffic gets switched back to the primary path.  Yes, it’s nice to know where your traffic is supposed to be going, but in reality VMware tries to balance the primary and secondary paths anyway.  You do give up that balancing if you disable failback, but for our environment we would rather have rock-solid stability.  (See the first sketch after this list for where the setting lives.)
  2. Set additional isolation addresses.  For those who don’t know, if a host fails to receive HA heartbeats from other hosts within a certain time period (13 seconds by default), it will initiate tests to what VMware calls isolation addresses.  By default when using ESXi, the isolation address is the gateway of the ESX management VLAN.  You can change this and set additional isolation addresses to check.  A few options are the upstream Nexus switch (perhaps a loopback address) or, if you’re using iSCSI, NFS, or FCoE, an address on your storage device.  If the host fails to receive any heartbeats and can’t reach any of the isolation addresses, it declares itself isolated, and whatever isolation response you’ve set gets triggered.  (See the second sketch after this list.)
  3. Make sure you are using portfast.  If you have Active/Passive environments using STP, set your ESX host ports to use portfast.  This allows the port to be put into a forwarding state as quickly as possible instead of waiting for STP to run through the Blocking => Listening => Learning => Forwarding cycle, which can take 30-50 seconds depending on your configuration.  Skipping this will definitely lead to host isolation, and your failover scenarios will show a lot of downtime when you test them (and of course in production if you actually implement it that way).  Also remember, Nexus edge ports have portfast behavior enabled by default, so you don’t have to explicitly call it out in your config (see the interface example earlier in this post).
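To make the first item concrete, here is roughly what the Teaming and Failover policy on a dvSwitch portgroup looks like with failback disabled.  The load balancing, failover detection, and uplink values below are just illustrative; the Failback setting is the point:

    dvPortgroup > Edit Settings > Teaming and Failover
      Load Balancing:              Route based on originating virtual port
      Network Failover Detection:  Link status only
      Notify Switches:             Yes
      Failback:                    No    <-- the change discussed above
      Failover Order:              Active: dvUplink1 / Standby: dvUplink2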
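And for the second item, these are the HA advanced options involved, set at the cluster level (Cluster Settings > vSphere HA > Advanced Options).  The addresses shown are hypothetical placeholders for your own environment:

    das.isolationaddress1 = 10.1.1.1         (e.g., a loopback on the upstream Nexus)
    das.isolationaddress2 = 10.1.2.50        (e.g., an interface on your storage device)
    das.usedefaultisolationaddress = false   (optional: skip the default gateway check)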

Hope this helps someone avoid an unplanned, unexpected outage.  We’ve learned a lot of lessons about this over the past week.  For anyone with a networking background (and virtualization folks who want to know how this works in detail), I recommend VMware vSphere 4.1 HA and DRS Technical Deepdive by Duncan Epping and Frank Denneman.  One of our virtualization engineers loaned it to me, and it has some extremely valuable insights into how HA works within a vSphere environment.


