HealthITGuy's Blog

The evolving world of healthcare IT!

UCS Code Upgrade Success: Running 1.2(1b) Now!!

I have been blogging for a while about the planned code upgrade to our production UCS environment, and we finally cut over to 1.2(1b) on all the system components!  Success.  Here is a quick rundown.

We decided to go with the 1.2(1b) code because the main difference between it and 1.1(1l) is support for the new Nehalem-generation CPUs that will be available in the B200 and B250 M2 blades in about a month.  We want to start running these new CPUs this summer; more cores mean more guests at a lower cost.

The documentation from Cisco on the process was pretty good, with solid step-by-step instructions and a clear picture of what to expect.  We followed it closely and did not have any issues; everything worked as expected.

We have upgraded the production UCS system!

Here is how we did it:

The first step was to perform a backup of the UCS configuration (you always want a fallback, though we did not need it).
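If you want a second, scripted copy of the configuration alongside the full-state backup from the GUI, something like the rough Python sketch below works against the UCS Manager XML API.  The hostname and credentials are placeholders, and this supplements rather than replaces the normal backup.

```python
# Sketch: grab a configuration snapshot from UCS Manager over its XML API.
# Assumptions: UCSM is reachable at https://ucsm.example.local/nuova (placeholder)
# and the account has read access. This supplements, not replaces, the
# full-state backup taken from the UCS Manager GUI.
import requests
import xml.etree.ElementTree as ET

UCSM_URL = "https://ucsm.example.local/nuova"   # placeholder
USER, PASSWORD = "admin", "password"            # placeholders


def xml_call(body: str) -> ET.Element:
    """POST an XML API request and return the parsed response element."""
    resp = requests.post(UCSM_URL, data=body, verify=False, timeout=60)
    resp.raise_for_status()
    return ET.fromstring(resp.text)


# Log in and get a session cookie.
login = xml_call(f'<aaaLogin inName="{USER}" inPassword="{PASSWORD}" />')
cookie = login.get("outCookie")

# Pull the org-root subtree (service profiles, policies, pools, etc.)
# and save it to disk as a point-in-time snapshot.
dump = xml_call(f'<configResolveDn cookie="{cookie}" dn="org-root" inHierarchical="true" />')
with open("ucsm-org-root-snapshot.xml", "wb") as f:
    f.write(ET.tostring(dump))

# Release the session when done.
xml_call(f'<aaaLogout inCookie="{cookie}" />')
```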

We started by upgrading the BMC on each blade via the Firmware Management tab; this does not disrupt the servers and was done during the day in about 30 minutes.  We took it slow on the first 8 BMCs and then did the last 8 as a batch.

At 6 PM we ran through the “Prerequisite to Upgrade . . .” document a second time to confirm all components were healthy and ready for an upgrade; no issues.  Next we confirmed that the HBA multipath software on each host was healthy and seeing all 4 paths, and that NIC teaming was healthy; no issues.
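These checks can also be scripted.  Here is a minimal sketch that SSHes into each classic ESX service console and dumps the multipath and NIC status so you can confirm the 4 storage paths and both uplinks before touching anything; the host names and credentials are placeholders.

```python
# Sketch: pre-upgrade health check against classic ESX 4.x service consoles.
# Assumptions: SSH is enabled on the service console; host names and
# credentials below are placeholders for your environment.
import paramiko

ESX_HOSTS = ["esx01.example.local", "esx02.example.local"]  # placeholders
USER, PASSWORD = "root", "password"                          # placeholders

CHECKS = [
    "esxcfg-mpath -l",   # storage paths: expect 4 active paths per LUN
    "esxcfg-nics -l",    # physical NICs: expect both fabric uplinks "Up"
]

for host in ESX_HOSTS:
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username=USER, password=PASSWORD)
    print(f"===== {host} =====")
    for cmd in CHECKS:
        stdin, stdout, stderr = ssh.exec_command(cmd)
        print(f"--- {cmd} ---")
        print(stdout.read().decode())
    ssh.close()
```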

At 6:30 PM we pre-staged the new code on the FEX (IO modules) in each chassis.  This meant clicking “Set Startup Version Only” for all 6 modules (2 per chassis across 3 chassis).  Because we checked the “Set Startup Version Only” box, there was NO disruption to any servers; nothing is rebooted at this point.

At 6:50 PM we performed the upgrade of the UCS Manager software, which is simply a matter of activating it via the Firmware Management tab.  No issues, and it took less than 5 minutes.  We were able to log in and perform the remaining tasks listed below once it was complete.  Note: this step does NOT disrupt any server functions; everything continues to work normally.
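If you want to double-check the running versions from a script rather than the Firmware Management tab, a quick query of the firmwareRunning class over the XML API (same placeholder endpoint and credentials as the earlier sketch) lists what every component is actually running:

```python
# Sketch: list running firmware versions via the UCS Manager XML API.
# Assumptions: same placeholder endpoint/credentials as the snapshot sketch;
# the firmwareRunning class is queried with configResolveClass.
import requests
import xml.etree.ElementTree as ET

UCSM_URL = "https://ucsm.example.local/nuova"   # placeholder
USER, PASSWORD = "admin", "password"            # placeholders

login = requests.post(UCSM_URL, verify=False,
                      data=f'<aaaLogin inName="{USER}" inPassword="{PASSWORD}" />')
cookie = ET.fromstring(login.text).get("outCookie")

resp = requests.post(UCSM_URL, verify=False,
                     data=f'<configResolveClass cookie="{cookie}" '
                          f'classId="firmwareRunning" inHierarchical="false" />')

# Print the DN and version of every running-firmware object
# (fabric interconnects, IOMs, adapters, BMCs, UCS Manager itself).
for mo in ET.fromstring(resp.text).iter("firmwareRunning"):
    print(mo.get("dn"), mo.get("version"))

requests.post(UCSM_URL, verify=False, data=f'<aaaLogout inCookie="{cookie}" />')
```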

At 7:00 PM it was time for the stressful part: activating the new code on the fabric interconnects, which results in a reboot of the subordinate side of the UCS system (the B side in my case).  We did a few things to prepare for this step, because all the documentation indicated there can be “up to a minute” of network disruption during the reboot (it does NOT impact storage I/O; the Fibre Channel protocol and multipathing take care of that).  I believe this disruption is related to the ARP cache on the fabric interconnects; here is what we experienced.

UCS fabric interconnect A is connected to the 6513 core Ethernet switch on port group 29.  UCS fabric interconnect B is connected to the 6513 core Ethernet switch on port group 30.  During normal operation the traffic is fairly well balanced between the two port groups, roughly 60/40.

My assumption was that when the B side went down for the reboot, we would flush the ARP cache for port group 30 and the 6513 would quickly re-learn that all the MAC addresses now reside on port group 29.  Well, it did not actually work like that . . . when the B side rebooted, the 6513 cleared the cache on port group 30 right away on its own, and it took about 24 seconds (yes, I was timing it) for the disrupted traffic to start flowing via port group 29 (the A side).  Once the B side finished its reboot in 11 minutes (the documentation indicated 10), traffic automatically began flowing through both the A and B sides again as normal.

So what was happening during those 24 seconds?  I suspect the ARP cache on the A-side fabric interconnect knew all of those MACs had been talking on the B side, so it would not pass that traffic until the entries timed out and were relearned.

As I have posted previously, we run our vCenter on a UCS blade using NIC teaming.  I had confirmed vCenter was talking to the network on the A side, so after we experienced the 24-second disruption on the B side, I forced my vCenter traffic to the B side before rebooting the A side.  This way we did not drop any packets to vCenter (I did this by disabling the NIC in the OS that was connected to the A side and letting NIC teaming use only the B side).

This approach worked great for vCenter; we did not lose connectivity when the A side was rebooted.  However, I should have followed the same approach with all of my ESX hosts, because most of them were talking on the A side.  VMware HA did not like the 27-second disruption and was confused for a while afterwards (full HA did NOT kick in, however).  All of the hosts came back, as did all of the guests except for 3: 1 test server, 1 Citrix Provisioning server and 1 database server had to be restarted due to the disruption in network traffic (again, storage I/O was NOT disrupted; multipathing worked great).
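If I were doing it again, I would pin every host's vSwitch uplinks to the surviving fabric before each reboot, the same way I handled vCenter.  The rough sketch below shows the idea using pyVmomi; the vCenter name, credentials and the vmnic-to-fabric mapping (vmnic0 = A side, vmnic1 = B side) are placeholders and assumptions about the environment.

```python
# Sketch: before rebooting fabric A, make vmnic1 (fabric B) the only active
# uplink on each host's vSwitch so failover happens on our schedule, not on
# the upstream switch's MAC re-learning schedule.
# Assumptions: standard vSwitch named vSwitch0, vmnic0 = fabric A,
# vmnic1 = fabric B, placeholder vCenter name and credentials.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim


def pin_vswitch_uplinks(host, vswitch_name, active_nics, standby_nics):
    """Rewrite the NIC teaming order on a standard vSwitch."""
    net_sys = host.configManager.networkSystem
    vsw = next(v for v in net_sys.networkInfo.vswitch if v.name == vswitch_name)
    spec = vsw.spec
    spec.policy.nicTeaming.nicOrder.activeNic = active_nics
    spec.policy.nicTeaming.nicOrder.standbyNic = standby_nics
    net_sys.UpdateVirtualSwitch(vswitchName=vswitch_name, spec=spec)


ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.local", user="administrator",
                  pwd="password", sslContext=ctx)   # placeholders
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for esx in view.view:
        # Keep only the fabric-B uplink active while fabric A reboots.
        pin_vswitch_uplinks(esx, "vSwitch0", ["vmnic1"], ["vmnic0"])
        print(f"Pinned {esx.name} to vmnic1")
finally:
    Disconnect(si)
```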

Summary:

Overall it went very well and we are pleased with the results.  Our remaining task is to apply the blade BIOS updates to the rest of the blades (we did 5 of them tonight) using the Host Firmware Package policy in the Service Profiles.  These will be done by putting each ESX host into Maintenance Mode and rebooting the blade.  It takes about 2 utility boots for the update to take effect, or about 10 minutes per server.  We should have this done by Wednesday.
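The per-blade loop is mechanical enough to script if you prefer.  Here is a rough pyVmomi sketch with placeholder names; it assumes DRS evacuates the guests once the host enters Maintenance Mode and that the Host Firmware Package policy applies the new BIOS during the utility boots after the reboot.

```python
# Sketch: roll the remaining BIOS updates through one host at a time.
# Assumptions: placeholder vCenter/host names, DRS evacuates guests on
# Maintenance Mode, and the Host Firmware Package policy applies the new
# BIOS during the blade's utility boots after the reboot.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

REMAINING_HOSTS = ["esx03.example.local", "esx04.example.local"]  # placeholders

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.local", user="administrator",
                  pwd="password", sslContext=ctx)   # placeholders
try:
    content = si.RetrieveContent()
    index = content.searchIndex
    for name in REMAINING_HOSTS:
        host = index.FindByDnsName(dnsName=name, vmSearch=False)
        print(f"Entering Maintenance Mode: {name}")
        WaitForTask(host.EnterMaintenanceMode_Task(timeout=1800))
        print(f"Rebooting blade for BIOS update: {name}")
        WaitForTask(host.RebootHost_Task(force=False))
        # The blade runs its utility boots here (~10 minutes); wait for the
        # host to reconnect before exiting Maintenance Mode, either in the
        # GUI or with host.ExitMaintenanceMode_Task(timeout=0).
finally:
    Disconnect(si)
```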

What I liked:

—  You have control of each step; as the admin, you decide when to reboot components.
—  You can update each item one at a time or in batches, as your comfort level allows.
—  The documentation was accurate and matched what we actually saw.

What can be improved:

—  Need to eliminate the 24-to-27-second Ethernet disruption, which is probably due to the ARP cache.  Cisco has added a “MAC Address Table Aging” setting in the Equipment Global Policies area; maybe this already addresses it.

April 27, 2010 | Posted in Cisco UCS, UCS Manager

4 Comments

  1. I’m pretty sure that if you disable the uplink on the interconnect, the NICs associated with that side show as disconnected on the ESX host (at least they did during some of our testing). If you do that before upgrading that side’s interconnect, all of the ESX hosts will fail over to the other side (pretty much immediately), cutting that 24-second outage WAY down.

    That is of course assuming that you have multiple nics associated with a vSwitch or dvSwitch.

    I will be trying it this upcoming week and will let you know. Thanks for the writeup!

    Comment by Ron Russell | April 27, 2010 | Reply

  2. Hi Michael, regarding the 24-second lapse: the ARP cache on a 6500 is set to 4 hours by default and typically gets updated by gratuitous ARPs (GARPs). The L2/MAC cache, or CAM table, ages out after 5 minutes by default.

    Losing a port or port channel will cause all the MAC entries learned from that interface to be flushed immediately. At this stage, you will have an ARP entry for the server but no MAC entry. So when traffic destined to the server hits the L2 portion of the 6500, it will have no matching MAC and cause the 6500 to unicast flood the packet. The packet will eventually hit the server and its reply will repopulate the CAM table. So the MAC learning process contributed _some_ to the delay.

    Comment by Lambert Orejola | April 27, 2010 | Reply

  3. Great post. We’re getting ready to do our first firmware upgrade this weekend. I will be keeping a close eye on any Ethernet disruptions.

    Comment by Mike Hurst | May 26, 2010 | Reply

    • Mike, how did your upgrade go?
      Mike

      Comment by healthitguy | June 7, 2010 | Reply

