HealthITGuy's Blog

The evolving world of healthcare IT!

Cisco UCS 1.4 Code: Upgrade is done, this is cool stuff

I am geeked! We just completed the code upgrade on our production Cisco UCS environment and it was awesome!

We have been in production on Cisco UCS for 1 year and 22 days now and have run on 1.1, 1.2, 1.3 and now 1.4 code. So today was our 3rd code upgrade on the production environment, and each time things have gotten better and cleaner. Why am I so excited? Think about it . . . with UCS there are 2 Fabric Interconnects with some chassis hanging off of them and a bunch of servers, all using a single point of management. Everything is connected with redundancy, and if all of that redundancy is operational and live you truly can reboot half your IO fabric and not drop a ping or storage connection. In the storage world this is standard and expected, but in the blade server world you would think that achieving the same level of high availability and uptime a SAN provides would require a lot of complexity. Enter Cisco UCS! An hour ago we upgraded and rebooted half of the IO infrastructure that serves over 208 production VM Guest Servers running on 21 VMware ESX hosts, plus another 8 Windows server blades (all running active SQL or Oracle databases), without dropping a packet. Then I did the same thing to the other IO infrastructure path with NO ISSUES. This is just badass. I suspect that in a year this type of redundancy and HA in an x86 server environment will be an expectation, not an exception.

UCS Code Upgrade Experiences:

In March 2010 we performed our first in-production upgrade, to 1.2 code (you can check out my blog post for all the details). The major impact we experienced with that one was due to a human issue: we forgot to enable spanning-tree portfast for the EtherChannels connecting our Fabric Interconnects. Our fault, issue fixed, move on. In December 2010 we implemented 1.3 code for a few reasons, mainly related to ESX 4.1 and the Nexus 1K. Our only issue here was with 1 Windows 2003 64-bit server running on a B200 blade with OS NIC teaming, which failed to work correctly. Again, not a UCS code issue but a server OS teaming issue. We had 3 servers using NIC teaming in the OS, so we decided to switch them to the hardware failover mode provided by UCS instead. Changes made, ready to move on. It just so happened that on the same day we did the 1.3 upgrade, Cisco released 1.4 code just in time for Christmas (thanks, SAVBU). This time we had all our bases covered and each step worked as expected; no spanning-tree issues, no OS NIC teaming problems, it was smooth! There was some risk in moving to the new code so fast, but we have several projects that need the new B230 blades ASAP. Several UCS users and partners have already been through 1.4 testing, and things have been looking very good. Thanks to all who provided me with feedback over the last week.

New Features and Functions:

Now we get to dig into all the cool new and functional features in the new code. I am impressed already. I will put together a separate post with my first impressions. I do want to point out one key thing that I referenced above: the need to upgrade the infrastructure to use new hardware (B230 blades). Now that I am on 1.4 code this requirement is gone. Yep, with 1.4 code, they have made changes that will NOT require an upgrade of the IO infrastructure (Fabric Interconnects and UCS Manager) to use new hardware like a B230. So yes, things are sweet with Cisco UCS and it just got even better.

December 29, 2010 Posted by | Cisco UCS, General, VMWare | , , | Leave a comment

UCS Code Upgrade: 24 Seconds Explained

There was a lot of interest in my last post about our code upgrade.  What stood out to many (as well as to us) was the 24-second disruption in network traffic on each fabric interconnect side when it was rebooted.  Meaning: when I upgrade the B side and that fabric interconnect goes into its reboot, all of the Ethernet traffic that was talking on the B side is disrupted for 24 seconds (remember, the fiber channel traffic had NO issues).

The answer was in my core 6513 configuration regarding spanning tree.  I would like to thank Jeremiah at Varrow and the guys at Cisco who helped us figure this out. 

Turns out one of the first configuration confirmation items in the code upgrade process (really, it should have been set up all along . . .) was making sure the port channels the fabric interconnects connect to are configured with spanning-tree portfast trunk.  An email was sent to get this confirmed and configured, but it got missed; too bad it was not in the Cisco prerequisite document as a reminder.  What this command gives you is that if and when the trunk port link to the fabric interconnect goes away for any reason, the 6513 skips the normal spanning-tree timers and quickly allows traffic to flow on the remaining path (in our case, the remaining connection to the fabric interconnects).

We have now enabled spanning-tree portfast trunk on our port channels and should now be positioned to eliminate that pesky 24-second Ethernet disruption that impacted some of the traffic.  Details, details!
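
For anyone who wants to see what that looks like, here is a minimal sketch of the relevant config on the upstream Catalyst side.  The port-channel numbers and descriptions are assumptions based on our environment (Po29 to fabric interconnect A, Po30 to B), and exact syntax can vary a little by IOS version, so treat it as illustrative rather than copy/paste:

    interface Port-channel29
     description Trunk to UCS Fabric Interconnect A
     switchport mode trunk
     spanning-tree portfast trunk
    !
    interface Port-channel30
     description Trunk to UCS Fabric Interconnect B
     switchport mode trunk
     spanning-tree portfast trunk

The key line is spanning-tree portfast trunk; without it the 6513 walks through the normal spanning-tree states when an uplink changes, which is exactly the delay we were seeing.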

May 1, 2010 Posted by | Cisco UCS, UCS Manager | , , | Leave a comment

UCS Code Upgrade Success: Running 1.2(1b) Now!!

I have been blogging for a while now about the planned code upgrade to our production UCS environment, and we finally cut over to 1.2(1b) on all the system components!  Success.  Here is a quick rundown.

We decided to go with the 1.2(1b) code because the main difference between it and 1.1(1l) was the support for the new Nehalem CPU that will be available in the B200 and B250 M2 blades in about a month.  We want to start running with these new CPUs this summer; more cores means more guests and lower cost. 

The documentation from Cisco on the process was pretty good, with solid step-by-step instructions and a clear picture of what to expect.  We followed it closely and did not have any issues; all worked as expected.

We have upgraded the production UCS system!

Here is how we did it:

The first step was to perform a backup of the UCS configuration (you always want a fallback, but we did not need it).

We started with upgrading the BMC on each blade via the Firmware Management tab; this does not disrupt the servers and was done during the day in about 30 minutes.  We took it slow on the first 8 BMCs and then did a batch job for the last 8.

At 6 PM we ran through the “Prerequisite to Upgrade . . .” document a second time to confirm all components were healthy and ready for an upgrade; no issues.  Next we confirmed that all HBA multipath software was healthy and seeing all 4 paths, and that NIC teaming was healthy; no issues.

At 6:30 PM we pre-staged the new code on the FEX (IO modules) in each chassis.  This meant we clicked “Set Startup Version Only” for all 6 modules (2 per chassis times 3 chassis).  Because we checked the box for “Set Startup Version Only” there was NO disruption to any servers; nothing is rebooted at this point.

At 6:50 PM we performed the upgrade to the UCS Manager software, which is a matter of activating it via the Firmware Management tab.  No issues, and it took less than 5 minutes.  We were able to log in and perform the remaining tasks listed below when it was complete.  Note, this step does NOT disrupt any server functions; everything continues to work normally.

At 7:00 PM it was time for the stressful part: the activation of the new code on the fabric interconnects, which results in a reboot of the subordinate side of the UCS system (the B side in my case).  To prepare for this step we did a few things, because all the documentation indicated there can be “up to a minute disruption” of network connectivity during the reboot (it does NOT impact the storage I/O; the fiber channel protocol and multipathing take care of that).  I believe this disruption is related to the arp-cache on the fabric interconnects; here is what we experienced.

UCS fabric interconnect A is connected to the 6513 core Ethernet switch port group 29.  UCS fabric interconnect B is connected to the 6513 core Ethernet switch port group 30.  During normal functioning the traffic is pretty balanced between the two port groups about 60/40. 

My assumption was that when the B side went down for the reboot, we would flush the arp-cache for port group 30 and the 6513 would quickly re-learn that all the MAC addresses now reside on port group 29.  Well, it did not actually work like that . . . when the B side rebooted, the 6513 cleared the arp-cache on port group 30 right away on its own, and it took about 24 seconds (yes, I was timing it) for the disrupted traffic to start flowing via port group 29 (the A side).  Once the B side finished its reboot in 11 minutes (the documentation indicated 10 mins.), traffic automatically began flowing through both the A and B sides again as normal.

So what was happening for those 24 seconds?  I suspect the arp-cache (really the MAC address table) on the A side fabric interconnect still knew all the MACs that had been talking on the B side, so it would not pass that traffic until the entries timed out and were relearned.
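
If you want to watch this from the upstream switch while you do the upgrade, a couple of show commands tell most of the story.  This is only a sketch; the port-channel numbers are from our setup and the exact command forms vary a bit between IOS versions:

    ! MACs currently learned via the B-side uplink
    show mac-address-table interface Port-channel 30
    ! During the B-side reboot, watch the same MACs get relearned on the A side
    show mac-address-table interface Port-channel 29
    ! Confirm portfast trunk is actually in effect on the uplink
    show spanning-tree interface Port-channel 29 detail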

As I have posted previously, we run our vCenter on a UCS blade using NIC teaming.  I had confirmed vCenter was talking to the network on the A side, so after we experienced the 24-second disruption on the B side, I forced my vCenter traffic to the B side before rebooting the A side.  This way we did not drop any packets to vCenter (I did this by disabling the NIC in the OS that was connected to the A side, letting NIC teaming use only the B side).

This approach worked great for vCenter; we did not lose connectivity when the A side was rebooted.  However, I should have followed the same approach with all of my ESX hosts, because most of them were talking on the A side.  VMware HA did not like the 27-second disruption and was confused for a while afterwards (however, full HA did NOT kick in).  All of the hosts came back, as did all of the guests except for 3: 1 test server, 1 Citrix Provisioning server and 1 database server had to be restarted due to the disruption in network traffic (again, the storage I/O was NOT disrupted; multipath worked great).

Summary:

Overall it went very well and we are pleased with the results.  Our remaining task is to apply the blade BIOS updates to the rest of the blades (we did 5 of them tonight) using the Service Profile policy — Host Firmware Packages.  These will be done by putting each ESX host into Maintenance Mode and rebooting the blade.  It takes about 2 utility boots for it to take effect, or about 10 minutes per server.  Should have this done by Wednesday.

What I liked:

—  You have control of each step; as the admin, you get to decide when to reboot components.
—  You can update each item one at a time or in batches as your comfort level allows.
—  The documentation was correct and accurate. 

What can be improved:

—  Need to eliminate the 24-to-27-second Ethernet disruption, which is probably due to the arp-cache.  Cisco has added a “MAC Address Table Aging” setting in the Equipment Global Policies area; maybe that already addresses it.

April 27, 2010 Posted by | Cisco UCS, UCS Manager | | 4 Comments

UCS Upgrade: I/O Path Update 2 (quick)

In my last post regarding the UCS firmware upgrade process I described our proposed upgrade steps for the FEX modules and the Fabric Interconnects.  At the time, the best information I had gathered (release notes, TAC, etc.) indicated the possibility of a one-minute interruption of I/O.

One of the great things about coming out to Cisco in San Jose is being able to talk to the people in the business unit who build this stuff.  I will have more in-depth discussions tomorrow, but the quick news is some clarification on the “one minute interruption of I/O”.  It turns out this really refers only to the Ethernet side of the I/O.  Your fiber channel storage network will not see any disruption because of the multipath nature of FC, so no SAN connectivity disruption.  On the Ethernet side, it sounds like any disruption really comes down to the MAC tables on the fabric interconnects and a few other failover functions.  The smaller the MAC tables, the quicker the “failover”.
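
As a rough sanity check you can at least see how big the relearning job is from the upstream switch; the tables that matter for the failover live on the fabric interconnects themselves, but the count gives you a feel for the scale.  Just a sketch, and syntax varies by IOS version:

    ! How many MAC addresses has the core switch learned in total?
    show mac-address-table count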

Like I said, I will get more details tomorrow but wanted to get this new info out to everyone.  I am feeling pretty good about our upgrade next week.

April 6, 2010 Posted by | Cisco UCS | , | Leave a comment

UCS Upgrade: I/O Path Update 1

Our plan is to have the blade firmware (interface and BMC) upgrade completed by next Wed.  After that it will be time to move up the stack to the I/O devices and manager.  Based on schedules, these items will probably be scheduled in about 2 weeks.

Impact for the last 3 items:

Based on how the system works, if it functions as designed, we should experience up to a one-minute loss of connectivity or downtime.  Explanation of the process:

A chassis contains 2 fabric extenders on an A and B side (called FEX or I/O modules).  These are connected to the switch called the fabric interconnect (FI or 6120).  The UCS Manager (UCSM) tool resides on the 6120s and runs as an active/passive cluster.

Summary of Steps for I/O Upgrade:

1.  Confirm all servers have some form of NIC teaming functioning so they can handle the loss of one I/O path.

2.  Confirm the northbound Ethernet connections between the FIs (6120) and the 6513 switch have spanning-tree portfast enabled (a quick check from the 6513 side is sketched just after this list).

3.  Pre-stage the new firmware on all 6 (2 per chassis) FEX modules to be applied on next reboot.

4.  Pre-stage the new firmware on both of the Fabric Interconnects (6120) to be applied on next reboot (I am trying to confirm that you can “activate in startup version” on the 6120).

5.  Reboot only the “passive” Fabric Interconnect (6120).  Note the FEX modules are an extension of the FI (6120), so when the FI is rebooted the connected FEX modules reboot at the same time (meaning all of the B side I/O path is rebooted at the same time).  It can take up to 10 minutes for the full reboot process to complete.  During this time all workload traffic should be functioning via the remaining I/O path.

6.  Once step 5 is fully completed, perform a reboot of the remaining FI (6120).  This should be done soon after confirming step 5 has completed; you want to keep both FIs (6120) running the same code version whenever possible.
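
Regarding step 2, a quick way to spot-check the northbound side from the 6513 before touching either fabric interconnect (the port-channel numbers are ours; adjust for your own uplinks):

    show running-config interface Port-channel 29 | include portfast
    show running-config interface Port-channel 30 | include portfast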

Again, we are told the above steps will most likely result in up to a one-minute interruption of I/O to the blade servers.  That is, if all functions as designed.

Summary of Steps for UCSM Upgrade (Post I/O upgrade):

1.  Perform the UCSM upgrade on the passive FI (6120) device; it will “restart or reboot” the UCSM process, NOT the FI (6120).

2.  Perform the UCSM upgrade on the active FI (6120) device; it will “restart or reboot” the UCSM process, NOT the FI (6120).

The UCSM can be upgraded and restarted without affecting applications running on UCS; it does not impact I/O functions.

Putting this into perspective . . . what is the disruption when you have to update code on an Ethernet switch or Fiber Channel switch? 

I am real curious to find out if we do have about 1 minute of I/O disruption.  Logically, I would think there is no disruption if you have some level of NIC teaming at the blade and you upgrade the B and then A side.  To be continued . . .

March 31, 2010 Posted by | Cisco UCS, UCS Manager | , , | Leave a comment

UCS Upgrade: Step One the Servers: Part 3

We are making progress with the server component of the upgrade process.

Our spare blade with an Emulex mezz. card, running 1.1(1l) code for the interface and BMC, was ready for a real workload.  We put one of our production ESX hosts into maintenance mode and powered off the blade.  Then we unassociated the Service Profile and associated it to the physical blade running the 1.1(1l) code with the same type of Emulex card.  The server booted with no issues; we confirmed all was normal, and after about 30 minutes we had it back in the live cluster.

We repeated this process on 2 additional ESX hosts and are now running 3 of the 10 ESX servers in the UCS cluster on the new firmware with no issues.  The plan is to do several more tomorrow, maybe the rest of them.  Very positive results.

Two ways to update endnode firmware (meaning the blade):

As I was reading through the “how to upgrade UCS…” release notes I recalled some early discussions from when I was looking at purchasing UCS.  There are 2 ways to update the firmware on the interface, BMC, etc.  We have been using the UCSM tool to go to the physical blade and update it at that level.  The other way is via the Service Profile Host Firmware Package policy.

This makes it pretty interesting once you think about it.  Instead of thinking of firmware by hardware, you think of it by workload (Service Profile).  Let's say my W2K3 server's interface can only run 1.0(2b) firmware and I need to make sure the correct firmware is there regardless of the physical blade it is running on.  By using a Service Profile firmware policy you can make that happen.  So when you move the W2K3 workload from chassis 1 blade 3 to chassis 3 blade 7, the Service Profile (via the Cisco utility OS) drops in the approved firmware version.  Pretty cool to think about.

Note there is at least one drawback to the Service Profile approach.  The firmware policy is auto-updating, so if you change the firmware version it will automatically apply the change and restart the server.  This means you have to be careful in how you use it as a means to perform updates.  (When doing firmware updates via UCSM you DO have the ability to control when you reboot the workload.)

March 30, 2010 Posted by | Cisco UCS, UCS Manager | , , | Leave a comment

UCS Code Versions: How to choose . . .

Turns out Cisco released a new version of code, 1.2(1b), on 3-26-10, adding support for the new M2 blades along with some bug fixes, etc.  We had an internal discussion today on whether we should stay the course with 1.1(1l) or jump up to 1.2(1b).  In the end we decided to stay with 1.1(1l) for 2 reasons.

1.  The 1.1 code has been revised/updated 12 times (I think that is what the “l” means), so my gut tells me it is stable.  The 1.2 code has probably been updated 2 times (hence the “b”?).  I could be off base on this; maybe someone can comment if this is not the correct way to interpret the numbering.

2.  Support for the M2 blades.  I expect to have some M2 blades in the environment by the June/July timeframe.  By then I suspect I will want to upgrade the code to something newer anyway, so I probably would not be saving a step.

Since the UCS product itself is new and the industry continues to move forward with new CPUs, etc., I suspect it is reasonable to plan on upgrading the firmware/code of a UCS system 2 to 4 times per year.  This will depend on your organization’s growth, but here I suspect things will continue as they have been: heavy on the growth side.  It keeps things fun.

March 30, 2010 Posted by | Cisco UCS | , | 1 Comment

UCS Upgrade: Step One the Servers

We have been working on the process for upgrading the full UCS system from 1.0 to 1.1 code, which adds support for the VIC interface card (the Palo card).

Our approach has been to take things one step at a time and to understand the process as we move forward.  We downloaded the 1.1(1l) bin file and release notes and have gone through them.  Pretty basic info in the notes; they step you through each part of the process well and provide enough detail to do it.  However, since we are in production we are using caution; I do not want any surprises.

Steps in full UCS Code Upgrade:

1.  Server Interface firmware upgrade, requires a reboot.
2.  Server BMC (KVM) firmware upgrade, reboot is non-disruptive.
3.  Chassis FEX (fabric extender), requires reboot.
4.  UCS Manager, requires reboot.
5.  Fabric Interconnect (6120), requires a reboot.

First step was to open a ticket with Cisco.  This allows us to document everything in the ticket, lets Cisco pull in the correct resources and gives us a single point of contact.  Next we requested a conference call with TAC and key engineers at Cisco to talk through the process, things to check beforehand, and what has worked in their labs and for other customers.

Items like: making sure your northbound Ethernet connections (in our case a 6500) have spanning-tree portfast enabled, confirming your servers are using some type of NIC teaming or have failover enabled on the vNIC in UCSM, and understanding the difference in UCSM between “update firmware” (moves the new code to the device, i.e. stages it) and “activate firmware” (moves the new code into the startup position so it takes effect on the next reboot).  Special note: when you activate firmware, if you do NOT want the item to reboot automatically you need to check the box for “Set Startup Version Only”.  This moves the new firmware to the “startup” position but does not perform the reboot.  When the activation is complete the item will show “Active Status: pending-next-reboot”.  However, you do not have this option when doing the BMC (KVM) update.  The BMC update reboot does not affect the server system, so the BMC will reboot once its code is activated.

Most of this information can be found in the release notes, and I am sure you don’t want me to just reiterate it to you.  We went through all of the upgrade steps the first week UCS arrived, so I am pretty confident the process will work.  The challenge now is how to do it on a production system.  I am fortunate to have only 17 blades in production.  This leaves me with 1 spare blade with an Emulex mezz. card and 6 spare blades with the VIC mezz. cards.  This gives me a test pool of blades (note: these are only “spare” until we get through the upgrade process; then they will go into the VMware cluster and be put to use).

Goal:  Upgrade UCS as Close to Non-Disruptive as Possible:

Our concern is: will a blade function normally running 1.1 code while all the other components are still on 1.0 code?  I suspect it will.  If it does, the plan is to take our time and update all of the blades over a week or so, following the steps below.

This week we updated and activated the firmware on the Interface Cards for all 7 spare blades.  Next we did the same thing for the BMC firmware for the 7 spare blades.

The next step is to put an ESX host into maintenance mode and move its Service Profile to the blade running 1.1 code with an Emulex mezz. card.  We can confirm the ESX host functions and then move some test, dev, and then low-end prod servers to this ESX host.  This will allow us to develop a comfort level with the new code.  This step should fit into our schedule early next week.  If we see no issues we will proceed with putting additional ESX hosts into Maintenance Mode, then in UCSM updating the firmware, activating the firmware and rebooting the blade.  This process will let the ESX cluster blades get fully updated with no impact on the applications.

For our W2K8 server running vCenter we will perform a backup and schedule 15 minutes of downtime to reboot it and move the Service Profile to our spare blade running 1.1 code with the Emulex mezz. card.  We can then confirm functionality, etc.  If there are issues we can just move the Service Profile back to the blade running 1.0 code and we are back in service.  The same process will be repeated for the 2 W2K3 servers running on UCS blades.

Summary:

By using the flexibility built into VMware that we all use and love (vMotion, DRS, etc.) and the hardware abstraction provided by UCS Service Profiles, the process of updating the blade interface and BMC firmware should be straightforward, with minimal impact on the end users.  I will give an update next week . . .

March 25, 2010 Posted by | Cisco UCS, UCS Manager, VMWare | , | 2 Comments