Our plan is to have the blade firmware (interface and BMC) upgrade completed by next Wed. After that it will be time to move up the stack to the I/O devices and manager. Based on schedules, these items will probably be scheduled in about 2 weeks.
Impact for the last 3 items:
Based on how the system works, if it functions as designed, we should experience at most one minute of lost connectivity or downtime. Explanation of the process:
A chassis contains 2 fabric extenders on an A and B side (called FEX or I/O modules). These are connected to the switch called the fabric interconnect (FI or 6120). The UCS Manager (UCSM) tool resides on the 6120s and runs as an active/passive cluster.
Summary of Steps for I/O Upgrade:
1. Confirm all servers have some form of NIC teaming functioning so they can handle the loss of one I/O path.
2. Confirm the northbound Ethernet connections between the FI (6120) and the 6513 switch have spanning-tree portfast enabled.
3. Pre-stage the new firmware on all 6 (2 per chassis) FEX modules to be applied on next reboot.
4. Pre-stage the new firmware on both of the Fabric Interconnects (6120) to be applied on next reboot (I am trying to confirm that you can “activate in startup version” on the 6120).
5. Reboot only the “passive” Fabric Interconnect (6120). Note the FEX modules are an extension of the FI (6120), so when the FI is rebooted the connected FEX modules reboot at the same time (meaning all of the B side I/O path is rebooted at the same time). It can take up to 10 minutes for the full reboot process to complete. During this time all workload traffic should be functioning via the remaining I/O path.
6. Once step 5 is fully completed perform a reboot on the remaining FI (6120). This should be done soon after confirming step 5 has completed. You want to keep both FI (6120) running the same code version when at all possible.
Again, we are told the above steps will most likely result in an I/O interruption of up to one minute to the blade servers, provided everything functions as designed.
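To make the ordering in the I/O steps concrete, here is a minimal Python sketch of the passive-first sequence. It is only a toy model, not anything UCSM actually runs: the FabricSide class, the plan_reboots and rolling_upgrade functions, and the version strings in the usage lines are all made up for illustration. The one rule it tries to capture is that a side is never rebooted unless the other I/O path is up.

```python
from dataclasses import dataclass


@dataclass
class FabricSide:
    """Hypothetical model of one FI (6120) plus the FEX modules behind it."""
    name: str       # "A" or "B"
    role: str       # UCSM cluster role: "active" or "passive"
    running: str    # firmware currently in use
    startup: str    # pre-staged firmware that loads on next reboot
    up: bool = True


def plan_reboots(sides):
    """Return side names in reboot order: passive side first, then active."""
    ordered = sorted(sides, key=lambda s: s.role != "passive")
    return [s.name for s in ordered]


def rolling_upgrade(sides):
    """Reboot one side at a time, refusing if the other path is down."""
    log = []
    by_name = {s.name: s for s in sides}
    for name in plan_reboots(sides):
        side = by_name[name]
        if not all(s.up for s in sides if s is not side):
            raise RuntimeError("refusing to reboot: other I/O path is down")
        side.up = False                  # FI and its FEX modules go down together
        log.append(f"reboot {name}")
        side.running = side.startup      # pre-staged code takes effect on boot
        side.up = True
        log.append(f"{name} back on {side.running}")
    return log


if __name__ == "__main__":
    # Placeholder versions, not the actual FI code levels.
    fabric = [
        FabricSide("A", "active", "1.0(2d)", "1.1(1l)"),
        FabricSide("B", "passive", "1.0(2d)", "1.1(1l)"),
    ]
    for line in rolling_upgrade(fabric):
        print(line)
```

In real life the "wait until the side is back" part is the 10 minute reboot in step 5, and the check that the other path is up is what NIC teaming in step 1 depends on.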
Summary of Steps for UCSM Upgrade (Post I/O upgrade):
1. Perform the UCSM upgrade on the passive FI (6120) device. It will “restart or reboot” the UCSM process, NOT the FI (6120).
2. Perform the UCSM upgrade on the active FI (6120) device. It will “restart or reboot” the UCSM process, NOT the FI (6120).
The UCSM can be upgraded and restarted without affecting applications running on UCS; it does not impact I/O functions.
Putting this into perspective . . . what is the disruption when you have to update code on an Ethernet switch or Fibre Channel switch?
I am really curious to find out if we do have about 1 minute of I/O disruption. Logically, I would think there is no disruption if you have some level of NIC teaming at the blade and you upgrade the B and then A side. To be continued . . .
We are making progress with the server component of the upgrade process.
Our spare blade with an Emulex mezz. card running 1.1(1l) code for the interface and BMC was ready for a real workload. We put one of our production ESX hosts into maintenance mode and powered off the blade. Then we disassociated the Service Profile and associated it with the physical blade running the 1.1(1l) code with the same type of Emulex card. The server booted with no issues. We confirmed all was normal, and after about 30 minutes we had it back in the live cluster.
We repeated this process on 2 additional ESX hosts and now are running with 3 of the 10 ESX servers in the UCS cluster with the new firmware with no issues. The plan is to do several more tomorrow, maybe the rest of them. Very positive results.
Two ways to update endnode firmware (meaning the blade):
As I was reading through the “how to upgrade UCS…” release notes I recalled some early discussions when I was looking at purchasing UCS. There are 2 ways to update the firmware on the Interface and BMC, etc. We have been using the method of the UCSM tool to go to the physical blade and update it at this level. The other way is via the Service Profile Host Firmware Package policy.
This makes it pretty interesting once you think about it. Instead of thinking of firmware by hardware you think of it as the workload (Service Profile). Let's say my W2K3 server interface can only run on 1.0(2b) firmware and I need to make sure that regardless of the physical blade it is running on that the correct firmware is there. By using a Service Profile firmware policy you can make that happen. So when you move the W2K3 workload from chassis 1 blade 3 to chassis 3 blade 7 the Service Profile (via the Cisco Utility OS) drops in the approved firmware version. Pretty cool to think about.
Note there is at least one drawback to the Service Profile approach. This firmware policy is auto updating, so if you make a change in firmware version it will automatically apply the change and restart the server. This means you have to be careful in how you use this as a means to perform the updates. (When doing firmware updates via UCSM you DO have the ability to control when you reboot the workload.)
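Here is a rough Python sketch of the "firmware follows the workload" idea and of the auto-update drawback. This is my own toy model, not the UCSM object model or API: the Blade and ServiceProfile classes, the host_firmware field and the reboots counter are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Blade:
    """Hypothetical physical blade, identified by chassis and slot."""
    location: str
    firmware: str


@dataclass
class ServiceProfile:
    """Hypothetical workload definition carrying its own firmware policy."""
    name: str
    host_firmware: str              # version the policy requires
    blade: Optional[Blade] = None
    reboots: int = 0                # restarts triggered by policy changes

    def associate(self, blade):
        """Move the workload to a blade; the policy version is dropped in."""
        self.blade = blade
        blade.firmware = self.host_firmware

    def set_host_firmware(self, version):
        """Change the policy. An associated blade is updated and restarted
        right away, which is the drawback described above."""
        self.host_firmware = version
        if self.blade is not None and self.blade.firmware != version:
            self.blade.firmware = version
            self.reboots += 1


if __name__ == "__main__":
    w2k3 = ServiceProfile("W2K3", host_firmware="1.0(2b)")
    target = Blade("chassis 3 blade 7", firmware="1.1(1l)")
    w2k3.associate(target)
    print(target.firmware)          # the blade now carries the policy version
    w2k3.set_host_firmware("1.1(1l)")
    print(w2k3.reboots)             # the policy change cost one restart
```

The point of the sketch is the second method: there is no "pending" state in it, so changing the policy and rebooting the workload are one and the same action.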
Turns out Cisco released a new version of code, 1.2(1b) on 3-26-10 for supporting the new M2 blades as well as some bug fixes, etc. We had an internal discussion today on whether or not we should stay the course with 1.1(1l) or jump up to 1.2(1b) code. In the end we decided to stay with 1.1(1l) for 2 reasons.
1. The 1.1 code has been revised/updated 12 times (I think that is what the L means) so my gut tells me it is stable. The 1.2 code has probably been updated 2 times (that is why the B?). I could be off base on this, maybe someone can comment if this is not the correct way of interpreting the numbering.
2. Support for the M2 blade. I expect to have some M2 blades in the environment by June/July timeframe. By then I suspect I will want to be upgrading the code to something newer anyway, so I probably will not be saving a step.
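My letter-counting guess from reason 1 is easy to write down in a few lines of Python. To be clear, this only encodes my assumption that the trailing letter is a revision counter (a = 1, b = 2, and so on); it is not a documented Cisco rule, and the parse_version and letter_ordinal names are my own.

```python
import re

# Matches strings shaped like "1.1(1l)": major.minor(maintenance + letter).
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\((\d+)([a-z])\)$")


def parse_version(text):
    """Split a version string into (major, minor, maintenance, letter)."""
    match = VERSION_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, maintenance, letter = match.groups()
    return int(major), int(minor), int(maintenance), letter


def letter_ordinal(letter):
    """Position of the letter in the alphabet, ASSUMED to be a revision count."""
    return ord(letter) - ord("a") + 1


if __name__ == "__main__":
    for version in ("1.1(1l)", "1.2(1b)"):
        letter = parse_version(version)[3]
        print(version, letter_ordinal(letter))
```

Under that assumption 1.1(1l) works out to revision 12 and 1.2(1b) to revision 2, which is where the "12 times" and "2 times" above come from.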
Because the UCS product itself is new and the industry continues to move forward with new CPUs, etc., I suspect it is reasonable to plan on upgrading the firmware/code of a UCS system 2 to 4 times per year. This would depend on your organization’s growth, but here I suspect things will continue as they have been, heavy on the growth side. It keeps it fun.
Driving into work today I realized I left out some key items regarding upgrading code in UCS Manager (UCSM).
One of the things that really makes UCS different is the central UCSM running on chips in the Fabric Interconnects (6120). Having one location to control and manage the full system really simplifies all management tasks, including firmware updates. Ok, that sounds great from a design and marketing perspective but here is the everyday “does it make your job easier” side.
1. All firmware updates are done in one location: Equipment > Firmware Management. This provides a tree structure displaying all components, running version, startup version and backup version. You can filter this view any way you like to only see the components you are interested in. So seeing which blades are running x firmware is a quick process.
NOTE: You will see in the screen shot Server 2, 3 and 4 are each in a different state regarding firmware. Server 2 is live running a W2K3 workload with 1.0(2d) version and 1.0(1e) as the backup. Server 3 is in the state of having 1.1(1l) as Startup Version with the Activate Status as pending-next-boot. Then Server 4 is running 1.1(1l) code with 1.0(2d) version as backup. Server 4 is ready for an ESX Service Profile to be moved to it for testing next week.
2. Ability to perform firmware updates/changes to one or many components. Since UCSM gives you a full view of the system you can either select an individual component (blade 2 in chassis 3) or select many (all blades in chassis 1) to perform a firmware task. This worked well for us. After doing the pre-stage (called Update Firmware) and activation of new firmware on 2 individual blades we were comfortable with performing a group pre-stage and activation to the remaining 5 spare blades. This is a parallel process, so it is about the same amount of time to update 1 blade or 10 blades. Also, the UCSM interface is very good at automatically updating where things are in the process via the Server FSM tab or on the General tab — Status Details.
3. Ability to Pre-Stage (Update Firmware) without applying the new firmware. This goes hand in hand with the point above. You can get your new code on all of the components without impacting the servers. I like this ability because it allows you to really control each step of the process and visually see your current state.
4. Ability to move new firmware to “Startup Version” allowing you to control when the new code takes effect. This step is done under the Activate Firmware task. You have the option to either have the activation process only place the new firmware as the Startup Version or it can automatically also reboot the device/component so the new firmware goes into use. This step is nice because you as the admin get to decide how you want to “activate” the change. You can choose a one by one, slow, methodical approach or a quick rollout process based on your organization’s needs.
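The running / startup / backup versions and the two-step Update then Activate flow from the points above can be sketched as a small state model in Python. This is a simplification based on what I see in the Firmware Management view, not the real UCSM internals: the Component class, its method names, and the "ready" status string are my own assumptions ("pending-next-boot" is the status UCSM shows).

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical view of one updatable component (for example a blade)."""
    name: str
    running: str
    startup: str
    backup: str
    activate_status: str = "ready"   # assumed label for "nothing pending"

    def update_firmware(self, version):
        """Pre-stage: copy the new image into the backup slot only."""
        self.backup = version

    def activate(self, version, reboot_now=False):
        """Set the pre-staged image as Startup Version, optionally rebooting."""
        if version != self.backup:
            raise ValueError(f"{version} has not been pre-staged on {self.name}")
        self.startup = version
        self.activate_status = "pending-next-boot"
        if reboot_now:
            self.reboot()

    def reboot(self):
        """Boot into the startup version; the old running code becomes backup."""
        if self.startup != self.running:
            self.running, self.backup = self.startup, self.running
        self.activate_status = "ready"


def running_version(components, version):
    """Filter helper: names of components currently running a given version."""
    return [c.name for c in components if c.running == version]


if __name__ == "__main__":
    blade = Component("Server 3", running="1.0(2d)", startup="1.0(2d)", backup="1.0(1e)")
    blade.update_firmware("1.1(1l)")   # no impact on the running server
    blade.activate("1.1(1l)")          # admin decides when to reboot
    print(blade.activate_status)
    blade.reboot()
    print(blade.running, blade.backup)
```

Passing reboot_now=True to activate is the "quick rollout" choice, and leaving it False is the slow, methodical one where the reboot happens on your schedule.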
I hope this information helps with understanding some fundamental differences you have with firmware updates in Cisco UCS.