Our plan is to have the blade firmware (interface and BMC) upgrades completed by next Wednesday. After that it will be time to move up the stack to the I/O devices and the manager. Based on our schedules, these items will probably be scheduled in about 2 weeks.
Impact for the last 3 items:
Based on how the system works, if it functions as designed, we should experience at most one minute of lost connectivity or downtime. Here is an explanation of the process:
A chassis contains 2 fabric extenders (called FEX or I/O modules), one on the A side and one on the B side. These connect to the switch called the Fabric Interconnect (FI or 6120). The UCS Manager (UCSM) tool resides on the 6120s and runs as an active/passive cluster.
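To make the redundancy concrete, here is a toy map of that layout in Python (all names are made up for illustration, not pulled from UCSM). The point is that rebooting one FI takes its whole side of FEX modules with it, while the other side keeps every chassis online:

```python
# Toy model of the A/B fabric layout described above; names are illustrative.
topology = {
    "FI-A": {"ucsm_role": "active",
             "fex": ["chassis1-fexA", "chassis2-fexA", "chassis3-fexA"]},
    "FI-B": {"ucsm_role": "passive",
             "fex": ["chassis1-fexB", "chassis2-fexB", "chassis3-fexB"]},
}

def surviving_paths(rebooting_fi):
    """FEX paths still forwarding while one FI (and its FEXes) reboots."""
    return [fex
            for fi, side in topology.items() if fi != rebooting_fi
            for fex in side["fex"]]

# Rebooting FI-B drops every B-side FEX at once; the A side carries the load.
print(surviving_paths("FI-B"))
```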
Summary of Steps for I/O Upgrade:
1. Confirm all servers have some form of NIC teaming functioning so they can handle the loss of one I/O path.
2. Confirm the northbound Ethernet connections between the FI (6120) and the 6513 switch have spanning-tree portfast enabled.
3. Pre-stage the new firmware on all 6 (2 per chassis) FEX modules to be applied on next reboot.
4. Pre-stage the new firmware on both of the Fabric Interconnects (6120) to be applied on next reboot (I am trying to confirm that you can “activate in startup version” on the 6120).
5. Reboot only the “passive” Fabric Interconnect (6120). Note the FEX modules are an extension of the FI (6120), so when the FI is rebooted the connected FEX modules reboot at the same time (meaning the entire B-side I/O path reboots at once). It can take up to 10 minutes for the full reboot process to complete. During this time all workload traffic should keep functioning via the remaining I/O path.
6. Once step 5 is fully complete, reboot the remaining FI (6120). This should be done soon after confirming step 5 has finished; you want both FIs (6120) running the same code version whenever possible.
Again, we are told the above steps will most likely result in, at worst, a one-minute interruption of I/O to the blade servers. That is, if all functions as designed.
Summary of Steps for UCSM Upgrade (Post I/O upgrade):
1. Perform the UCSM upgrade on the passive FI (6120) device; it will “restart or reboot” the UCSM process, NOT the FI (6120) itself.
2. Perform the UCSM upgrade on the active FI (6120) device; it will “restart or reboot” the UCSM process, NOT the FI (6120) itself.
The UCSM can be upgraded and restarted without affecting applications running on UCS; it does not impact I/O functions.
Putting this into perspective . . . what is the disruption when you have to update code on an Ethernet switch or Fibre Channel switch?
I am really curious to find out whether we actually see about 1 minute of I/O disruption. Logically, I would think there should be no disruption if you have some level of NIC teaming at the blade and you upgrade the B side and then the A side. To be continued . . .
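When we do the FI reboots, a timestamped ping loop against an address behind the fabric should settle the question. Here is a minimal sketch, assuming a Linux box with the standard ping utility; the target address is a placeholder:

```python
# Minimal outage timer: ping once per second and log the length of any gap.
# TARGET is a placeholder; point it at a VM or blade behind the fabric.
import subprocess
import time

TARGET = "10.0.0.50"  # placeholder address
outage_started = None

while True:
    alive = subprocess.call(
        ["ping", "-c", "1", "-W", "1", TARGET],  # Linux ping: 1 packet, 1 s timeout
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ) == 0
    stamp = time.strftime("%H:%M:%S")
    if not alive and outage_started is None:
        outage_started = time.time()
        print(f"{stamp} lost connectivity to {TARGET}")
    elif alive and outage_started is not None:
        print(f"{stamp} recovered after {time.time() - outage_started:.0f} seconds")
        outage_started = None
    time.sleep(1)
```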
We are making progress with the server component of the upgrade process.
Our spare blade with an Emulex mezz. card, running 1.1(1l) code for the interface and BMC, was ready for a real workload. We put one of our production ESX hosts into maintenance mode and powered off the blade. Then we disassociated its Service Profile and associated it to the physical blade running the 1.1(1l) code with the same type of Emulex card. The server booted with no issues; we confirmed all was normal, and after about 30 minutes we had it back in the live cluster.
We repeated this process on 2 additional ESX hosts, so 3 of the 10 ESX servers in the UCS cluster are now running the new firmware with no issues. The plan is to do several more tomorrow, maybe the rest of them. Very positive results.
Two ways to update end-node firmware (meaning the blade):
As I was reading through the “how to upgrade UCS…” release notes, I recalled some early discussions from when I was looking at purchasing UCS. There are 2 ways to update the firmware on the interface, BMC, etc. We have been using the UCSM tool to go to the physical blade and update the firmware at that level. The other way is via the Service Profile Host Firmware Package policy.
This makes it pretty interesting once you think about it. Instead of thinking of firmware by hardware, you think of it by workload (Service Profile). Let's say my W2K3 server interface can only run on 1.0(2b) firmware and I need to make sure that, regardless of the physical blade it is running on, the correct firmware is there. A Service Profile firmware policy can make that happen. So when you move the W2K3 workload from chassis 1 blade 3 to chassis 3 blade 7, the Service Profile (via the Cisco Utility OS) drops in the approved firmware version. Pretty cool to think about.
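Here is a toy model of the idea. This is not the UCSM API, just a sketch of the logic; the profile names and versions are examples:

```python
# Sketch only: a Host Firmware Package pins firmware to the Service Profile,
# so whichever blade the profile lands on gets brought to the approved level.
host_firmware_packages = {
    "w2k3-docmgmt": {"adapter": "1.0(2b)", "bmc": "1.0(2b)"},   # example policy
    "esx-cluster":  {"adapter": "1.1(1l)", "bmc": "1.1(1l)"},
}

def components_to_flash(profile, blade_firmware):
    """Components the Utility OS would re-flash when associating the profile."""
    required = host_firmware_packages[profile]
    return [comp for comp, version in required.items()
            if blade_firmware.get(comp) != version]

# Moving the W2K3 workload onto a blade still carrying older adapter code:
print(components_to_flash("w2k3-docmgmt",
                          {"adapter": "1.0(1e)", "bmc": "1.0(2b)"}))
# -> ['adapter']
```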
Note there is at least one drawback to the Service Profile approach. The firmware policy is auto-updating, so if you change the firmware version in the policy it will automatically apply the change and restart the server. This means you have to be careful in how you use it as a means to perform updates. (When doing firmware updates via UCSM you DO have the ability to control when you reboot the workload.)
It turns out Cisco released a new version of code, 1.2(1b), on 3-26-10 to support the new M2 blades, along with some bug fixes, etc. We had an internal discussion today on whether we should stay the course with 1.1(1l) or jump up to the 1.2(1b) code. In the end we decided to stay with 1.1(1l) for 2 reasons.
1. The 1.1 code has been revised/updated 12 times (I think that is what the l means), so my gut tells me it is stable. The 1.2 code has probably been updated 2 times (that is why the b?). I could be off base on this; maybe someone can comment if this is not the correct way of interpreting the numbering (see the sketch after this list).
2. Support for the M2 blades. I expect to have some M2 blades in the environment in the June/July timeframe. By then I suspect I will want to be upgrading the code to something newer anyway, so I probably will not be saving a step.
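Here is my reading of those version strings expressed as a quick Python check. Again, this interpretation of the numbering is my assumption, not something Cisco has confirmed:

```python
import re

def parse_ucs_version(v):
    """Assumed reading: major.minor(maintenance release + rebuild letter)."""
    m = re.fullmatch(r"(\d+)\.(\d+)\((\d+)([a-z])\)", v)
    if not m:
        raise ValueError(f"unrecognized version string: {v}")
    major, minor, maint, rebuild = m.groups()
    return {
        "train": f"{major}.{minor}",
        "maintenance": int(maint),
        "rebuild": ord(rebuild) - ord("a") + 1,  # a=1, b=2, ... l=12
    }

print(parse_ucs_version("1.1(1l)"))  # rebuild 12 -> revised a dozen times
print(parse_ucs_version("1.2(1b)"))  # rebuild 2  -> much newer code
```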
Since the UCS product itself is new and the industry continues to move forward with new CPUs, etc., I suspect it is reasonable to plan on upgrading the firmware/code of a UCS system 2 to 4 times per year. This will depend on your organization's growth, but here I suspect things will continue as they have been, heavy on the growth side. It keeps it fun.
As part of our firmware upgrade process we will be rebooting the blades for the firmware to take effect. So as part of my prep work I took some timing measurements today to get a feel for how long our tasks may take.
I measured boot times for VMware, W2K3 and the Cisco Utility OS.
ESX 4.0 host:
VMware function: Put into Maintenance Mode (9 running guests & 8 powered-off guests): 3 min. 18 sec.
VMware function: Power down ESX host via vCenter: 45 sec.
UCS function: Click Boot Server until ping and console appear: 4 min. 10 sec.
VMware function: Time for ESX to fully become available in vCenter: 3 min. 13 sec. additional
Windows 2003 Server:
UCS function: Click Boot Server until the Windows 2003 splash screen: 2 min. 16 sec.
W2K3 function: Time between splash screen and login prompt: 54 sec.
W2K3 function: 38 sec.
Cisco Utility OS, i.e., Service Profile Association:
For any server to run on UCS you first have to associate a Service Profile with a physical blade. This is the process in which the hardware abstraction is performed, by running the Cisco Utility OS on the physical blade. This process does take some time.
UCS function: Associate/Disassociate Service Profile to blade: 5 min. 6 sec.
Note the Associate/Disassociate of a Service Profile to a blade only runs when first associating the two or when a change is made to the Service Profile. This does NOT run on every boot.
Driving into work today I realized I left out some key items regarding upgrading code in UCS Manager (UCSM).
One of the things that really makes UCS different is the central UCSM running on chips in the Fabric Interconnects (6120). Having one location to control and manage the full system really simplifies all management tasks, including firmware updates. OK, that sounds great from a design and marketing perspective, but here is the everyday “does it make your job easier” side.
1. All firmware updates are done in one location: Equipment > Firmware Management. This provides a tree structure displaying all components with their running version, startup version and backup version. You can filter this view any way you like to see only the components you are interested in, so checking which blades are running a given firmware version is a quick process.
NOTE: In the screenshot you will see Servers 2, 3 and 4 are each in a different state regarding firmware. Server 2 is live, running a W2K3 workload on version 1.0(2d) with 1.0(1e) as the backup. Server 3 has 1.1(1l) as its Startup Version with an Activate Status of pending-next-boot. Server 4 is running 1.1(1l) code with 1.0(2d) as the backup; it is ready for an ESX Service Profile to be moved to it for testing next week.
2. Ability to perform firmware updates/changes on one or many components. Since UCSM gives you a full view of the system, you can select either an individual component (blade 2 in chassis 3) or many (all blades in chassis 1) to perform a firmware task. This worked well for us: after doing the pre-stage (called Update Firmware) and activation of new firmware on 2 individual blades, we were comfortable performing a group pre-stage and activation on the remaining 5 spare blades. This is a parallel process, so it takes about the same amount of time to update 1 blade or 10 blades. Also, the UCSM interface is very good at automatically showing where things are in the process, via the Server FSM tab or Status Details on the General tab.
3. Ability to Pre-Stage (Update Firmware) without applying the new firmware. This goes hand in hand with the point above. You can get your new code onto all of the components without impacting the servers. I like this ability because it allows you to really control each step of the process and visually see your current state.
4. Ability to move new firmware to the “Startup Version”, allowing you to control when the new code takes effect. This step is done under the Activate Firmware task. You have the option to have the activation process only place the new firmware as the Startup Version, or to also automatically reboot the device/component so the new firmware goes into use. This step is nice because you as the admin get to decide how you want to “activate” the change. You can choose a slow, methodical, one-by-one approach or a quick rollout, based on your organization's needs.
I hope this information helps with understanding some fundamental differences you have with firmware updates in Cisco UCS.
We have been working on the process for performing the upgrade on the full UCS system to go from 1.0 to 1.1 code, which will support the VIC interface card (Palo Card).
Our approach has been to take things one step at a time and to understand the process as we move forward. We downloaded the 1.1(1l) bin file and release notes and have gone through them. The notes contain pretty basic info; they step you through each part of the process well and provide enough detail to do it. However, since we are in production we are using caution; I do not want any surprises.
Steps in full UCS Code Upgrade:
1. Server Interface firmware upgrade, requires a reboot.
2. Server BMC (KVM) firmware upgrade, reboot is non-disruptive.
3. Chassis FEX (fabric extender), requires reboot.
4. UCS Manager, requires reboot.
5. Fabric Interconnect (6120), requires a reboot.
The first step was to open a ticket with Cisco. This allows us to document everything in the ticket, allows Cisco to use it to pull in the correct resources, and gives us a single point of contact. Next we requested a conference call with TAC and key engineers at Cisco to talk through the process, things to check beforehand, and what has worked in their labs and at other customers.
We covered items like making sure your northbound Ethernet connections (in our case to a 6500) have spanning-tree portfast enabled, confirming your servers are using some type of NIC teaming or have failover enabled on the vNIC in UCSM, and understanding the difference in UCSM between “update firmware” (moves the new code to the device and stages it) and “activate firmware” (moves the new code into the startup position so it takes effect on the next reboot). Special note: when you activate firmware, if you do NOT want the item to automatically reboot, you need to check the box for “Set Startup Version Only”. This moves the new firmware to the “startup” position but does not perform the reboot; when the activation is complete the item will show “Activate Status: pending-next-boot”. However, you do not have this option when doing the BMC (KVM) update. The BMC reboot does not affect the server system, so it reboots as soon as the BMC code is activated.
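To keep the update/activate distinction straight, I think of each component as having three firmware slots. The sketch below is only my mental model of that lifecycle, not the UCSM API:

```python
# Mental-model sketch of the per-component firmware slots in UCSM.
class Component:
    def __init__(self, running):
        self.running = running   # version in use right now
        self.startup = running   # version that loads on the next reboot
        self.backup = None       # staged image, not yet in the boot path

    def update_firmware(self, version):
        """'Update firmware': stage the new code. No impact on the server."""
        self.backup = version

    def activate_firmware(self, set_startup_only=True):
        """'Activate firmware': promote the staged code to startup. With
        set_startup_only=False the component also reboots immediately."""
        self.startup, self.backup = self.backup, self.running
        if not set_startup_only:
            self.reboot()

    def reboot(self):
        self.running = self.startup

blade = Component(running="1.0(2d)")
blade.update_firmware("1.1(1l)")   # staged; server keeps running old code
blade.activate_firmware()          # startup = 1.1(1l): pending-next-boot
blade.reboot()                     # now running 1.1(1l), 1.0(2d) kept as backup
```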
Most of this information can be found in the release notes and I am sure you don’t want me to just reiterate it. We went through all of the upgrade steps the first week UCS arrived, so I am pretty confident the process will work. The challenge now is how to do it on a production system. I am fortunate to have only 17 blades in production. This leaves me with 1 spare blade with an Emulex mezz. card and 6 spare blades with VIC mezz. cards, which gives me a test pool. (Note: these are only “spare” until we get through the upgrade process; then they will go into the VMware cluster and be put to use.)
Goal: Upgrade UCS as Close to Non-Disruptive as Possible:
Our concern is: will a blade function normally running 1.1 code while all the other components are still on 1.0 code? I suspect it will. If it does, the plan is to take our time and update all of the blades over a week or so, following the steps below.
This week we updated and activated the firmware on the interface cards for all 7 spare blades. Next we did the same for the BMC firmware on the 7 spare blades.
The next step is to put an ESX host into maintenance mode and move its Service Profile to the blade running 1.1 code with an Emulex mezz. card. We can confirm the ESX host functions and then move some test, dev, and then low-end prod servers to this ESX host. This will allow us to develop a comfort level with the new code. This step should fit into our schedule early next week. If we see no issues, we will proceed with putting additional ESX hosts into maintenance mode, then in UCSM update the firmware, activate the firmware and reboot the blade, host by host (see the sketch below). This process will allow the ESX cluster blades to get fully updated with no impact on the applications.
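The per-host loop we are planning looks roughly like the sketch below. Every helper is a hypothetical stub standing in for the corresponding vCenter or UCSM action (none of these are real API calls); what matters is the ordering:

```python
# Rolling-upgrade sketch; helpers are hypothetical stubs for the real
# vCenter/UCSM steps, included only so the ordering is explicit and runnable.
def enter_maintenance_mode(blade): print(f"{blade}: vMotion guests off, maintenance mode")
def update_firmware(blade, v):     print(f"{blade}: pre-stage {v}")
def activate_firmware(blade):      print(f"{blade}: activate, set startup version only")
def reboot_blade(blade):           print(f"{blade}: reboot, new code takes effect")
def wait_for_host(blade):          print(f"{blade}: wait for host to reconnect to vCenter")
def exit_maintenance_mode(blade):  print(f"{blade}: back into the DRS cluster")

NEW_CODE = "1.1(1l)"
esx_blades = ["chassis1-blade5", "chassis2-blade3"]  # example names

for blade in esx_blades:           # strictly one blade at a time
    enter_maintenance_mode(blade)
    update_firmware(blade, NEW_CODE)
    activate_firmware(blade)
    reboot_blade(blade)
    wait_for_host(blade)
    exit_maintenance_mode(blade)
```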
For our W2K8 server running vCenter we will perform a backup and schedule a 15-minute downtime to reboot it and move its Service Profile to our spare blade running 1.1 code with the Emulex mezz. card. We can then confirm functionality, etc. If there are issues we can just move the Service Profile back to the blade running 1.0 code and we are back in service. This same process will be repeated for the 2 W2K3 servers running on UCS blades.
By using the flexibility built into VMware that we all use and love (vMotion, DRS, etc.) and the hardware abstraction provided by UCS Service Profiles, the process of updating the blade interface and BMC firmware should be straightforward, with minimal impact on end users. I will give an update next week . . .
I had blogged about how clean the installation of a chassis is, needing only 4 cables for I/O and 4 power cables. Via Twitter someone asked for a pic of this and I realized I did not have one posted. I have added it to my Picasa page, which contains about 50 UCS pictures (it is the last one).
Note how open the airflow is when you only have a few cables to deal with. You can see they were paying attention to this detail on the fabric extenders and where the power plugs in; look at all of the openings.
You can also see I am only using 2 of the 4 possible ports in each fabric extender (FEX). The options are 1, 2 or 4 connections per FEX.
We have been running Cisco UCS for 4 months now and are preparing for a code upgrade and adding more B200 blades to the system for VMware. So I was thinking: what do I really have running in production on the system at this point? It makes sense to have a good handle on this as part of our code upgrade prep work. I put together the information below and figured others could find it useful for getting a perspective on what is in production on UCS in the real world (note all of the blades referenced are the B200 model running with the Emulex card).
Windows 2008 and 2003 Servers:
I will start with a cool one. Tuesday we went live with our VMware vCenter server loaded bare metal on a UCS blade with boot from SAN. This is W2K8 64-bit, vCenter 2.x with Update Manager, running a SQL 2008 64-bit database (used by vCenter). It has 1 Nehalem 4-core CPU and 12 GB of memory and is running sweet. This is a big show of trust in UCS: the center of my VMware world for the enterprise running on it!
2 server blades boot from SAN (1 prod and 1 test) running W2K3 64-bit with Oracle 10g for our document management system. Each has 1 Nehalem 4-core CPU and 48 GB of memory and is running with no issues.
VMware ESX Hosts:
4 production VMware ESX 4.0 hosts without EMC PowerPath/VE. All boot from SAN, with 2 4-core CPUs and 48 GB of memory. These 4 ESX servers are configured to optimally support W2K8 64-bit Microsoft clusters; we currently run 4 2-node MS clusters on these blades. They are using about 37% of the memory and barely touching the CPU, so we could easily double the number of MS clusters on these blades over time.
10 production VMware ESX 4.0 hosts with EMC PowerPath/VE. All boot from SAN, with 2 4-core CPUs and 96 GB of memory. Today we have 87 guest servers running on our UCS VMware cluster, and this number increases daily. We are preparing for a few application go-lives that use Citrix XenApp to access the application, so we have another 47 of these servers built and ready to be turned on by the end of the month. We should therefore have well over 127 guest servers running on the UCS VMware cluster by then.
Here is a summary of the types of production applications/workloads that are up on the current 87 guest servers:
NOTE: The 10 guest servers listed below for the data warehouse are very heavy on memory (3 with 64 GB, etc.) and we have hard-allocated this memory to them, meaning each guest is assigned and allocated its full memory on boot, even if it is not using it. So these large servers are really using memory resources differently than you normally would within the shared-memory function of VMware.
10 servers running the data warehouse app: 5 heavy SQL 2008 64-bit servers, with the rest being web and interface servers.
15 Document Management servers running W2K3 Server, including IBM WebSphere.
39 W2K3 64-bit servers running Citrix XenApp 4.5 in production, delivering our enterprise applications. The combination of these servers is probably handling applications for about 400 concurrent production users. This will increase significantly within 21 days with the coming go-lives.
7 W2K8 64-bit servers that provide core Citrix XenApp DB functions (SQL 2008) and Citrix Provisioning servers for the XenApp servers.
1 W2K3 server running SQL 2005 for computer based learning; production for enterprise.
1 W2K3 server running SQL 2005 for production enterprise staff scheduling system.
3 W2K3 servers running general production applications (shared servers for lower end type apps).
3 W2K3 servers running interface processors for the surgical (OR) application (deals with things like collar-bone surgeries).
1 W2K3 server running a key finance application.
1 W2K3 server running a key pharmacy application.
1 W2K8 server running a pilot SharePoint site (the free version).
There are a few other misc guest servers running as well for various lower-end functions, e.g., web servers.
Current VMware Utilization:
Average CPU utilization across the 10 hosts in the UCS cluster: 8.8%.
Average memory utilization for the 3 ESX hosts running guests with hard-allocated 64 GB memory: 76%.
Average memory utilization for the 7 ESX hosts running all other workloads: 41%.
We still have a good amount of growth headroom within our 10-host UCS cluster. I believe I could run this full load on 6 blade servers if I had to for a short period of time.
There you have it, a good summary of what a production Cisco UCS installation looks like in the real world.
Over the years at my organization we have often commented on the quantity of work our IT group does to meet the needs of the healthcare organization. We did all of the upgrades for Y2K, expanded with mergers, and installed more and more specialized systems, all while not really adding FTEs to the mix. Through this time I have found the quality of work produced to be very high, with a focus on why we do what we do: to provide quality healthcare.
Recently, contemplating the successful implementation of Cisco UCS and our clinical data warehouse project got me thinking about what went into that success. I have touched on some of it in this blog: getting funding, short timelines, a positive experience with Cisco Advanced Services, the UCS technology, and how well my staff worked together.
I think I may have overlooked how strong a role a staff that works well together can play in the success of a UCS implementation. This insight has grown out of a lot of interaction with other organizations on reference calls for Cisco UCS. I have been able to take for granted skilled engineers with many years of experience in SAN, LAN, servers, virtualization and application delivery, who all share the same reporting structure and get along. In addition, we have management that recognizes individual skills and talents and allows staff to explore new technologies. So I want to give a big thank you to my team for knocking it out of the park with Cisco UCS. You rock!
Yes, Cisco UCS is a very cool, state-of-the-art system for delivering compute capacity, but a successful implementation does not just happen. I think UCS can sing when an organization is able to work together as a team to pull all of the components together. Talking with some organizations where the server and network folks do not talk with the storage folks, I can see that would be a very difficult environment in which to succeed. On the other hand, putting in a system like Cisco UCS may bring the groups together, and it may simplify and clarify the need for interaction.
Working in the IT group of a hospital, you tend not to have much contact with patients. Your focus tends to be on the best technology to get the job done in a cost-effective manner. I personally try to think about a family member or friend being in the hospital who will be dependent on the decisions and directions we set in IT. Case in point: in selecting Cisco UCS to run a significant amount of our clinical applications, I had to have enough trust in the system that it would function correctly.
I broke my collar-bone 5 weeks ago and it finally needed to be surgically repaired on Tuesday. Yesterday, as I was looking at the x-ray with the new plate and 6 screws, I realized that several of the servers that make up our surgical application system are running live on Cisco UCS. Meaning I trust Cisco UCS so much that I had no concerns about having surgery supported by systems running on it. In fact, because of my knowledge of UCS, my comfort level was higher, because it is more redundant and flexible.