Driving into work today I realized I left out some key items regarding upgrading code in UCS Manager (UCSM).
One of the things that really makes UCS different is the central UCSM running on chips in the Fabric Interconnects (6120). Having one location to control and manage the full system really simplifies all management tasks, including firmware updates. Ok, that sounds great from a design and marketing perspective but here is the everyday “does it make your job easier” side.
1. All firmware updates are done in one location: Equipment — Firmware Management. This provides a tree structure displaying all components, running version, startup version and backup version. You can filter this view any way you like to only see the components you are interested in. So to see which blades are running x firmware is a quick process.
NOTE: You will see in the screen shot Server 2, 3 and 4 are each in a different state regarding firmware. Server 2 is live running a W2K3 workload with 1.0(2d) version and 1.0(1e) as the backup. Server 3 is in the state of having 1.1(1l) as Startup Version with the Activate Status as pending-next-boot. Then Server 4 is running 1.1(1l) code with 1.0(2d) version as backup. Server 4 is ready for an ESX Service Profile to be moved to it for testing next week.
2. Ability to perform firmware updates/changes to one or many components. Since UCSM gives you a full view of the system you can either select an individual component (blade 2 in chassis 3) or select many (all blades in chassis 1) to perform a firmware task. This worked well for us. After doing the pre-stage (called Update Firmware) and activation of new firmware on 2 individual blades we were comfortable with performing a group pre-stage and activation to the remaining 5 spare blades. This is a parallel process, so it is about the same amount of time to update 1 blade or 10 blades. Also, the UCSM interface is very good at automatically updating where things are in the process via the Server FSM tab or on the General tab — Status Details.
3. Ability to Pre-Stage (Update Firmware) without applying the new firmware. This goes hand in hand with the point above. You can get your new code on all of the components without impacting the servers. I like this ability because it allows you to really control each step of the process and visually see your current state.
4. Ability to move new firmware to “Startup Version” allowing you to control when the new code takes effect. This step is done under the Activate Firmware task. You have the option to either have the activation process only place the new firmware as the Startup Version or it can automatically also reboot the device/component so the new firmware goes into use. This step is nice because you as the admin get to decide how you want to “activate” the change. You can choose a one by one, slow, methodical approach or a quick rollout process based on your organization’s needs.
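To make the difference between "Update Firmware" and "Activate Firmware" concrete, here is a minimal sketch in Python. It is a simplified model of the running/startup/backup slots described above, not the UCSM API: the `Component` class, its method names and the slot-swapping behavior are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    """Hypothetical model of one updatable component (e.g. a blade adapter)."""
    name: str
    running: str
    startup: str
    backup: Optional[str] = None
    activate_status: str = "ready"

    def update_firmware(self, version: str) -> None:
        """Pre-stage: place new code in the backup slot; running code is untouched."""
        self.backup = version

    def activate_firmware(self, version: str, set_startup_only: bool = True) -> None:
        """Move staged code to the startup slot; optionally reboot right away."""
        if version != self.backup:
            raise ValueError(f"{version} has not been staged on {self.name}")
        # Assumption: the old startup version drops back into the backup slot.
        self.backup, self.startup = self.startup, version
        self.activate_status = "pending-next-boot"
        if not set_startup_only:
            self.reboot()

    def reboot(self) -> None:
        """On reboot the startup version becomes the running version."""
        self.running = self.startup
        self.activate_status = "ready"


blade = Component("chassis-1/blade-3", running="1.0(2d)", startup="1.0(2d)", backup="1.0(1e)")
blade.update_firmware("1.1(1l)")                             # staged, server not impacted
blade.activate_firmware("1.1(1l)", set_startup_only=True)    # waits for the next boot
print(blade)
blade.reboot()                                               # new code goes into use
print(blade)
```

The point of the model is that each step changes only one thing, so you can stop after any of them and still know exactly what state the component is in.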
I hope this information helps with understanding some fundamental differences you have with firmware updates in Cisco UCS.
We have been working on the process for performing the upgrade on the full UCS system to go from 1.0 to 1.1 code, which will support the VIC interface card (Palo Card).
Our approach has been to take things one step at a time and to understand the process as we move forward. We downloaded the 1.1(1l) bin file and release notes and have gone through them. Pretty basic info in the notes; it steps you through each part of the process well and provides enough info to do it. However, since we are in production we are using caution. I do not want any surprises.
Steps in full UCS Code Upgrade:
1. Server Interface firmware upgrade, requires a reboot.
2. Server BMC (KVM) firmware upgrade, reboot is non-disruptive.
3. Chassis FEX (fabric extender), requires reboot.
4. UCS Manager, requires reboot.
5. Fabric Interconnect (6120), requires a reboot.
First step was to open a ticket with Cisco. This allows us to document everything in the ticket, allows Cisco to use it to pull in the correct resources and gives us a single point of contact. Next we requested a conference call with TAC and key engineers at Cisco to talk through the process, things to check beforehand, what has worked in their labs, and other customers.
The call covered items like making sure your Ethernet connections northbound (in our case a 6500) have spanning-tree port-fast enabled, confirming your servers are using some type of NIC teaming or have failover enabled on the vNIC in UCSM, and understanding the differences in UCSM between “update firmware” (moves the new code to the device, stages it) and “activate firmware” (moves the new code into startup position so next reboot it takes effect). Special note: when you activate firmware, if you do NOT want the item to automatically reboot you need to check the box for “Set Startup Version Only”. This will move the new firmware to the “startup” position but not perform the reboot. When the activation is complete the item will be in “Activate Status: pending-next-boot”. However, you do not have this option when doing the BMC (KVM) update. The BMC update reboot does not affect the server system, so it will reboot once the BMC code is activated.
Most of this information can be found in the release notes and I am sure you don’t want me to just reiterate it to you. We have been through all of the upgrade steps the first week UCS arrived so I am pretty confident the process will work. The challenge now is how to do it with a production system. I am fortunate to have only 17 blades in production. This leaves me with the 1 spare blade with an Emulex mezz. card and 6 spare blades with the VIC mezz. cards. This provides me with a test pool of blades (Note: these are only “spare” until we get through the upgrade process, then they will go into the VMware cluster and be put to use).
Goal: Upgrade UCS as Close to Non-Disruptive as Possible:
Our concern is will a blade function normally running 1.1 code when all the other components are still on 1.0 code? I suspect it will. If it does work, the plan is to take our time to update all of the blades over a week or so following the below steps.
This week we updated and activated the firmware on the Interface Cards for all 7 spare blades. Next we did the same thing for the BMC firmware for the 7 spare blades.
Next step is to put an ESX host into maintenance mode and move the Service Profile to the blade running 1.1 code with an Emulex mezz. card. We can confirm the ESX host functions and then move some test, dev, and then low end prod servers to this ESX host. This will allow us to develop a comfort level with the new code. This step should fit into our schedule early next week. If we see no issues we will be able to proceed with the process of putting additional ESX hosts into Maintenance Mode, then in UCSM update the firmware, activate firmware and then reboot the blade. This process will allow the ESX cluster blades to get fully updated with no impact on the applications.
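The per-host sequence described above can be written down as a simple ordered plan. This is only a sketch in Python: the step wording and the host names are made up for illustration, and nothing here calls vCenter or UCSM.

```python
from typing import Iterable, List, Tuple

# Hypothetical step labels for one ESX host, in the order described in the post.
STEPS = (
    "enter maintenance mode (guests move off via vMotion)",
    "update firmware (pre-stage)",
    "activate firmware (set startup version)",
    "reboot blade",
    "verify host, exit maintenance mode",
)


def rolling_upgrade_plan(hosts: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (host, step) pairs, finishing one host before starting the next."""
    return [(host, step) for host in hosts for step in STEPS]


for host, step in rolling_upgrade_plan(["esx01", "esx02"]):
    print(f"{host}: {step}")
```

Doing one host at a time is what keeps the cluster able to carry the guests while any single blade is down.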
For our W2K8 server running vCenter we will perform a backup and schedule a 15 min downtime to reboot it and move the Service Profile to our spare blade running 1.1 code with the Emulex mezz. card. We can then confirm functionality, etc. If there are issues we can just move the Service Profile back to the server running 1.0 code and we are back in service. This same process will be repeated for the 2 – W2K3 servers running on UCS blades.
By using the flexibility built into VMware that we all use and love (vMotion, DRS, etc.) and the hardware abstraction provided by the UCS Service Profiles, the process of updating the blade interface and BMC firmware should be straightforward with minimal impact on the end users. I will give an update next week . . .
I had blogged how clean the installation of a chassis is with needing only 4 cables for I/O and 4 power cables. Via twitter someone asked for a pic of this and I realized I did not have one posted. I have added this pic to my Picasa page that contains about 50 UCS pictures (it is the last one).
Note how open the airflow is when you only have a few cables to deal with. You can see they were paying attention to this detail on the fabric extenders and where the power plugs into; all of the openings.
You can also see I am only using 2 of the 4 possible ports in the fabric extenders (FEX). The options being 1, 2 or 4 connections per each FEX.
We have been running Cisco UCS 4 months now and are preparing for a code upgrade and adding more B200 blades to the system for VMware. So I was thinking what do I really have running in production on the system at this point? It makes sense to have a good handle on this as part of our code upgrade prep work. I put together the below information and figured others could find it useful to get a perspective of what is in production on UCS in the real world (note all of the blades refer to the B200 model running with the Emulex card).
Windows 2008 and 2003 Servers:
I will start with a cool one. Tuesday we went live with our VMware vCenter server loaded bare metal on a UCS blade with boot from SAN. This is W2K8 64 bit, vCenter 2.x with update manager and running SQL 2008 64 bit database (used by vCenter). It has 1 Nehalem 4 core CPU and 12 GB of memory and is running sweet. This is a big show of trust in UCS, the center of my VMware world running on it for the enterprise!
2 server blades boot from SAN (1 prod and 1 test) running W2K3 64 bit with Oracle ver. 10G for our document management system. It has 1 Nehalem 4 core CPU and 48 GB of memory and is running with no issues.
VMware ESX Hosts:
4 production VMware ESX 4.0 hosts with NO EMC PowerPath/VE. All boot from SAN, 2 – 4 Core CPU and 48 GB memory. These 4 ESX servers are configured to optimally support W2K8 64 bit Microsoft clusters. We currently are running 4 – 2 node MS clusters on these blades. They are using about 37% of the memory and not really touching the CPU, so we could easily double the number of MS clusters over time on these blades.
10 production VMware ESX 4.0 hosts with EMC PowerPath/VE. All boot from SAN, 2 – 4 Core CPU and 96 GB memory. Today we have 87 guest servers running on our UCS VMware cluster. This number increases daily. We are preparing for a few application go-lives that use Citrix XenApp to access the application, so we have another 47 of these servers built and ready to be turned on by the end of the month. So we should have well over 127 guest servers running by then on the UCS VMware cluster.
Here is a summary of the types of production applications/workloads that make up the current 87 guest servers:
NOTE: The 10 guest servers listed below for data warehouse are very heavy on memory (3 with 64 GB, etc.) and we have hard allocated this memory to the guest servers, meaning the guest is assigned and allocated all 64 GB of memory on boot, even if it is not using it. So these large servers are really using memory resources in VMware differently than what you normally would do within the shared memory function of VMware.
10 servers running data warehouse app; 5 heavy SQL 2008 64 bit servers with the rest being web and interfaces.
15 Document Management servers running W2K3 server, including IBM WebSphere.
39 W2K3 64 bit servers running Citrix XenApp 4.5 in production delivering our enterprise applications. The combination of these servers is probably handling applications for about 400 concurrent production users. This will be increasing significantly within 21 days with coming go-lives.
7 W2K8 64 bit servers that provide core Citrix XenApp DB function (SQL 2008) and Citrix Provisioning servers for XenApp servers.
1 W2K3 server running SQL 2005 for computer based learning; production for enterprise.
1 W2K3 server running SQL 2005 for production enterprise staff scheduling system.
3 W2K3 servers running general production applications (shared servers for lower end type apps).
3 W2K3 servers running interface processors for the surgical (OR) application (deals with things like collar-bone surgeries🙂 )
1 W2K3 server running a key finance application.
1 W2K3 server running a key pharmacy application.
1 W2K8 server running a pilot SharePoint site (the free version).
There are a few other misc guest servers running as well for various lower end functions, i.e., web servers, etc.
Current VMware Utilization:
Average CPU utilization in the UCS Cluster for the 10 hosts is 8.8%.
The 3 ESX hosts running guest servers with hard allocated 64 GB memory: 76% average.
The 7 ESX hosts running all other workloads: 41% average.
We still have a good amount of growth within our UCS Cluster with 10 servers. I believe I could run this full load on 6 blade servers if I had to for a short period of time.
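Here is a quick back-of-the-envelope check of that 6 blade estimate, as a small Python sketch. It assumes the 76% and 41% figures above are average memory utilization on the 96 GB hosts (they are not labeled above, so that is an assumption), and it ignores any overhead for HA or failover headroom.

```python
HOST_MEMORY_GB = 96  # per-host memory of the 10 ESX hosts in the UCS cluster


def consolidated_utilization(groups, target_hosts):
    """groups: list of (host_count, average_utilization_fraction)."""
    used_gb = sum(count * frac * HOST_MEMORY_GB for count, frac in groups)
    return used_gb / (target_hosts * HOST_MEMORY_GB)


# Assumed memory load: 3 hosts at 76%, 7 hosts at 41%, squeezed onto 6 hosts.
memory = consolidated_utilization([(3, 0.76), (7, 0.41)], target_hosts=6)
# CPU load: 8.8% average across 10 hosts, squeezed onto 6 hosts.
cpu = 0.088 * 10 / 6

print(f"memory on 6 hosts: {memory:.0%}, CPU on 6 hosts: {cpu:.1%}")
```

Under those assumptions the load lands at roughly 86% memory and under 15% CPU on 6 blades, which is tight on memory but workable for a short period.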
There you have it, a good summary of what a production Cisco UCS installation looks like in the real world.
Over the years at my organization we often would comment on the quantity of work our IT group is doing to meet the needs of the healthcare organization. We did all of the upgrades for Y2K, expanded with mergers, and installed more and more specific systems while not really adding FTEs to the mix. Through this time I found the quality of work produced to be very high with a focus on why we are doing what we do: to provide quality healthcare.
Contemplating recently about the successful implementation of Cisco UCS and our clinical data warehouse project it got me thinking about what things went into the success. I have touched on things in this blog about getting funding, short time lines, positive experience with Cisco Advanced Services, the UCS technology and how my staff worked well together.
I think I may have overlooked how strong a role a staff that works well together plays in the success of a UCS implementation. This insight has grown out of a lot of interaction with other organizations around reference calls for Cisco UCS. I have been able to take for granted skilled engineers with many years of experience in SAN, LAN, servers, virtualization and application delivery who all have the same reporting structure and get along. In addition, we have management who recognize individual skills and talents, which allows staff to explore new technologies. So I want to give a big thank you to my team for knocking it out of the park with Cisco UCS, you rock!
Yes, Cisco UCS is a very cool, state of the art system for delivering compute capacity but a successful implementation does not just happen. I think UCS can sing when an organization is able to work together as a team to pull all of the components together. From talking with some organizations where the server and network folks do not talk with the storage folks, I think that would be a very difficult environment to be successful in. On the other hand, putting in a system like Cisco UCS may bring the groups together and may simplify and clarify the need for interactions.
I have had a few readers ask me to comment on a new report from Tolly that was commissioned by HP to compare the network bandwidth scalability between Cisco UCS and the HP BladeSystem c7000. I have not read the report yet, however, on the Blades Made Simple blog (link listed below), there is a brief explanation of the report findings, link to the full report and then some great comments (you have to check them out).
I encourage you to take a look at the comments, they get pretty detailed about the UCS architecture, comparisons to the HP structure, etc. I found the comments from Sean McGee (Cisco data center architect and formerly a network architect for the HP BladeSystem BU) and then feedback from Ken Henault (HP Infrastructure Architect) a lot of fun to read. You can tell both of these guys are passionate about the technology. Hey, I can’t blame them, this stuff rocks. (I do find it interesting there are a lot of folks at Cisco formerly with the HP BladeSystem group).
My two cents (before reading the actual report, mind you):
As a UCS user, I am not too concerned with the oversubscription possibility. In our current production environment we have not seen any issue with bandwidth. We currently are using 16 blades over 2 chassis and within a few weeks we should start using our 3rd chassis and 8 more servers. I will be mindful to watch our bandwidth usage and see if there are any real world problems. I suspect at 24 servers I will not see any issues.
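For a rough sense of the numbers, here is a small Python sketch of the worst-case oversubscription ratio per chassis. It assumes 8 blades per chassis, 2 x 10 Gb ports per blade, 2 fabric extenders per chassis and 1, 2 or 4 x 10 Gb uplinks per FEX; the function name and the idea that every blade port is driven at line rate at the same time are simplifications of mine, not figures from the report.

```python
def oversubscription(blades=8, ports_per_blade=2, fex_count=2, links_per_fex=2, gbps=10):
    """Ratio of total blade-facing bandwidth to total chassis uplink bandwidth."""
    blade_bw = blades * ports_per_blade * gbps
    uplink_bw = fex_count * links_per_fex * gbps
    return blade_bw / uplink_bw


for links in (1, 2, 4):
    print(f"{links} link(s) per FEX -> {oversubscription(links_per_fex=links):.0f}:1")
```

With my current 2 links per FEX this works out to 4:1 on paper, and going to 4 links per FEX would bring it to 2:1. Whether that ever matters depends on how much of that bandwidth the servers really push.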
If you want to check out the report, here is the link:
Well the day finally came when I received my first group of UCS blades with the new UCS M81KR Virtual Interface Card (VIC) or what has been known as the Palo Card. This is the cool CNA built by Cisco specifically to add a great deal of flexibility to the I/O needs of virtual host servers (ok, mainly focused on VMware ESX 4.x, where all the cool virtualization is happening!).
I should have taken a picture of it! Gone is the Emulex or QLogic name stamped on the mezzanine card. The VIC provides all of the I/O function to the server blade. It is a single card with two 10 Gb FCoE ports to the northbound switches and then up to 128 virtual I/O interfaces facing the server/host side.
To be able to manage and build your own customized I/O world for an ESX host or guest machines you have to perform a code upgrade to your UCS system. Once that code upgrade is complete, you see Cisco has added an additional tab in the UCS Manager to be used for configuring the new virtual I/O functions. Note, I have not seen this new tab yet, we are currently planning our code upgrade process. I am interested to see how it goes upgrading the firmware, etc. on a production UCS system. I am sure I will blog about it!
So what does my world currently look like? I have 2 – 6120 Fabric Interconnects, 3 chassis and 25 B200 M1 blade servers (yes, I need to get a 4th chassis to house my 25th blade). 19 of the B200 blades contain the Emulex CNA and 6 B200 blades contain the new VIC CNA. I currently have my new “VIC” blades in the chassis but not in use. UCS Manager sees the new blades, can tell me about the VIC, and displays the interfaces differently (no vNICs or vHBAs have been created yet).
Stay tuned for an update on the code upgrade and screen shots of the new Virtual Tab, etc.
Here is the link at Cisco for details:
I was talking to a group in Chicago over the phone last week about Cisco UCS and they asked the question, “were there any gotchas when we implemented UCS”? I had to stop and think about the question. At first I thought it would be strange if I said no . . . but I could not think of anything that I would consider a gotcha.
In my mind I would define a gotcha as something that came up during implementation that required us to stop and change the way we were going to do something. It would be something significant. From this perspective I drew a blank.
The implementation and the use of UCS is not perfect. However, the only issues or stumbling blocks we encountered had to do with either understanding the concepts in the correct context or minor known bugs in the UCS Manager (well they became known to us as we went along🙂 ). Yes, there are a few bugs, like when you click on a vHBA template and try to navigate out of that tab it will prompt you to save your changes. This happens every time, even when you did NOT make a change; you have to save it or else you cannot leave that page. Or the strange thing that happens every once in a while (~5% of the time) when you have a KVM session to a blade and perform a reboot: the screen will stay black. You then have to do a few key strokes (I do not recall what they are right now) to get the KVM to display the actual screen again. I believe both of these items are known to Cisco and probably will be corrected in the next UCS Manager update.
So during my call when I was asked that question, the only thing I could come up with was describing the confusion in terms used for “native VLAN”. On the 6500 it is referred to as native, on the 6120 it is referred to as default and then on the blade service profile it is referred to as native. Once that was understood we could move on.
Another question that is typically asked is how much time is my staff spending on managing UCS on average? Good question . . . so I asked my 4 staff members. Turns out that 2 of the guys have been busy on other projects and have not had the need to go into UCS Manager. They have been performing all of their daily and project work in vSphere, no need to get into UCS Manager. They both indicated once it was set up they had no need.
Ok, so I went to my 2 server admins. It turns out the new guy has been in UCS Manager the most; this guy is excited and motivated about his new role. He logs in to check for errors and I have had him open a ticket for a bug that we saw. Other than that it is quiet.
Now this week I do have the 2 server admins building 2 blades with Windows 2008 Enterprise to run Oracle for a new application. So they are getting back into the Manager for those tasks. However, from a day to day standpoint there is no more hands on required than any other server or blade system.
Yes, we still need to set up more email and SNMP alerts so we are proactive if and when there is an issue. Those things will come as time permits, etc.
I have contact with several other like organizations in healthcare that have similar growth and change occurring. As part of an idea sharing that we do I hosted several of my peers a few weeks back to give them an overview of Cisco UCS, why we chose it, how we implemented it and how we manage the environment. From these sessions there were a few things that came up with most of the other organizations. I thought I would share some of them.
Code Updates: The question and concern raised was what happens when you have to apply new BIOS or code to the components of UCS; do you have to take down 8-16 servers?
The comparison of UCS to a SAN comes in handy for this question. Currently there are 2 components outside of the blades that can get updates: the code on the fabric extender (the IO module inside the chassis) and the Fabric Interconnects (6120 switches running UCS Manager).
For the blade itself you can update the firmware on the IO mezzanine card which requires a reboot of that server at your selected time. In addition, the BMC Controller has firmware that can be updated and the last update did not require a reboot.
For the fabric extender and the Fabric Interconnects the redundancy built into the system means you do not have a downtime. Like with a SAN array upgrade you perform your upgrades on the A side, then reboot while everything is functioning on the B side. Then perform the upgrade and reboot on the B side while everything is functioning on the A side. I would still perform this type of activity during second shift but the system would not require a downtime.
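The A side then B side approach can be sketched as a tiny simulation in Python. This is only an illustration of the ordering and of the "one side always up" rule; the function and its data structures are hypothetical and do not talk to any real equipment.

```python
from typing import Dict, List


def upgrade_fabric_pair(new_version: str, versions: Dict[str, str]) -> List[str]:
    """Upgrade side A, then side B, checking one side stays up throughout."""
    up = {side: True for side in versions}
    log = []
    for side in sorted(versions):            # "A" before "B"
        up[side] = False                     # this side reboots into the new code
        if not any(up.values()):
            raise RuntimeError("both fabrics down: this would be an outage")
        versions[side] = new_version
        carrying = [s for s in up if up[s]]
        log.append(f"{side} rebooting, traffic on {carrying}")
        up[side] = True                      # side is back before touching the other
    return log


fabric = {"A": "1.0(2d)", "B": "1.0(2d)"}
for line in upgrade_fabric_pair("1.1(1l)", fabric):
    print(line)
print(fabric)
```

The important part is the last step of the loop: the second side is not touched until the first side is confirmed back in service.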
What if a chassis fails?
When you look at the chassis, there is not any component that I could see failing that would take out 8 servers. The fans, power supplies, blades and fabric extenders are all redundant. Outside of those items the chassis is just a box. I am sure there are more details that someone familiar with the chassis internals could speak to, but I do not see it as a likely concern.
Concern that UCS is a Generation 1 Product:
My thoughts are yes it is a gen 1 system, however, not fully.
Yes, the Cisco blade server is “new”, however it is using all of the same industry standard components that other server vendors are using; same memory, CPU, etc. For the IO cards the Cisco server is using mezzanine cards which have chips and drivers provided by Emulex and QLogic.
From the infrastructure standpoint the Fabric Interconnect is built on the Cisco Nexus 5000 platform which has been in production for at least 18 months. The Nexus platform has been using converged networking for all of this time.
The major generation 1 component is the added functionality of the UCS Manager on the Fabric Interconnect.
So yes, it is gen 1 but it is taking what has existed and bringing it to the next level.
During one of my meetings we were talking and a peer asked his co-worker, “it is cool, but would we want to take a risk on a gen 1 product?”. I had to jump in and answer his question as “Yes”; it was too compelling to me not to move forward with UCS. At this point, Cisco UCS has streamlined our processes and the way we handle x86 servers so much that I cannot see going back.
We started UCS with 2 chassis and 16 blade servers; however, even before that shipped we ordered an additional chassis and 2 more blades.
My 3rd chassis arrived on the dock today all packaged up nicely with the blades, power cords, etc. I decided to time us to see how long it would take 2 people to get to a point where I can load an OS on the new blade servers.
Another engineer (who was not involved with the racking of the first 2 chassis) and I took on the task with a camera running (once I figure out how to condense the video into something that looks good I will post it on YouTube).
It took just under 17 minutes for us to unbox, rack, power and cable up the 10 G FCoE connections. Remember there are only 4 cables to install from the chassis fabric extenders to the 6120 fabric interconnect, yes only 4!
Next we went into UCS Manager to enable 2 ports on each fabric interconnect to be used as server ports by the chassis and acknowledge the new chassis. This took all of 2 minutes. At this point we could install an OS on the server blades; however, I wanted to make sure my firmware versions would be consistent with my first 2 chassis.
This took us to the update firmware process. My UCS system is currently running version 1.0(2d) and the new chassis items came with version 1.0(1e). The items in the new chassis that contain firmware are the blade server interface cards (CNA), the BMC controller (KVM function, etc.) and the fabric extenders (updates performed in that order). The interface cards and the fabric extenders required a reboot for the new firmware to take effect, the BMC controller did not require a reboot. This process took us 35 minutes to complete.
Note: there is no chassis management card because it is all managed from the 6120 fabric interconnect.
Overall, it took only 54 minutes to go from a box arriving from the dock to 2 new servers ready for an OS! Now that is a cool system.