HealthITGuy's Blog

The evolving world of healthcare IT!

Cisco UCS 1.4 Code: Upgrade is done, this is cool stuff

I am geeked! We just completed the code upgrade on our production Cisco UCS environment and it was awesome!

We have been in production on Cisco UCS for 1 year and 22 days now and have run on 1.1, 1.2, 1.3 and now 1.4 code. So today was our 3rd code upgrade on a production environment, and each time things have gotten better and cleaner. Why am I so excited? Think about it . . . with UCS there are 2 Fabric Interconnects with some chassis hanging off of them and a bunch of servers, all using a single point of management. Everything is connected with redundancy, and if all of that redundancy is operational and live you truly can reboot half your IO fabric and not drop a ping or storage connection. In a storage world this is standard and expected, but in a server blade world you would think that getting the same level of high availability and uptime provided by a SAN would take a lot of complexity. Enter Cisco UCS! An hour ago we upgraded and rebooted half our IO infrastructure, which serves over 208 production VM guest servers running on 21 VMware ESX hosts plus another 8 Windows server blades (all running active SQL or Oracle databases), without dropping a packet. Then I did the same thing to the other IO infrastructure path with NO ISSUES. This is just badass. I suspect in a year this type of redundancy and HA level in an x86 server environment will be an expectation and not an exception.

UCS Code Upgrade Experiences:

In March 2010 we performed the first upgrade while in production to 1.2 code (you can check out my blog post for all the details). The major impact we experienced with this one was due to a human issue; we forgot to enable spanning tree port fast for the EtherChannels connecting our Fabric Interconnects. Our fault, issue fixed, move on.

In December 2010 we implemented 1.3 code for a few reasons, mainly related to ESX 4.1 and Nexus 1K. Our only issue here was with 1 Windows 2003 64-bit server running on a B200 blade with OS NIC Teaming, which failed to work correctly. Again, not a UCS code issue but a server OS teaming issue. We had 3 servers using NIC Teaming in the OS, so we decided to change these servers to the hardware failover mode provided in UCS instead of teaming in the OS. Changes made, ready to move on.

It just so happened that on the same day we did the 1.3 upgrade, Cisco released 1.4 code, just in time for Christmas (thanks SAVBU). This time we had all our bases covered and each step worked as expected; no spanning tree issues, no OS NIC Teaming problems, it was smooth! There was some risk in moving to the new code so fast, but we have several projects that need the new B230 blades ASAP. Several UCS users and partners have already been going through 1.4 testing and things have been looking very good. Thanks to all who provided me with feedback over the last week.

New Features and Functions:

Now we get to dig into all the cool new functional features in the new code. I am impressed already. I will put together a separate post with my first impressions. I do want to point out one key thing that I referenced above: the need to upgrade the infrastructure to use new hardware (B230 blades). Now that I am on 1.4 code this requirement is gone. Yep, with 1.4 code, they have made changes that will NOT require an upgrade of the IO infrastructure (Fabric Interconnects and UCS Manager) to use new hardware like a B230. So yes, things are sweet with Cisco UCS and it just got even better.


December 29, 2010 | Cisco UCS, General, VMWare

Victims of Consolidation

We have been spending some time cleaning up in the datacenter, pulling out all of the old server hardware that is left over from migrations to a virtual environment. In this most recent round of cleanup, there are over 60 old physical servers in these stacks which provided a lot of compute cycles for us in the past. Their time has come to an end. And to think those 60 workloads can easily run on 2 Cisco UCS B200-M1 blades now with VMware ESX 4.x and EMC PowerPath/VE!

June 10, 2010 | General, VMWare

UCS Upgrade: Step One the Servers

We have been working on the process for performing the upgrade on the full UCS system to go from 1.0 to 1.1 code, which will support the VIC interface card (Palo Card).

Our approach has been to take things one step at a time and to understand the process as we move forward. We downloaded the 1.1(1l) bin file and release notes and have gone through them. The notes are pretty basic; they step you through each part of the process well and provide enough info to do it. However, since we are in production we are using caution; I do not want any surprises.

Steps in full UCS Code Upgrade:

1.  Server Interface firmware upgrade, requires a reboot.
2.  Server BMC (KVM) firmware upgrade, reboot is non-disruptive.
3.  Chassis FEX (fabric extender), requires reboot.
4.  UCS Manager, requires reboot.
5.  Fabric Interconnect (6120), requires a reboot.
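For anyone who likes to turn a list like this into a checklist, here is a minimal sketch in Python. The names `UpgradeStep` and `STEPS` are made up for illustration and are not part of UCS Manager; the reboot notes simply restate the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpgradeStep:
    order: int
    component: str
    reboot: str  # wording taken from the step list: "required" or "non-disruptive"


# Hypothetical checklist mirroring the five steps above, in the same order.
STEPS = [
    UpgradeStep(1, "Server Interface firmware", "required"),
    UpgradeStep(2, "Server BMC (KVM) firmware", "non-disruptive"),
    UpgradeStep(3, "Chassis FEX (fabric extender)", "required"),
    UpgradeStep(4, "UCS Manager", "required"),
    UpgradeStep(5, "Fabric Interconnect (6120)", "required"),
]


def steps_needing_planned_reboot(steps):
    """Return the components whose step calls for a reboot we have to plan around."""
    return [s.component for s in steps if s.reboot == "required"]


if __name__ == "__main__":
    for step in STEPS:
        print(f"{step.order}. {step.component}: reboot {step.reboot}")
```

Only the BMC step drops out of `steps_needing_planned_reboot`, which matches the list: every other step needs a reboot that has to be scheduled.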

First step was to open a ticket with Cisco. This allows us to document everything in the ticket, allows Cisco to use it to pull in the correct resources, and gives us a single point of contact. Next we requested a conference call with TAC and key engineers at Cisco to talk through the process, things to check beforehand, and what has worked in their labs and for other customers.

We covered items like making sure your Ethernet connections northbound (in our case a 6500) have spanning-tree port-fast enabled, confirming your servers are using some type of NIC teaming or have failover enabled on the vNIC in UCSM, and understanding the difference in UCSM between “update firmware” (moves the new code to the device and stages it) and “activate firmware” (moves the new code into the startup position so it takes effect at the next reboot). Special note: when you activate firmware, if you do NOT want the item to reboot automatically you need to check the box for “Set Startup Version Only”. This will move the new firmware to the “startup” position but not perform the reboot. When the activation is complete the item will be in “Active Status: pending-next-reboot”. However, you do not have this option when doing the BMC (KVM) update. The BMC reboot does not affect the server system, so the BMC will reboot as soon as its code is activated.
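The update/activate distinction is easier to keep straight as a tiny state model. This is just a sketch of the behavior described above, written in Python; the `Component` class, its field names, and the "activated-and-rebooted" label are my own illustration and not a UCS Manager API. Only "pending-next-reboot" comes from what UCSM actually displays.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    name: str
    running: str                  # version the component is running right now
    startup: str                  # version it will load at the next reboot
    staged: Optional[str] = None  # version copied over by "update firmware"
    supports_startup_only: bool = True  # assumed False for the BMC, per the text

    def update_firmware(self, version: str) -> None:
        """'Update firmware': stage the new code on the device, nothing else changes."""
        self.staged = version

    def activate_firmware(self, set_startup_version_only: bool = False) -> str:
        """'Activate firmware': move staged code to startup, rebooting unless deferred."""
        if self.staged is None:
            raise ValueError("nothing staged; run update_firmware first")
        self.startup = self.staged
        if set_startup_version_only and self.supports_startup_only:
            return "pending-next-reboot"
        self.reboot()
        return "activated-and-rebooted"  # hypothetical label for this sketch

    def reboot(self) -> None:
        """After a reboot the component runs whatever sits in the startup position."""
        self.running = self.startup


if __name__ == "__main__":
    adapter = Component("adapter", running="1.0", startup="1.0")
    adapter.update_firmware("1.1")
    print(adapter.activate_firmware(set_startup_version_only=True))
    print(adapter.running, adapter.startup)
```

In this model the adapter keeps running 1.0 until someone reboots it, while a component created with `supports_startup_only=False` (the BMC case) reboots on activation no matter what.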

Most of this information can be found in the release notes and I am sure you don’t want me to just reiterate it to you.  We have been through all of the upgrade steps the first week UCS arrived so I am pretty confident the process will work.  The challenge now is how to do it with a production system.  I am fortunate to have only 17 blades in production.  This leaves me with the 1 spare blade with an Emulex mezz. card and 6 spare blades with the VIC mezz. cards.  This provides me with a test pool of blades (Note: these are only “spare” until we get through the upgrade process, then they will go into the VMware cluster and be put to use).

Goal:  Upgrade UCS as Close to Non-Disruptive as Possible:

Our concern is whether a blade will function normally running 1.1 code when all the other components are still on 1.0 code. I suspect it will. If it does work, the plan is to take our time and update all of the blades over a week or so following the steps below.

This week we updated and activated the firmware on the Interface Cards for all 7 spare blades.  Next we did the same thing for the BMC firmware for the 7 spare blades.

Next step is to put an ESX host into maintenance mode and move the Service Profile to the blade running 1.1 code with an Emulex mezz. card. We can confirm the ESX host functions and then move some test, dev, and then low end prod servers to this ESX host. This will allow us to develop a comfort level with the new code. This step should fit into our schedule early next week. If we see no issues we will be able to proceed with putting additional ESX hosts into Maintenance Mode, then in UCSM update the firmware, activate the firmware, and reboot the blade. This process will allow the ESX cluster blades to get fully updated with no impact on the applications.
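The rolling part of that process can be written down as a simple plan generator. This is only a sketch in Python: `rolling_upgrade_plan` is a made-up helper, the host names are placeholders, and the final "exit maintenance mode" action is my assumption about how each host goes back into service. It builds a list of actions and does not talk to vCenter or UCSM.

```python
def rolling_upgrade_plan(hosts):
    """Return the ordered (host, action) pairs for upgrading ESX hosts one at a time."""
    actions = [
        "enter maintenance mode (guests move to the other hosts)",
        "update firmware in UCSM",
        "activate firmware in UCSM",
        "reboot the blade",
        "exit maintenance mode",  # assumed final step, not spelled out in the text
    ]
    plan = []
    for host in hosts:
        # Finish every action on one host before touching the next one,
        # so the rest of the cluster keeps carrying the workload.
        for action in actions:
            plan.append((host, action))
    return plan


if __name__ == "__main__":
    for host, action in rolling_upgrade_plan(["esx-01", "esx-02"]):
        print(f"{host}: {action}")
```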

For our W2K8 server running vCenter we will perform a backup and schedule a 15 min downtime to reboot it and move the Service Profile to our spare blade running 1.1 code with the Emulex mezz. card. We can then confirm functionality, etc. If there are issues we can just move the Service Profile back to the server running 1.0 code and we are back in service. This same process will be repeated for the 2 W2K3 servers running on UCS blades.

Summary:

By using the flexibility built into VMware that we all use and love (vMotion, DRS, etc.) and the hardware abstraction provided by the UCS Service Profiles, the process of updating the blade interface and BMC firmware should be straightforward with minimal impact on the end users. I will give an update next week . . .

March 25, 2010 | Cisco UCS, UCS Manager, VMWare | 2 Comments

Real World UCS: Production Apps Kicking it!

We have been running Cisco UCS 4 months now and are preparing for a code upgrade and adding more B200 blades to the system for VMware.  So I was thinking what do I really have running in production on the system at this point?  It makes sense to have a good handle on this as part of our code upgrade prep work.  I put together the below information and figured others could find it useful to get a perspective of what is in production on UCS in the real world (note all of the blades refer to the B200 model running with the Emulex card).

Windows 2008 and 2003 Servers:

I will start with a cool one. Tuesday we went live with our VMware vCenter server loaded bare metal on a UCS blade with boot from SAN. This is W2K8 64 bit, vCenter 2.x with update manager and running SQL 2008 64 bit database (used by vCenter). It has 1 Nehalem 4 core CPU and 12 GB of memory and is running sweet. This is a big show of trust in UCS, the center of my VMware world running on it for the enterprise!

2 server blades boot from SAN (1 prod and 1 test) running W2K3 64 bit with Oracle ver. 10G for our document management system. Each has 1 Nehalem 4 core CPU and 48 GB of memory and is running with no issues.

VMware ESX Hosts:

4 production VMware ESX 4.0 hosts with NO EMC PowerPath/VE. All boot from SAN, each with two 4-core CPUs and 48 GB memory. These 4 ESX servers are configured to optimally support W2K8 64 bit Microsoft clusters. We are currently running four 2-node MS clusters on these blades. They are using about 37% of the memory and not really touching the CPU, so we could easily double the number of MS clusters over time on these blades.

10 production VMware ESX 4.0 hosts with EMC PowerPath/VE. All boot from SAN, each with two 4-core CPUs and 96 GB memory. Today we have 87 guest servers running on our UCS VMware cluster. This number increases daily. We are preparing for a few application go-lives that use Citrix XenApp to access the application, so we have another 47 of these servers built and ready to be turned on by the end of the month. So we should have well over 127 guest servers running by then on the UCS VMware cluster.
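As a quick sanity check on that projection, here is the arithmetic in Python. The variable names are just for illustration; the counts are the ones quoted above.

```python
current_guests = 87   # guest servers running today
xenapp_ready = 47     # XenApp servers built and waiting to be turned on

projected_guests = current_guests + xenapp_ready
print(projected_guests)  # 134
```

That lands at 134 guests once the XenApp servers are on, before counting any other daily growth.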

Here is a summary of the types of production applications/workloads that make up the current 87 guest servers:

NOTE: For the 10 guest servers listed below for data warehouse, they are very heavy on memory (3 with 64 GB, etc.) and we have hard allocated this memory to the guest servers. Meaning the guest is assigned and allocated all 64 GB of memory on boot, even if it is not using it. So, for these large servers they are really using memory resources in VMWare differently than what you normally would do within the shared memory function of VMWare.

10 servers running data warehouse app; 5 heavy SQL 2008 64 bit servers with the rest being web and interfaces.

15 servers for Document Management servers running W2K3 server including IBM Websphere.

39 W2K3 64 bit servers running Citrix XenApp 4.5 in production delivering our enterprise applications. The combination of these servers is probably handling applications for about 400 concurrent production users. This will be increasing significantly within 21 days with coming go-lives.

7 W2K8 64 bit servers that provide core Citrix XenApp DB function (SQL 2008) and Citrix Provisioning servers for XenApp servers.

1 W2K3 server running SQL 2005 for computer based learning; production for enterprise.

1 W2K3 server running SQL 2005 for production enterprise staff scheduling system.

3 W2K3 servers running general production applications (shared servers for lower end type apps).

3 W2K3 servers running interface processors for the surgical (OR) application (deals with things like collar-bone surgeries 🙂 )

1 W2K3 server running a key finance application.

1 W2K3 server running a key pharmacy application.

1 W2K8 server running a pilot SharePoint site (the free version).

There are a few other misc guest servers running as well for various lower end functions, i.e., web servers, etc.
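To see how the list above adds up against the 87 guest total, here is a small tally in Python. The dictionary keys are my own shorthand labels for the groups listed; the counts come straight from the list.

```python
# Guest server counts by workload, as listed above.
workloads = {
    "data warehouse": 10,
    "document management": 15,
    "Citrix XenApp 4.5": 39,
    "XenApp DB and provisioning": 7,
    "computer based learning (SQL 2005)": 1,
    "staff scheduling (SQL 2005)": 1,
    "general production apps": 3,
    "surgical (OR) interface processors": 3,
    "finance": 1,
    "pharmacy": 1,
    "SharePoint pilot": 1,
}

TOTAL_GUESTS = 87

listed = sum(workloads.values())
misc = TOTAL_GUESTS - listed  # what is left over for the misc guests

print(f"listed: {listed}, misc: {misc}")  # listed: 82, misc: 5
```

So the named workloads cover 82 guests, which leaves 5 for the misc servers.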

Current VMware Utilization:

Average CPU utilization in the UCS Cluster for the 10 hosts is 8.8%.

Memory usage:

The 3 ESX hosts running guest servers with hard allocated 64 GB memory: 76% average.

The 7 ESX hosts running all other workloads: 41% average.

We still have a good amount of growth within our UCS Cluster with 10 servers. I believe I could run this full load on 6 blade servers if I had to for a short period of time.
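Here is a rough sketch of the math behind the 6 blade estimate, in Python. It assumes all 10 hosts have 96 GB, that the memory percentages above are fractions of host memory, and that load spreads evenly across the remaining blades. It ignores the hard allocated 64 GB guests and any failover headroom, so treat it as a ballpark only.

```python
HOST_MEMORY_GB = 96          # per host, from the cluster description
HOSTS_TODAY = 10
HOSTS_TARGET = 6

heavy_hosts, heavy_mem_util = 3, 0.76   # hosts with the hard allocated 64 GB guests
other_hosts, other_mem_util = 7, 0.41   # hosts running all other workloads
avg_cpu_util = 0.088                    # cluster average across the 10 hosts

# Memory in use across the cluster today.
used_gb = HOST_MEMORY_GB * (heavy_hosts * heavy_mem_util + other_hosts * other_mem_util)

# The same load squeezed onto fewer blades.
mem_util_on_target = used_gb / (HOSTS_TARGET * HOST_MEMORY_GB)
cpu_util_on_target = avg_cpu_util * HOSTS_TODAY / HOSTS_TARGET

print(f"memory in use today: {used_gb:.1f} GB")
print(f"memory on {HOSTS_TARGET} blades: {mem_util_on_target:.0%}")
print(f"CPU on {HOSTS_TARGET} blades: {cpu_util_on_target:.1%}")
```

Under those assumptions the load would need roughly 494 GB of the 576 GB available on 6 blades, about 86% memory and under 15% CPU. Tight on memory, but it fits for a short period.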

There you have it, a good summary of what a production Cisco UCS installation looks like in the real world.

March 18, 2010 | Cisco UCS, VMWare | 5 Comments