Well, that was a great day! When I was invited to represent my organization at the Cisco Datacenter Launch and talk about our UCS experience, I was humbled, excited and nervous. It is not often in a career that you get the opportunity to be on a panel with such innovative leaders in the technology industry as David Lawler, Soni Jiandani, Boyd Davis and Ben Gibson. Everyone was down to earth, personable and very comfortable to work with on the panel. The goal was to make the event a relaxed discussion, and the customer's point of view was truly important to the panel and to Cisco. I was also amazed at how many people it takes to pull together the details for an event of this nature. Lynn, Janne and Marsha were great, making sure I was prepared and helping everything go off smoothly.
This customer focus continued to be evident after the video was completed. I was able to spend the rest of the day with many key individuals from the UCS business unit, all of whom made time for me. We had some deep technical discussions on various topics: firmware upgrades, wish lists, product direction, ease of use vs. levels of control, etc. Everyone asked me for input on ways to improve the system, and we also talked about how we are using it.
To end my day on the Cisco campus, David Lawler invited me to his office to meet with him and Mario Mazzola, Senior Vice President of the Server Access and Virtualization Business Unit (SAVBU). Mario has been a key technology figure in Silicon Valley, leading the creation of the 6500 switch product, the Cisco MDS Fibre Channel product and now the Cisco UCS platform (along with many other accomplishments). I think it is fair to say he is a legend in the industry (though my impression is he is very humble and quickly acknowledges others for their contributions to these projects). Our conversation focused on the customer's view of the product and on Cisco's goal of continually improving the system. Mario and David are very down-to-earth people, and it was clear to me that Cisco is very customer focused from the top of the organization down.
So that’s a wrap for this trip to Cisco in San Jose for now . . .
In my last post on the UCS firmware upgrade process, I described our proposed upgrade steps for the FEX modules and the Fabric Interconnects. At the time, the best information I had gathered (release notes, TAC, etc.) indicated the possibility of a one-minute interruption of I/O.
One of the great things about coming out to Cisco in San Jose is being able to talk to the people in the business unit who build this stuff. I will have more in-depth discussions tomorrow, but the quick news is some clarification on the "one minute interruption of I/O." It turns out this really refers only to the Ethernet side of the I/O. Your Fibre Channel storage area network will see no disruption, because of the multipath nature of FC, so there is no SAN connectivity impact. On the Ethernet side, it sounds like any disruption really comes down to rebuilding the MAC tables on the fabric interconnects, plus a few other failover functions. The smaller the MAC tables, the quicker the "failover."
Like I said, I will get more details tomorrow but wanted to get this new info out to everyone. I am feeling pretty good about our upgrade next week.
Our plan is to have the blade firmware (interface and BMC) upgrade completed by next Wednesday. After that it will be time to move up the stack to the I/O devices and the manager. Based on schedules, these items will probably be scheduled in about two weeks.
Impact for the last 3 items:
Based on how the system works, if it functions as designed, we should experience up to one minute of lost connectivity or downtime. Here is an explanation of the process:
A chassis contains 2 fabric extenders on an A and B side (called FEX or I/O modules). These are connected to the switch called the fabric interconnect (FI or 6120). The UCS Manager (UCSM) tool resides on the 6120s and runs as an active/passive cluster.
Summary of Steps for I/O Upgrade:
1. Confirm all servers have some form of NIC teaming functioning so they can handle the loss of one I/O path.
2. Confirm the northbound Ethernet connections between the FI (6120) and the 6513 switch have spanning-tree portfast enabled.
3. Pre-stage the new firmware on all 6 (2 per chassis) FEX modules to be applied on next reboot.
4. Pre-stage the new firmware on both of the Fabric Interconnects (6120) to be applied on next reboot (I am trying to confirm that you can “activate in startup version” on the 6120).
5. Reboot only the "passive" Fabric Interconnect (6120). Note the FEX modules are an extension of the FI (6120), so when the FI is rebooted the connected FEX modules reboot with it (meaning the entire B-side I/O path reboots at once). It can take up to 10 minutes for the full reboot process to complete. During this time all workload traffic should be functioning via the remaining I/O path.
6. Once step 5 is fully completed, perform a reboot on the remaining FI (6120). This should be done soon after confirming step 5 has completed; you want both FIs (6120s) running the same code version whenever possible.
Again, we are told the above steps will most likely result in up to a one-minute interruption of I/O to the blade servers. That is, if all functions as designed.
Summary of Steps for UCSM Upgrade (Post I/O upgrade):
1. Perform the UCSM upgrade on the passive FI (6120) device; this will "restart or reboot" the UCSM process, NOT the FI (6120) itself.
2. Perform the UCSM upgrade on the active FI (6120) device; this will "restart or reboot" the UCSM process, NOT the FI (6120) itself.
The UCSM can be upgraded and restarted without affecting applications running on UCS; it does not impact I/O functions.
Putting this into perspective . . . what is the disruption when you have to update code on an Ethernet switch or a Fibre Channel switch?
I am really curious to find out whether we do see about one minute of I/O disruption. Logically, I would think there is no disruption if you have some level of NIC teaming at the blade and you upgrade the B side and then the A side. To be continued . . .
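Rather than guess, I plan to put a number on it during the maintenance window. Here is a minimal Python sketch of the kind of monitor I have in mind; the target address, probe interval and watch window are placeholder assumptions on my part, not anything Cisco provides.

```python
#!/usr/bin/env python3
"""Rough I/O-disruption timer: ping a blade while the FI reboots and
report the longest stretch of missed replies. Target IP, probe interval
and run length are assumptions for this sketch."""
import subprocess
import time

TARGET = "10.0.0.50"   # hypothetical blade or guest IP
INTERVAL = 1.0         # seconds between probes
DURATION = 15 * 60     # watch 15 minutes, covering the ~10 min FI reboot

def ping_once(host):
    """Send one ping with a 1-second timeout; True if a reply came back.
    (-W is the Linux reply-timeout flag; adjust on other platforms.)"""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

outage_start = None
longest = 0.0
end_time = time.time() + DURATION
while time.time() < end_time:
    if ping_once(TARGET):
        if outage_start is not None:          # outage just ended
            longest = max(longest, time.time() - outage_start)
            outage_start = None
    elif outage_start is None:                # outage just began
        outage_start = time.time()
    time.sleep(INTERVAL)

if outage_start is not None:                  # still down at the end
    longest = max(longest, time.time() - outage_start)
print(f"Longest observed outage: {longest:.1f} seconds")
```

If the one-minute claim holds, this should report roughly 60 seconds or less on the Ethernet side; the FC side would have to be checked separately via the host's multipath status, since a ping only exercises the Ethernet path.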
We are making progress with the server component of the upgrade process.
Our spare blade with an Emulex mezzanine card, running the 1.1(1l) code for the interface and BMC, was ready for a real workload. We put one of our production ESX hosts into maintenance mode and powered off the blade. Then we unassociated its service profile and associated it with the physical blade running the 1.1(1l) code and the same type of Emulex card. The server booted with no issues; we confirmed all was normal, and after about 30 minutes we had it back in the live cluster.
We repeated this process on 2 additional ESX hosts, so 3 of the 10 ESX servers in the UCS cluster are now running the new firmware with no issues. The plan is to do several more tomorrow, maybe the rest of them. Very positive results.
Two ways to update end-node firmware (meaning the blade):
As I was reading through the "how to upgrade UCS…" release notes, I recalled some early discussions from when I was evaluating UCS for purchase. There are two ways to update the firmware on the interface, BMC, etc. We have been using the UCSM tool to go to the physical blade and update it at that level. The other way is via the Service Profile Host Firmware Package policy.
This gets pretty interesting once you think about it. Instead of thinking of firmware as belonging to the hardware, you think of it as belonging to the workload (the Service Profile). Let's say my W2K3 server's interface can only run on 1.0(2b) firmware, and I need to make sure the correct firmware is in place regardless of the physical blade the server runs on. A Service Profile firmware policy makes that happen. So when you move the W2K3 workload from chassis 1, blade 3 to chassis 3, blade 7, the Service Profile (via the Cisco Utility OS) drops in the approved firmware version. Pretty cool to think about.
Note there is at least one drawback to the Service Profile approach. The firmware policy is auto-updating, so if you change the firmware version in the policy, it will automatically apply the change and restart the server. This means you have to be careful how you use it as a means to perform updates. (When doing firmware updates via UCSM, you DO have the ability to control when you reboot the workload.)
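To make that drawback concrete, here is a toy Python sketch; this is emphatically NOT the UCS API, just an illustration of why a single policy edit can ripple into reboots across every profile that references it.

```python
"""Toy illustration (NOT the UCS API) of the auto-updating Host Firmware
Package policy: every service profile referencing the policy picks up a
version change immediately, and each associated blade reboots."""

class FirmwarePolicy:
    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.subscribers = []             # service profiles using this policy

    def set_version(self, version):
        """Change the policy version; the change applies immediately."""
        self.version = version
        for profile in self.subscribers:  # no chance to stage or schedule
            profile.apply_firmware(version)

class ServiceProfile:
    def __init__(self, name, blade, policy):
        self.name, self.blade = name, blade
        policy.subscribers.append(self)

    def apply_firmware(self, version):
        print(f"{self.name}: flashing {version} on {self.blade} -- REBOOTING")

policy = FirmwarePolicy("esx-hosts", "1.0(2b)")
for i in range(1, 4):
    ServiceProfile(f"esx{i:02d}", f"chassis1/blade{i}", policy)

policy.set_version("1.1(1l)")   # one edit -> three production reboots
```

With the per-blade UCSM method, by contrast, you stage the firmware and decide yourself when each blade reboots.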
As part of our firmware upgrade process we will be rebooting the blades for the firmware to take effect. So, as part of my prep work, I took some time measurements today to get a feel for how long our tasks may take.
I measured boot times for VMware, W2K3 and the Cisco Utility OS.
ESX 4.0 host:
VMware function: Put into maintenance mode (9 running and 8 powered-off guest servers): 3 min. 18 sec.
VMware function: Power down ESX host via vCenter: 45 sec.
UCS function: Click Boot Server until ping and console appear: 4 min. 10 sec.
VMware function: Time for ESX to fully become available in vCenter: 3 min. 13 sec. additional
Windows 2003 Server:
UCS function: Click Boot Server until Windows 2003 splash Screen: 2 min. 16 sec.
W2K3 function: Time between splash screen and login prompt: 54 sec.
W2K3 function: 38 sec.
Cisco Utility OS, i.e., Service Profile Association:
For any server to run on UCS, you first have to associate a Service Profile with a physical blade. This is the process in which the hardware abstraction is done, by running the Cisco Utility OS on the physical blade. This process does take some time.
UCS function: Associate/Disassociate Service Profile to blade: 5 min. 6 sec.
Note the time to associate/disassociate a Service Profile with a blade only applies when first associating the two or when a change is made to the Service Profile. It does NOT run on every boot.
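For what it is worth, here is roughly how I captured the "click Boot Server until ping" numbers, as a small Python stopwatch; the host address is a placeholder and the ping flags assume Linux.

```python
"""Stopwatch for the "click Boot Server until ping" measurement: start it
the moment you click Boot Server in UCSM; it reports the time until the
host answers its first ping. The IP below is a placeholder."""
import subprocess
import time

HOST = "10.0.0.51"   # hypothetical address of the booting blade

input("Press Enter the moment you click Boot Server... ")
start = time.time()
while True:
    reply = subprocess.run(
        ["ping", "-c", "1", "-W", "1", HOST],   # -W: Linux reply timeout
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    if reply.returncode == 0:
        break
elapsed = time.time() - start
print(f"Boot to first ping: {int(elapsed // 60)} min {elapsed % 60:.0f} sec")
```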
I had blogged about how clean the installation of a chassis is, needing only 4 cables for I/O and 4 power cables. Via Twitter someone asked for a pic of this, and I realized I did not have one posted. I have added the pic to my Picasa page, which contains about 50 UCS pictures (it is the last one).
Note how open the airflow is when you only have a few cables to deal with. You can see they were paying attention to this detail on the fabric extenders and around where the power plugs in; look at all of the openings.
You can also see I am only using 2 of the 4 possible ports on each fabric extender (FEX), the options being 1, 2 or 4 connections per FEX (at 10 Gb per port, our 2 connections give each FEX 20 Gb of uplink).
We have been running Cisco UCS for 4 months now and are preparing for a code upgrade and for adding more B200 blades to the system for VMware. So I was thinking: what do I really have running in production on the system at this point? It makes sense to have a good handle on this as part of our code-upgrade prep work. I put together the information below and figured others could find it useful for a perspective on what is in production on UCS in the real world (note all of the blades referenced are the B200 model with the Emulex card).
Windows 2008 and 2003 Servers:
I will start with a cool one. Tuesday we went live with our VMware vCenter server loaded bare-metal on a UCS blade with boot from SAN. This is W2K8 64-bit, vCenter 2.x with Update Manager, running a SQL 2008 64-bit database (used by vCenter). It has 1 Nehalem 4-core CPU and 12 GB of memory, and it is running sweet. This is a big show of trust in UCS: the center of my VMware world for the enterprise is running on it!
2 server blades boot from SAN (1 prod and 1 test) running W2K3 64-bit with Oracle 10g for our document management system. Each has 1 Nehalem 4-core CPU and 48 GB of memory and is running with no issues.
VMware ESX Hosts:
4 production VMware ESX 4.0 hosts with NO EMC PowerPath/VE. All boot from SAN, each with 2 4-core CPUs and 48 GB memory. These 4 ESX servers are configured to optimally support W2K8 64-bit Microsoft clusters. We currently run 4 two-node MS clusters on these blades. They are using about 37% of the memory and barely touching the CPU, so we could easily double the number of MS clusters on these blades over time.
10 production VMware ESX 4.0 hosts with EMC PowerPath/VE. All boot from SAN, each with 2 4-core CPUs and 96 GB memory. Today we have 87 guest servers running on our UCS VMware cluster, and this number increases daily. We are preparing for a few application go-lives that use Citrix XenApp to access the application, so we have another 47 of these servers built and ready to be turned on by the end of the month. That should put us well over 127 guest servers on the UCS VMware cluster.
Here is a summary of the types of production applications/workloads running on the current 87 guest servers:
NOTE: The 10 guest servers listed below for the data warehouse are very heavy on memory (3 with 64 GB, etc.), and we have hard-allocated this memory to the guests (a full memory reservation, in VMware terms). That means a guest is assigned and allocated all 64 GB of memory on boot, even if it is not using it. So these large servers use memory resources very differently than you normally would within VMware's shared-memory functions.
10 servers running a data warehouse app; 5 are heavy SQL 2008 64-bit servers, with the rest being web and interface servers.
15 servers for document management running W2K3 Server, including IBM WebSphere.
39 W2K3 64-bit servers running Citrix XenApp 4.5 in production, delivering our enterprise applications. Combined, these servers are probably handling applications for about 400 concurrent production users. This will increase significantly within 21 days with the coming go-lives.
7 W2K8 64-bit servers that provide the core Citrix XenApp DB function (SQL 2008) and Citrix Provisioning services for the XenApp servers.
1 W2K3 server running SQL 2005 for computer-based learning; in production for the enterprise.
1 W2K3 server running SQL 2005 for production enterprise staff scheduling system.
3 W2K3 servers running general production applications (shared servers for lower end type apps).
3 W2K3 servers running interface processors for the surgical (OR) application (it deals with things like collarbone surgeries).
1 W2K3 server running a key finance application.
1 W2K3 server running a key pharmacy application.
1 W2K8 server running a pilot SharePoint site (the free version).
There are a few other miscellaneous guest servers running as well for various lower-end functions, e.g., web servers.
Current VMware Utilization:
Average CPU utilization in the UCS Cluster for the 10 hosts is 8.8%.
The 3 ESX hosts running guests with hard-allocated 64 GB memory: 76% average memory utilization.
The 7 ESX hosts running all other workloads: 41% average memory utilization.
We still have a good amount of growth headroom within our 10-host UCS cluster. I believe I could run this full load on 6 blade servers if I had to for a short period of time.
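As a sanity check on that claim, here is a quick back-of-the-envelope calculation in Python, assuming the 76%/41% figures are memory utilization and all 10 hosts carry 96 GB as specced above.

```python
"""Back-of-the-envelope check: could the full load run on 6 blades?
Assumes the 76%/41% averages are memory utilization and that all
10 hosts have 96 GB each, per the specs above."""
HOST_GB = 96

in_use = 3 * 0.76 * HOST_GB + 7 * 0.41 * HOST_GB   # ~494 GB in use
capacity_6 = 6 * HOST_GB                            # 576 GB on 6 blades

print(f"Memory in use: {in_use:.0f} GB")
print(f"6-blade capacity: {capacity_6} GB ({in_use / capacity_6:.0%} used)")
```

Roughly 86% memory utilization across 6 blades would be tight, but it supports the idea that the cluster could carry the load that way for a short period.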
There you have it, a good summary of what a production Cisco UCS installation looks like in the real world.
I have had a few readers ask me to comment on a new report from Tolly, commissioned by HP, comparing network bandwidth scalability between the Cisco UCS and the HP BladeSystem c7000. I have not read the report yet; however, the Blades Made Simple blog (link below) has a brief explanation of the report's findings, a link to the full report and some great comments (you have to check them out).
I encourage you to take a look at the comments; they get pretty detailed about the UCS architecture, comparisons to the HP structure, etc. I found the comments from Sean McGee (Cisco data center architect and formerly a network architect for the HP BladeSystem BU) and the feedback from Ken Henault (HP Infrastructure Architect) a lot of fun to read. You can tell both of these guys are passionate about the technology. Hey, I can't blame them; this stuff rocks. (I do find it interesting that there are a lot of folks at Cisco formerly with the HP BladeSystem group.)
My two cents (before reading the actual report, mind you):
As a UCS user, I am not too concerned about the oversubscription possibility. In our current production environment we have not seen any issue with bandwidth. Doing the rough math on our setup: each chassis has 40 Gb of uplink (2 ports x 10 Gb on each of the 2 FEX), while its 8 blades could in theory drive 20 Gb apiece, about 4:1 oversubscription, and our real traffic comes nowhere near those limits. We currently are using 16 blades across 2 chassis, and within a few weeks we should start using our 3rd chassis and 8 more servers. I will be mindful to watch our bandwidth usage and see if there are any real-world problems, but I suspect at 24 servers I will not see any issues.
If you want to check out the report, here is the link:
Well, the day finally came: I received my first group of UCS blades with the new UCS M81KR Virtual Interface Card (VIC), also known as the Palo card. This is the cool CNA built by Cisco specifically to add a great deal of flexibility to the I/O needs of virtual host servers (OK, mainly focused on VMware ESX 4.x, where all the cool virtualization is happening!).
I should have taken a picture of it! Gone is the Emulex or QLogic name stamped on the mezzanine card. The VIC provides all of the I/O functions for the server blade. It is a single card with 2 10 Gb FCoE ports toward the northbound switches and up to 128 virtual I/O interfaces facing the server/host side.
To be able to manage and build your own customized I/O world for an ESX host or its guest machines, you have to perform a code upgrade on your UCS system. Once that upgrade is complete, you will see that Cisco has added an additional tab in the UCS Manager for configuring the new virtual I/O functions. Note I have not seen this new tab yet; we are currently planning our code upgrade process. I am interested to see how upgrading the firmware, etc. goes on a production UCS system. I am sure I will blog about it!
So what does my world currently look like? I have 2 6120 Fabric Interconnects, 3 chassis and 25 B200 M1 blade servers (yes, I need to get a 4th chassis to house my 25th blade). 19 of the B200 blades contain the Emulex CNA and 6 contain the new VIC CNA. I currently have my new "VIC" blades in the chassis but not in use. UCS Manager sees the new blades, can tell me about the VIC, and displays the interfaces differently (no vNICs or vHBAs have been created yet).
Stay tuned for an update on the code upgrade, screenshots of the new Virtual tab, etc.
Here is the link at Cisco for details:
I was talking to a group in Chicago over the phone last week about Cisco UCS, and they asked, "Were there any gotchas when you implemented UCS?" I had to stop and think about the question. At first I thought it would be strange if I said no . . . but I could not think of anything I would consider a gotcha.
In my mind, a gotcha is something that comes up during implementation and requires you to stop and change the way you were going to do something; something significant. From that perspective, I drew a blank.
The implementation and use of UCS is not perfect. However, the only issues or stumbling blocks we encountered had to do with either understanding the concepts in the correct context or minor known bugs in the UCS Manager (well, they became known to us as we went along). Yes, there are a few bugs. For example, when you click on a vHBA template and try to navigate out of that tab, it prompts you to save your changes; this happens every time, even when you did NOT make a change, and you have to save or you cannot leave the page. Or the strange thing that happens every once in a while (~5% of the time): when you have a KVM session to a blade and perform a reboot, the screen stays black, and you have to enter a few keystrokes (I do not recall what they are right now) to get the KVM to display the actual screen again. I believe both of these items are known to Cisco and will probably be corrected in the next UCS Manager update.
So during my call, when asked that question, the only thing I could come up with was describing the confusion around the term "native VLAN." On the 6500 it is referred to as native, on the 6120 it is referred to as default, and then on the blade service profile it is referred to as native again. Once that was understood, we could move on.
Another question that typically comes up is how much time my staff spends managing UCS on average. Good question . . . so I asked my 4 staff members. It turns out 2 of the guys have been busy on other projects and have not had any need to go into UCS Manager; they have been performing all of their daily and project work in vSphere. They both indicated that once it was set up, they had no need.
OK, so I went to my 2 server admins. It turns out the new guy has been in UCS Manager the most; he is excited and motivated about his new role. He logs in to check for errors, and I had him open a ticket for a bug we saw. Other than that, it is quiet.
Now this week I do have the 2 server admins building 2 blades with Windows 2008 Enterprise to run Oracle for a new application, so they are getting back into the Manager for those tasks. However, from a day-to-day standpoint, there is no more hands-on work required than with any other server or blade system.
Yes, we still need to set up more email and SNMP alerts so we can be proactive if and when there is an issue. Those things will come as time permits.