Here are a few things that have jumped out so far that I like with the 1.4 code:
– Support for the new B230 blades; half size blade with 2 sockets and 32 DIMM slots (16 cores and 256 GB memory!!)
– Able to manage the C-Series servers via UCS Manager (the rack servers)
– Power capping to cover groups of chassis; this is very powerful now. Think about it, you can have 4 chassis in 1 rack all sharing 2 or 4 – 220 circuits. Now you can cap, monitor and manage the amount of power by groups of chassis not just per blade or chassis.
– Software packaging for new server hardware that does NOT require the IO fabric upgrade to use the new servers. Nice!
– Bigger and Badder! Support for up to 1024 VLANs and up to 20 UCS chassis (160 servers from 1 management point!!).
– Fiber Channel connectivity options: you can now do port channeling and FC trunking, as well as some limited direct connection of FC-based storage (no zoning ability . . yet).
OK the list goes on and on, they have packed a lot into this release.
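To make the group power capping idea a little more concrete, here is a minimal Python sketch. It is not how UCSM implements capping; the chassis labels, wattages and the rack budget are all made-up numbers, used only to show the "budget per group of chassis" arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Chassis:
    name: str          # hypothetical label, not a UCSM identifier
    draw_watts: float  # measured draw (made-up numbers below)


def group_headroom(chassis_list, group_cap_watts):
    """Return the watts left under the group cap (negative means over the cap)."""
    total = sum(c.draw_watts for c in chassis_list)
    return group_cap_watts - total


# Four chassis in one rack sharing one power budget (all values assumed).
rack = [
    Chassis("chassis-1", 2100.0),
    Chassis("chassis-2", 1850.0),
    Chassis("chassis-3", 2400.0),
    Chassis("chassis-4", 1650.0),
]
RACK_CAP_WATTS = 9000.0  # assumed budget for the shared circuits

headroom = group_headroom(rack, RACK_CAP_WATTS)
print(f"Rack headroom under the group cap: {headroom:.0f} W")
```

The point is only that the cap is checked against the sum for the group, so one busy chassis can borrow from a quiet one as long as the rack stays under its budget.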
Checking out the new items in UCSM, I had to grab a few screen shots of the following:
Power! You ever wonder how much power a server or chassis is using? Now you know, check this out! I am loving this.
For those UCS users out there, it has not always been very clear what the impact of making various changes to a Service Profile might do to the workload. They have improved with each release, but this is some great detail now:
Cool stuff in Cisco UCS 1.4 code, I hope to have more time to share with everyone as we continue to maximize our investment. Time to go home . . .
Here is a cool new feature in the UCSM 1.2(1b) code that is going to come in handy. You get to the Cisco UCS Manager (UCSM) via a web browser. Now when you hit the main web page you have the option to run the UCSM or something new called the UCS – KVM Launch Manager.
So why is UCS – KVM Launch Manager a cool thing?
We run a few Windows servers directly on UCS b M1 blades. The system administrators of those boxes have to connect to them from time to time; this is typically done with an RDC connection with the -console switch. If there is a problem with that approach they would need to log into UCSM and connect to the KVM. Now, if KVM access is all that system admin needs for that server, I can have them use the UCS KVM Launch Manager and they can launch a KVM session from a secure web page using their AD login. This is a nice new feature.
There was a lot of interest in my last post about our code upgrade. What stood out to many (as well as us) was the 24 second disruption in network traffic on each fabric interconnect side when it was rebooted. Meaning I upgrade the B side and when that fabric interconnect goes into its reboot all of the Ethernet traffic (remember the fiber channel traffic had NO issues) that was talking on the B side is disrupted for 24 seconds.
The answer was in my core 6513 configuration regarding spanning tree. I would like to thank Jeremiah at Varrow and the guys at Cisco who helped us figure this out.
Turns out that one of the first configuration confirmation items in the code upgrade process (really it should have been set up all along . . .) was making sure the port channels that the fabric interconnects are connected to are set with spanning-tree portfast trunk. An email was sent to get this confirmed and configured but it got missed; too bad it was not in the Cisco Pre-Requisite document as a reminder. What this command gives you is that if and when the trunk port link to the fabric interconnect goes away for any reason, the 6513 will not go through the normal spanning tree timers and will quickly allow the traffic to flow on the remaining path (in our case the remaining connection to the fabric interconnects).
We have now enabled spanning-tree portfast trunk on our port channels and should be positioned now to eliminate that pesky 24 second Ethernet disruption that impacted some of the traffic. Details, details!
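For anyone who wants to stage this change ahead of time, here is a small Python sketch that prints the config lines for each port channel. The spanning-tree portfast trunk command is the one discussed above; the Port-channel29 and Port-channel30 interface names are my assumption based on our port group numbers, so check them against your own 6500 before pasting anything.

```python
def portfast_trunk_config(port_channels):
    """Build IOS-style config lines enabling portfast trunk on each port channel."""
    lines = []
    for number in port_channels:
        lines.append(f"interface Port-channel{number}")
        lines.append(" spanning-tree portfast trunk")
    return lines


# 29 and 30 are assumed port-channel numbers for the A and B fabric interconnects.
for line in portfast_trunk_config([29, 30]):
    print(line)
```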
I have been blogging for a while now about the planned code upgrade to our production UCS environment, and we finally cut over to 1.2(1b) on all the system components! Success. Here is a quick rundown.
We decided to go with the 1.2(1b) code because the main difference between it and 1.1(1l) was the support for the new Nehalem CPU that will be available in the B200 and B250 M2 blades in about a month. We want to start running with these new CPUs this summer; more cores means more guests and lower cost.
The documentation from Cisco on the process was pretty good and provided great step by step instructions and what to expect. We followed it closely and did not have any issues, all worked as expected.
Here is how we did it:
First step was to perform a backup of the UCS configuration (you always want a fall back, but we did not need it).
We started with upgrading the BMC on each blade via the Firmware Management Tab; this does not disrupt the servers and was done during the day in about 30 minutes. We took it slow on the first 8 BMC and then did a batch job for the last 8.
At 6 PM we ran through the “Prerequisite to Upgrade . . .” document a second time to confirm all components were healthy and ready for an upgrade; no issues. Next we confirmed that all HBA multipath software was healthy seeing all 4 paths as well as confirmed NIC teaming was healthy; no issues.
At 6:30 PM we pre-staged the new code on the FEX (IO modules) in each chassis. This meant we clicked “Set Startup Version Only” for all 6 modules (2 per chassis times 3). Because we checked the box for “Set Startup Version Only” there was NO disruption of any servers, nothing is rebooted at this time.
At 6:50 PM we performed the upgrade to the UCS Manager software which is a matter of activating it via the Firmware Management tab. No issues and it took less than 5 minutes. We were able to login and perform the remaining tasks listed below when it was complete. Note, this step does NOT disrupt any server functions, everything continues to work normally.
At 7:00 PM, it was time for the stressful part, the activation of the new code on the fabric interconnects which results in a reboot of the subordinate side of the UCS system (or the B side in my case). To prepare for this step we did a few things because all the documentation indicated there can be “up to a minute disruption” of network connectivity (it does NOT impact the storage I/O; fiber channel protocol and multipath takes care of it) during the reboot. This disruption is related to the arp-cache on the fabric interconnects I believe, here is what we experienced.
UCS fabric interconnect A is connected to the 6513 core Ethernet switch port group 29. UCS fabric interconnect B is connected to the 6513 core Ethernet switch port group 30. During normal functioning the traffic is pretty balanced between the two port groups about 60/40.
My assumption was that when the B side goes down for the reboot, we would flush the arp-cache for port group 30 and then the 6513 would quickly re-learn that all the MAC addresses now reside on port group 29. Well, it did not actually work like that . . . when the B side rebooted the 6513 cleared the arp-cache on port group 30 right away on its own and it took about 24 seconds (yes I was timing it) for the disrupted traffic to start flowing via port group 29 (the A side). Once the B side finished its reboot in 11 minutes (the documentation indicated 10 mins.) traffic automatically began flowing through both the A and B sides again as normal.
So what was happening for the 24 seconds? I suspect it was the arp-cache on the A side fabric interconnect knew all the MACs that were talking on the B side so it would not pass that traffic until it timed out and relearned.
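I timed the disruption by hand, but the same measurement can be pulled out of a ping log. Here is a rough Python sketch, assuming you have already turned the log into (timestamp, reply received) pairs; the sample data below is synthetic, built to look like what we saw, and is not our actual capture.

```python
def longest_outage(samples):
    """Return the longest gap in seconds between the last good reply before a
    run of failures and the first good reply after it.

    samples: list of (timestamp_seconds, reply_received) tuples in time order.
    """
    longest = 0.0
    last_ok = None
    in_outage = False
    for ts, ok in samples:
        if ok:
            if in_outage and last_ok is not None:
                longest = max(longest, ts - last_ok)
            last_ok = ts
            in_outage = False
        else:
            in_outage = True
    return longest


# Synthetic one-ping-per-second log: replies stop at t=3 and resume at t=26.
samples = [(t, not (3 <= t <= 25)) for t in range(30)]
print("longest disruption (s):", longest_outage(samples))
```

Measuring from the last good reply to the first good reply slightly overstates the outage (by up to one ping interval on each side), which is fine for a "was it 2 seconds or 24 seconds" kind of question.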
As I have posted previously we run our vCenter on a UCS blade using NIC teaming. I had confirmed vCenter was talking to the network on the A side, so after we experienced the 24 second disruption on the B side I forced my vCenter traffic to the B side before rebooting the A side. This way we did not drop any packets to vCenter (did this by disabling the NIC in the OS that was connected to the A side and let NIC teaming use only the B side).
This approach worked great for vCenter; we did not lose connectivity when the A side was rebooted. However, I should have followed this same approach with all of my ESX hosts because most of them were talking on the A side. VMware HA did not like the 27 second disruption and was confused afterwards for a while (however, full HA did NOT kick in). All of the hosts came back, as well as all of the guests except for 3: 1 test server, 1 Citrix Provisioning server and 1 database server had to be restarted due to the disruption in network traffic (again the storage I/O was NOT disrupted; multipath worked great).
Overall it went very well and we are pleased with the results. Our remaining tasks are to apply the blade bios updates to the rest of the blades (we did 5 of them tonight) using the Service Profile policy — Host Firmware Packages. These will be done by putting each ESX into Maintenance Mode and rebooting the blade. It takes about 2 utility boots for it to take effect or about 10 minutes each server. Should have this done by Wednesday.
What I liked:
– You have control of each step, as the admin you get to decide when to reboot components.
– You can update each item one at a time or in batches as your comfort level allows.
– The documentation was correct and accurate.
What can be improved:
– Need to eliminate the 24 to 27 second Ethernet disruption which is probably due to the arp-cache. Cisco has added a “MAC address Table Aging” setting in the Equipment Global Policies area, maybe this already addresses it.
Our plan is to have the blade firmware (interface and BMC) upgrade completed by next Wed. After that it will be time to move up the stack to the I/O devices and manager. Based on schedules these items will probably be scheduled in about 2 weeks.
Impact for the last 3 items:
Based on how the system works, if it functions as designed, we should experience up to one minute loss of connectivity or downtime. Explanation of the process:
A chassis contains 2 fabric extenders on an A and B side (called FEX or I/O modules). These are connected to the switch called the fabric interconnect (FI or 6120). The UCS Manager (UCSM) tool resides on the 6120s and runs as an active/passive cluster.
Summary of Steps for I/O Upgrade:
1. Confirm all servers have some form of NIC teaming functioning so they can handle the loss of one I/O path.
2. Confirm the northbound Ethernet connections between the FI (6120) and the 6513 switch have spanning-tree portfast enabled.
3. Pre-stage the new firmware on all 6 (2 per chassis) FEX modules to be applied on next reboot.
4. Pre-stage the new firmware on both of the Fabric Interconnects (6120) to be applied on next reboot (I am trying to confirm that you can “activate in startup version” on the 6120).
5. Reboot only the “passive” Fabric Interconnect (6120). Note the FEX modules are an extension of the FI (6120), so when the FI is rebooted the connected FEX modules reboot at the same time (meaning all of the B side I/O path is rebooted at the same time). It can take up to 10 minutes for the full reboot process to complete. During this time all workload traffic should be functioning via the remaining I/O path.
6. Once step 5 is fully completed perform a reboot on the remaining FI (6120). This should be done soon after confirming step 5 has completed. You want to keep both FI (6120) running the same code version when at all possible.
Again we are told the above steps will most likely result in up to one minute interruption of I/O to the blade servers. That is if all functions as designed.
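The important part of the steps above is the ordering: each step is a gate for the next one. Here is a minimal Python sketch of that idea. The checks are placeholders (each lambda stands in for a manual verification we would do by hand), and nothing here talks to UCSM.

```python
def run_steps(steps):
    """Run (description, check) pairs in order and stop at the first failure.

    Returns the list of descriptions that completed.
    """
    completed = []
    for description, check in steps:
        if not check():
            print(f"STOP: '{description}' did not pass")
            break
        completed.append(description)
        print(f"done: {description}")
    return completed


# Placeholder checks: replace each lambda with a real verification.
io_upgrade = [
    ("NIC teaming confirmed on all servers", lambda: True),
    ("portfast enabled on northbound links", lambda: True),
    ("firmware pre-staged on all 6 FEX modules", lambda: True),
    ("firmware pre-staged on both fabric interconnects", lambda: True),
    ("passive fabric interconnect rebooted and healthy", lambda: True),
    ("remaining fabric interconnect rebooted and healthy", lambda: True),
]

finished = run_steps(io_upgrade)
```

If the passive side does not come back healthy, the run stops there and the remaining fabric interconnect is never touched, which is exactly the behavior you want in production.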
Summary of Steps for UCSM Upgrade (Post I/O upgrade):
1. Perform the UCSM upgrade on the passive FI (6120) device, it will “restart or reboot” the UCSM process NOT the FI (6120).
2. Perform the UCSM upgrade on the active FI (6120) device, it will “restart or reboot” the UCSM process NOT the FI (6120).
The UCSM can be upgraded and restarted without affecting applications running on UCS; it does not impact I/O functions.
Putting this into perspective . . . what is the disruption when you have to update code on an Ethernet switch or Fiber Channel switch?
I am real curious to find out if we do have about 1 minute of I/O disruption. Logically, I would think there is no disruption if you have some level of NIC teaming at the blade and you upgrade the B and then A side. To be continued . . .
We are making progress with the server component of the upgrade process.
Our spare blade with an Emulex mezz. card running 1.1(1l) code for the interface and BMC was ready for a real workload. We put one of our production ESX hosts into maintenance mode and powered off the blade. Then we unassociated it and associated it to the physical blade running the 1.1(1l) code with the same type of Emulex card. The server booted with no issues, we confirmed all was normal and after about 30 minutes we had it back in the live cluster.
We repeated this process on 2 additional ESX hosts and now are running with 3 of the 10 ESX servers in the UCS cluster with the new firmware with no issues. The plan is to do several more tomorrow, maybe the rest of them. Very positive results.
Two ways to update endnode firmware (meaning the blade):
As I was reading through the “how to upgrade UCS…” release notes I recalled some early discussions when I was looking at purchasing UCS. There are 2 ways to update the firmware on the Interface and BMC, etc. We have been using the method of the UCSM tool to go to the physical blade and update it at this level. The other way is via the Service Profile Host Firmware Package policy.
This makes it pretty interesting once you think about it. Instead of thinking of firmware by hardware you think of it as the workload (Service Profile). Let’s say my W2K3 server interface can only run on 1.0(2b) firmware and I need to make sure that regardless of the physical blade it is running on that the correct firmware is there. By using a Service Profile firmware policy you can make that happen. So when you move the W2K3 workload from chassis 1 blade 3 to chassis 3 blade 7 the Service Profile (via the Cisco Utility OS) drops in the approved firmware version. Pretty cool to think about.
Note there is at least one drawback to the Service Profile approach. This firmware policy is auto updating, so if you make a change in firmware version it will automatically apply the change and restart the server. This means you have to be careful in how you use this as a means to perform the updates. (When doing firmware updates via UCSM you DO have the ability to control when you reboot the workload.)
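Here is a toy Python model of the "firmware follows the workload" idea. It is only a sketch of the concept as I understand it, not the real UCSM object model; the class names, the profile name and the single version field are simplifications I made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Blade:
    slot: str              # e.g. "chassis 3 blade 7"
    adapter_firmware: str  # version currently on the hardware


@dataclass
class ServiceProfile:
    name: str               # hypothetical profile name
    required_firmware: str  # version pinned by the firmware policy


def associate(profile, blade):
    """Associate a profile with a blade; return True if firmware was changed."""
    if blade.adapter_firmware == profile.required_firmware:
        return False
    # In this sketch the policy simply overwrites the blade's version.
    blade.adapter_firmware = profile.required_firmware
    return True


w2k3 = ServiceProfile("w2k3-app", required_firmware="1.0(2b)")
target = Blade("chassis 3 blade 7", adapter_firmware="1.1(1l)")
changed = associate(w2k3, target)
print(target.slot, "now runs", target.adapter_firmware, "- changed:", changed)
```

The drawback mentioned above shows up here too: the change happens as a side effect of the association, with no separate "are you sure you want to reboot now" step.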
Driving into work today I realized I left out some key items regarding upgrading code in UCS Manager (UCSM).
One of the things that really makes UCS different is the central UCSM running on chips in the Fabric Interconnects (6120). Having one location to control and manage the full system really simplifies all management tasks, including firmware updates. Ok, that sounds great from a design and marketing perspective but here is the everyday “does it make your job easier” side.
1. All firmware updates are done in one location: Equipment — Firmware Management. This provides a tree structure displaying all components, running version, startup version and backup version. You can filter this view any way you like to only see the components you are interested in. So to see which blades are running x firmware is a quick process.
NOTE: You will see in the screen shot Server 2, 3 and 4 are each in a different state regarding firmware. Server 2 is live running a W2K3 workload with 1.0(2d) version and 1.0(1e) as the backup. Server 3 is in the state of having 1.1(1l) as Startup Version with the Activate Status as pending-next-boot. Then Server 4 is running 1.1(1l) code with 1.0(2d) version as backup. Server 4 is ready for an ESX Service Profile to be moved to it for testing next week.
2. Ability to perform firmware updates/changes to one or many components. Since UCSM gives you a full view of the system you can either select an individual component (blade 2 in chassis 3) or select many (all blades in chassis 1) to perform a firmware task. This worked well for us. After doing the pre-stage (called Update Firmware) and activation of new firmware on 2 individual blades we were comfortable with performing a group pre-stage and activation to the remaining 5 spare blades. This is a parallel process, so it is about the same amount of time to update 1 blade or 10 blades. Also, the UCSM interface is very good at automatically updating where things are in the process via the Server FSM tab or on the General tab — Status Details.
3. Ability to Pre-Stage (Update Firmware) without applying the new firmware. This goes hand in hand with the point above. You can get your new code on all of the components without impacting the servers. I like this ability because it allows you to really control each step of the process and visually see your current state.
4. Ability to move new firmware to “Startup Version” allowing you to control when the new code takes effect. This step is done under the Activate Firmware task. You have the option to either have the activation process only place the new firmware as the Startup Version or it can automatically also reboot the device/component so the new firmware goes into use. This step is nice because you as the admin get to decide how you want to “activate” the change. You can choose a one by one, slow, methodical approach or a quick rollout process based on your organization’s needs.
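The running / startup / backup versions and the Update versus Activate steps are easier to follow as a small state model. This Python sketch is my simplified reading of the behavior described above, not Cisco's implementation; the class, the method names and the "ready" status string are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    name: str
    running: str
    backup: Optional[str] = None
    startup: Optional[str] = None   # version to load on the next boot
    activate_status: str = "ready"  # assumed label for "nothing pending"

    def __post_init__(self):
        if self.startup is None:
            self.startup = self.running

    def update(self, version):
        """Pre-stage: copy new code into the backup slot; nothing restarts."""
        self.backup = version

    def activate(self, startup_only=True):
        """Promote the staged version to startup; optionally reboot right away."""
        if self.backup is None:
            raise ValueError("nothing staged; run update() first")
        self.startup = self.backup
        self.activate_status = "pending-next-boot"
        if not startup_only:
            self.reboot()

    def reboot(self):
        """Boot into the startup version; the old running version becomes backup."""
        if self.startup != self.running:
            self.running, self.backup = self.startup, self.running
        self.activate_status = "ready"


blade = Component("server-3", running="1.0(2d)", backup="1.0(1e)")
blade.update("1.1(1l)")            # pre-stage only
blade.activate(startup_only=True)  # like ticking "Set Startup Version Only"
print(blade.running, blade.startup, blade.activate_status)
blade.reboot()                     # the admin picks the moment
print(blade.running, blade.backup, blade.activate_status)
```

After the reboot the blade runs 1.1(1l) with 1.0(2d) as its backup, which gives you a version to fall back to.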
I hope this information helps with understanding some fundamental differences you have with firmware updates in Cisco UCS.
We have been working on the process for performing the upgrade on the full UCS system to go from 1.0 to 1.1 code, which will support the VIC interface card (Palo Card).
Our approach has been to take things one step at a time and to understand the process as we move forward. We downloaded the 1.1(1l) bin file and release notes and have gone through them. Pretty basic info in the notes, steps you through each part of the process well and provides enough info to do it. However, since we are in production we are using caution, I do not want any surprises.
Steps in full UCS Code Upgrade:
1. Server Interface firmware upgrade, requires a reboot.
2. Server BMC (KVM) firmware upgrade, reboot is non-disruptive.
3. Chassis FEX (fabric extender), requires reboot.
4. UCS Manager, requires reboot.
5. Fabric Interconnect (6120), requires a reboot.
First step was to open a ticket with Cisco. This allows us to document everything in the ticket, allows Cisco to use it to pull in the correct resources and gives us a single point of contact. Next we requested a conference call with TAC and key engineers at Cisco to talk through the process, things to check beforehand, what has worked in their labs, and other customers.
The call covered items like making sure your Ethernet connections northbound (in our case a 6500) have spanning-tree port-fast enabled, confirming your servers are using some type of NIC teaming or have failover enabled on the vNIC in UCSM, and understanding the differences in UCSM between “update firmware” (moves the new code to the device, stages it) and “activate firmware” (moves the new code into startup position so next reboot it takes effect). Special note: when you activate firmware, if you do NOT want the item to automatically reboot you need to check the box for “Set Startup Version Only”. This will move the new firmware to the “startup” position but not perform the reboot. When the activation is complete the item will show an Activate Status of pending-next-boot. However, you do not have this option when doing the BMC (KVM) update. The BMC update reboot does not affect the server system, so it will reboot once the BMC code is activated.
Most of this information can be found in the release notes and I am sure you don’t want me to just reiterate it to you. We have been through all of the upgrade steps the first week UCS arrived so I am pretty confident the process will work. The challenge now is how to do it with a production system. I am fortunate to have only 17 blades in production. This leaves me with the 1 spare blade with an Emulex mezz. card and 6 spare blades with the VIC mezz. cards. This provides me with a test pool of blades (Note: these are only “spare” until we get through the upgrade process, then they will go into the VMware cluster and be put to use).
Goal: Upgrade UCS as Close to Non-Disruptive as Possible:
Our concern is will a blade function normally running 1.1 code when all the other components are still on 1.0 code? I suspect it will. If it does work, the plan is to take our time to update all of the blades over a week or so following the below steps.
This week we updated and activated the firmware on the Interface Cards for all 7 spare blades. Next we did the same thing for the BMC firmware for the 7 spare blades.
Next step is to put an ESX host into maintenance mode and move the Service Profile to the blade running 1.1 code with an Emulex mezz. card. We can confirm the ESX host functions and then move some test, dev, and then low end prod servers to this ESX host. This will allow us to develop a comfort level with the new code. This step should fit into our schedule early next week. If we see no issues we will be able to proceed with the process of putting additional ESX hosts into Maintenance Mode, then in UCSM update the firmware, activate firmware and then reboot the blade. This process will allow the ESX cluster blades to get fully updated with no impact on the applications.
For our W2K8 server running vCenter we will perform a backup and schedule a 15 min downtime to reboot it and move the Service Profile to our spare blade running 1.1 code with the Emulex mezz. card. We can then confirm functionality, etc. If there are issues we can just move the Service Profile back to the server running 1.0 code and we are back in service. This same process will be repeated for the 2 – W2K3 servers running on UCS blades.
By using the flexibility built into VMware that we all use and love (vMotion, DRS, etc.) and the hardware abstraction provided by the UCS Service Profiles the process of updating the blade interface and BMC firmware should be straight forward with minimal impact on the end users. I will give an update next week . . .
To continue on with the foundational concepts of the UCS Manager . . .
You find templates available under Server tab (Service Profile Templates), LAN tab (vNIC Templates) and SAN tab (vHBA templates). When creating a new template you select the type to be Initial or Updating; the difference being an updating template will “update” any changes to objects using that template. A vNIC Updating Template that was changed from native VLAN 2 to native VLAN 100 will apply that change to all objects using that template. An Initial Template will maintain the settings defined at the time of creation and not change. Which type of Template you use or combination of type will depend on your workflow and change management.
In our case we chose to use Updating vNIC and vHBA Templates and Initial Service Profile Templates. This seems to give us flexibility with possible changes in the future. For example, we created a vNIC Updating Template for ESX servers with 12 trunked VLANs and the service console VLAN set as Native. As we add additional VLANs in the future we only have to make this change to the updating vNIC Template and it will populate to all of our ESX hosts (I think this change would not require a reboot).
Service Profile Templates are one of the ways to generate a new Service Profile. A “working server” is made up of a physical blade being associated with a Service Profile. When creating a Service Profile Template you define various functions using the values created in policies, pools and templates. Meaning you select which boot policy, local storage policy, vHBA setting, vNIC settings and blade assignment you want to use for this Service Profile Template. The other ways to create a Service Profile are to clone an existing one or create it from scratch.
The Service Profile is the key to the stateless function of the UCS system. This is what abstracts the physical identifiers from the hardware and allows you to move a “server” between physical blades. We were demoing this concept/process and immediately the person said, “oh, that is like vMotion but for the physical level”. Please note, your “server” has to be powered down to perform this move to a different physical blade, however, I think it is only a matter of time before that changes.
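The Initial versus Updating difference can be shown with a few lines of Python. This is only a conceptual sketch under my own assumptions (a template that remembers the vNICs created from it); the class and method names are made up and are not UCSM API calls.

```python
class VnicTemplate:
    def __init__(self, name, native_vlan, updating):
        self.name = name
        self.native_vlan = native_vlan
        self.updating = updating  # True = Updating, False = Initial
        self._bound = []          # vNICs created from this template

    def create_vnic(self, vnic_name):
        vnic = {"name": vnic_name, "native_vlan": self.native_vlan}
        self._bound.append(vnic)
        return vnic

    def set_native_vlan(self, vlan_id):
        self.native_vlan = vlan_id
        if self.updating:
            # Updating template: push the change to every existing vNIC.
            for vnic in self._bound:
                vnic["native_vlan"] = vlan_id


updating_tpl = VnicTemplate("esx-vnic", native_vlan=2, updating=True)
initial_tpl = VnicTemplate("win-vnic", native_vlan=2, updating=False)
esx_nic = updating_tpl.create_vnic("esx01-eth0")
win_nic = initial_tpl.create_vnic("win01-eth0")

updating_tpl.set_native_vlan(100)
initial_tpl.set_native_vlan(100)
print(esx_nic["native_vlan"], win_nic["native_vlan"])
```

The vNIC from the Updating template follows the change to VLAN 100, while the one from the Initial template keeps VLAN 2; only vNICs created after the change pick up the new value.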
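Here is a small Python sketch of what "the identity lives in the Service Profile, not the blade" means in practice, including the power-down rule. The classes and the UUID, MAC and WWPN values are invented for illustration and are not taken from our system or from the UCSM object model.

```python
class Blade:
    def __init__(self, slot):
        self.slot = slot
        self.profile = None


class ServiceProfile:
    def __init__(self, name, uuid, macs, wwpns):
        self.name = name
        self.uuid = uuid    # identity travels with the profile
        self.macs = macs
        self.wwpns = wwpns
        self.blade = None
        self.powered_on = False

    def associate(self, blade):
        if self.powered_on:
            raise RuntimeError("power the server down before moving it")
        if blade.profile is not None:
            raise RuntimeError(f"{blade.slot} already has a profile")
        if self.blade is not None:
            self.blade.profile = None  # disassociate from the old blade
        self.blade = blade
        blade.profile = self


# Made-up identifiers for illustration only.
esx = ServiceProfile("esx-host-01", uuid="0000-000000000001",
                     macs=["00:25:b5:00:00:01"],
                     wwpns=["20:00:00:25:b5:0a:00:01"])
old = Blade("chassis 1 blade 3")
new = Blade("chassis 3 blade 7")

esx.associate(old)
esx.powered_on = True
try:
    esx.associate(new)
except RuntimeError as err:
    print("refused:", err)

esx.powered_on = False
esx.associate(new)
print(esx.name, "now on", esx.blade.slot, "with MACs", esx.macs)
```

The MACs and WWPNs never change during the move, which is why the network and the SAN see the same "server" on a different piece of hardware.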
How to Name something in UCS, get it right the first time:
You know how in most applications you are able to give friendly names to objects and items and you can change them? For example, in EMC Navisphere you can name a LUN anything you want and you can change it? Well, UCS was designed to use the name as the value that identifies an object, meaning once you give something a name you cannot change the name. You learn this pretty early on in your configuration process; there were many times when we had to delete a pool, template, etc. because we did not follow the correct naming convention or did not have something named correctly. Because of this we had a defined step in our process to go through and clean up all the “junk” from the first few days of building and testing (said goodbye to “foo” service profile).
LAN tab Concepts:
The LAN Cloud refers to the northbound LAN, the connection of the 6120 to the rest of the LAN. The Internal LAN refers to the southbound LAN connections to the chassis and blades.
With a vNIC and vHBA, when the blade comes up it will be assigned to a 6120 northbound I/O port by some process within the system. UCS gives you the ability to “pin” a MAC or WWPN to a specific I/O port (Ethernet or Fiber Channel). Let’s say you have a blade running Microsoft SQL and you wanted to make sure that blade always had a dedicated 4 Gb fiber channel port to the SAN fabric. You can define a SAN Pin Group to always use FC port 2/4 on Fabric A and FC port 2/3 on Fabric B. I think the power of Pin Groups will come more into play once you can use the Palo CNA adaptor and you can Pin a VM guest to specific ports. We have not used pin groups in our configuration, not sure if we will.
The color Cisco picked to represent the fiber channel ports in the GUI is interesting, red. It took a few days to get used to seeing red ports for the FC and not wanting to figure out what was wrong with them. I do not know if that was the best color choice. The LAN ports are done in a Carolina blue.
There are several foundational concepts of the UCS Manager that need to be understood before being able to grasp the new way of deploying and managing blades. Here is part 1 of an explanation of the concepts as I currently understand them.
By default there is the “root” organization. Organizations are global to UCS, meaning if you create an organization for example called “finance” it will exist in all areas of the system. You may want to use sub-organizations as a means to organize objects, control access or manage access within the system. For example, your finance group may have 6 blades they manage and have control over. You can allow the finance users to see and manage only items in the finance organization.
You can think of Organizations as being similar to Organizational Units (OU) in Active Directory.
Policies exist under the 3 areas of Server, LAN and SAN. A policy is a defined parameter that is assigned to service profiles. For example, the Server policy called Boot Policy is where you define various means to boot a blade; via LAN, boot from SAN, local disk, etc.
Cisco defines the policies as well as the options within each policy. Policies are dynamic: if you change a setting within a policy it will take effect wherever that policy is used (it may require a reboot of the blade depending on the policy). Over time I can see Cisco adding additional functionality by adding new policies. Things like a BIOS Boot policy to allow you to have groups of blades boot to different BIOS levels, etc.
A pool is a defined set of values to be used by service profiles. Currently you can define pools for the following items:
Server Tab: Server pool (physical blades) and UUID suffix
LAN Tab: MAC address pool
SAN Tab: WWNN and WWPN pools
This is where you get to define what MAC address, WWNN and WWPN values you want to use, which can be powerful. This is because you can pre-define your WWPN and zone your fabric ahead of time, so your server people are not waiting on your storage people to zone new servers.
For example, we created 2 WWPN pools for our ESX servers; 24 WWPNs for fabric A and 24 for fabric B. Then we exported the list of generated WWPN and put them into my spreadsheet that generates all of the CLI commands for creating zones on my Cisco MDS switches. Within 20 minutes we had pre-zoned my next 24 ESX hosts on the SAN fabric. As we add ESX hosts via UCS each will get a WWPN from the pool and already be zoned to my storage array. Note, the storage person will still get to control what storage resources the new host sees.
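We did this with a spreadsheet, but the same idea fits in a few lines of Python. This is a sketch, not our actual tooling: the WWPN prefix, the starting values, the host names, the target port WWPN and VSAN 10 are all assumptions for illustration. The zone name and member pwwn lines follow the usual MDS zoning syntax, but check them against your own switch before using anything like this.

```python
def wwpn_block(prefix, start, count):
    """Generate `count` WWPNs: a 5-byte prefix followed by a 3-byte counter."""
    wwpns = []
    for i in range(start, start + count):
        tail = f"{i:06x}"
        suffix = ":".join(tail[j:j + 2] for j in (0, 2, 4))
        wwpns.append(f"{prefix}:{suffix}")
    return wwpns


def zone_commands(zone_name, host_wwpn, target_wwpn, vsan):
    """Return MDS-style CLI lines for a single-initiator zone."""
    return [
        f"zone name {zone_name} vsan {vsan}",
        f"  member pwwn {host_wwpn}",
        f"  member pwwn {target_wwpn}",
    ]


PREFIX = "20:00:00:25:b5"               # assumed pool prefix
TARGET_A = "50:06:01:60:00:00:00:01"    # made-up storage port WWPN
pool_a = wwpn_block(PREFIX, 0x0A0000, 24)  # 24 WWPNs for fabric A

for n, wwpn in enumerate(pool_a[:2], start=1):
    for line in zone_commands(f"esx{n:02d}_a", wwpn, TARGET_A, vsan=10):
        print(line)
```

Running the same thing with a second starting value (for example 0x0B0000) gives the fabric B pool, so the A and B WWPNs are easy to tell apart at a glance.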
In addition, you can have more than 1 range defined for any of the pools. Meaning you can define a range of MAC addresses that you only want to use for ESX hosts and another range that will always be for Windows 2008 servers. It really comes down to how you want to manage your environment. Does it add value to be able to identify that a specific MAC address is a UCS blade running W2K8? We decided it did not, but it could be useful for others. This same thought process was used when defining each of the pools. Obviously we felt the WWPN pools had value by creating ranges for specific purposes (i.e., A and B fabric).
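If you do decide to carve a pool into purpose-specific ranges, the payoff is being able to look at an address and know what it belongs to. A minimal Python sketch of that lookup follows; the block labels and MAC ranges are invented for illustration and are not the ranges we use.

```python
def mac_to_int(mac):
    """Convert aa:bb:cc:dd:ee:ff into an integer for range comparisons."""
    return int(mac.replace(":", ""), 16)


def find_block(mac, blocks):
    """Return the label of the block containing `mac`, or None if no block matches."""
    value = mac_to_int(mac)
    for label, (first, last) in blocks.items():
        if mac_to_int(first) <= value <= mac_to_int(last):
            return label
    return None


# Hypothetical ranges: one block per purpose.
mac_blocks = {
    "esx": ("00:25:b5:00:00:00", "00:25:b5:00:00:ff"),
    "w2k8": ("00:25:b5:01:00:00", "00:25:b5:01:00:ff"),
}

print(find_block("00:25:b5:01:00:0a", mac_blocks))
```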
That is a start, more to come later.