HealthITGuy's Blog

The evolving world of healthcare IT!

What is Cisco Support Like for UCS?

A common question I get when talking with others about our Cisco UCS production environment is whether we have had any issues that required us to deal with Cisco TAC. As with anything, we have had a few things that required a call. By the way, the phone number is the same for any Cisco product. Here are a few examples for you.

One of our first calls had to do with a failed 8 GB DIMM in one B200 M1 server blade. We noticed a warning light on the blade and went to the UCS Manager to investigate. We were able to quickly drill down to the affected blade's inventory and go to the memory tab. This screen provided the failed DIMM's slot location and confirmed its failed status. Since the workload running on this blade was VMware ESX, we put it into maintenance mode, powered down the blade and replaced the DIMM with a spare. It was time to open a ticket with TAC.

The TAC engineer took down our information and sent out a replacement DIMM within 4 hours, and we were done with the ticket. I asked our server person what he thought of dealing with TAC and he did not expect it to be that easy. In the past with other server vendors we would typically have had to run a diagnostic tool to determine which DIMM had failed and then open a trouble ticket. We would have to down the server, re-seat the DIMM, and wait for it to fail again. Only once it failed again would we get a replacement. So this call process with Cisco was much smoother.

Another trouble ticket was related to a VMware ESX host that, after a reboot, would not see the boot partition. After some troubleshooting it was clearly an ESX OS issue, and our VMware admin was ready to re-image the server. However, we thought this would be a good test for Cisco TAC, so we opened a ticket. We were surprised when TAC gave the case to an ESX specialist at Cisco who, within 20 minutes, had resolved the issue and had the server back in production. So our expectations were exceeded again.

The one trouble ticket that took some time was when we wanted to install Windows 2003 Standard 64-bit bare metal on a blade with the Emulex interface card. This is easy to do with Windows 2008; the challenge was getting the right drivers on a media type that the Windows 2003 installation process could recognize. It wanted to see the drivers on either a CD or a floppy disk, which we provided by emulating the media. I personally did not work this ticket, but it took over 3 days to get everything completed. In the end we now have a process down and 2 servers in production.

Overall, Cisco has exceeded our expectations when it comes to dealing with trouble tickets around the UCS products. It has been clear to us that Cisco has put the resources into support and has the right folks in place to deal with the variety of issues customers may run into.


June 25, 2010 Posted by | Cisco UCS, General | , | Leave a comment

Victims of Consolidation

We have been spending some time cleaning up in the datacenter, pulling out all of the old server hardware that is left over from migrations to a virtual environment.  In this most recent round of cleanup there are over 60 old physical servers in these stacks, which provided a lot of compute cycles for us in the past.  Their time has come to an end.  And to think those 60 workloads can easily run on 2 Cisco UCS B200-M1 blades now with VMware ESX 4.x and EMC PowerPath/VE!

June 10, 2010 Posted by | General, VMWare | Leave a comment

Cisco Developer Network for Unified Computing

Back in April 2010 Cisco added Unified Computing to the Cisco Developer Network.  The idea of the network is to provide resources, forums and blog areas focused on development with the UCS Manager's open XML API.  While I was in San Jose I spoke with a new Cisco person who is going to be focused on expanding this area and its content.  So today I finally took some time to check it out . . .

You can tell it is new and the content is just getting started, but there is the beginning of some cool stuff.  Currently it is organized under Unified Computing — UCS Manager.

You will find sample code in the blog and resource sections.  The meat of the resources is developer guides, some white papers and many tools and samples: PowerShell scripts, other scripts and XML samples.  I found the Getting Started section to be my favorite area.  This section gives you a place to start digging into the XML API and how to approach coding against it.  Make sure to check out the video on this page, which gives a good introductory overview of the XML API.
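To give a flavor of what coding against it looks like, here is a minimal login sketch in Python.  The host name and credentials are made up, and using the requests library is my choice rather than anything the Developer Network prescribes; the aaaLogin/aaaLogout calls follow the XML API developer guide, but check the current documentation (or try it against the UCS Platform Emulator) before relying on it.

```python
# Minimal sketch: log in to the UCS Manager XML API and fetch a session cookie.
# The UCSM address and credentials below are hypothetical.
import xml.etree.ElementTree as ET

import requests

UCSM_URL = "https://ucsm.example.local/nuova"  # hypothetical UCS Manager address


def ucs_login(name, password):
    """Send an aaaLogin request and return the session cookie."""
    body = '<aaaLogin inName="{0}" inPassword="{1}" />'.format(name, password)
    resp = requests.post(UCSM_URL, data=body, verify=False)  # UCSM often uses a self-signed cert
    resp.raise_for_status()
    root = ET.fromstring(resp.text)
    if root.get("errorCode"):
        raise RuntimeError(root.get("errorDescr", "login failed"))
    return root.get("outCookie")


def ucs_logout(cookie):
    """Close the session so it does not linger in UCS Manager."""
    requests.post(UCSM_URL, data='<aaaLogout inCookie="{0}" />'.format(cookie), verify=False)


if __name__ == "__main__":
    cookie = ucs_login("admin", "password")  # made-up credentials for illustration
    print("Session cookie:", cookie)
    ucs_logout(cookie)
```

Everything in the API works off that session cookie, which is why the Getting Started material spends time on login and session handling first.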

You will also find information on the UCS Platform Emulator, which can be used for testing and developing against.  You can request access to the emulator tool from this page.  The next item listed on the page is information about an actual UCS sandbox, hosted at a cloud provider, that is available to developers.  For those folks ready to start writing some code against the API, the tools are available to have at it.

My shop tends not to do too much in-house development, but occasionally we will write some Perl or other scripts to automate tasks against AD or for ESX/VMware tasks.  I am starting to see a few areas in UCS where we might gain some benefit from a few scripts.  For example, when we perform updates to the firmware/BIOS there are about 3 or 4 steps you perform for each service profile.  These steps could be coded and managed to perform the tasks more quickly and consistently.  As our UCS installation grows there may be more opportunities for other tasks/functions.
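As a rough illustration of the idea (not a working firmware updater), a script could first enumerate the service profiles and then loop over them, applying the same steps to each.  The sketch below assumes the configResolveClass query and lsServer class described in the XML API developer guide, reuses the hypothetical login/cookie from the sketch above, and leaves the actual per-profile firmware steps as a placeholder.

```python
# Hedged sketch: enumerate service profiles so per-profile firmware steps
# could be applied consistently.  Names follow the XML API developer guide;
# the UCSM address is hypothetical.
import xml.etree.ElementTree as ET

import requests

UCSM_URL = "https://ucsm.example.local/nuova"  # hypothetical UCS Manager address


def list_service_profiles(cookie):
    """Return (name, dn) pairs for every lsServer (service profile) object."""
    body = ('<configResolveClass cookie="{0}" classId="lsServer" '
            'inHierarchical="false" />').format(cookie)
    resp = requests.post(UCSM_URL, data=body, verify=False)
    resp.raise_for_status()
    root = ET.fromstring(resp.text)
    return [(sp.get("name"), sp.get("dn")) for sp in root.iter("lsServer")]


def update_firmware_for(cookie, dn):
    # Placeholder for the 3 or 4 per-profile steps (assign the host firmware
    # package, acknowledge, schedule the reboot, etc.) -- not implemented here.
    print("would apply firmware steps to", dn)


# Usage, with a cookie from the aaaLogin sketch earlier:
# for name, dn in list_service_profiles(cookie):
#     update_firmware_for(cookie, dn)
```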

Now if you take a broader look at what this Developer Network will mean for the growth and management of Cisco UCS, you can easily imagine how the Altiris, Tivoli, EMC Ionix, etc. of the world could eat this stuff up.  You can tell Cisco had the vision early on to make this system open so it can fit well into most existing enterprise management frameworks, whether directly from the vendors or through in-house development.

So check it out if you get a chance and come back occasionally to see how it evolves over time.

June 8, 2010 Posted by | Cisco UCS | Leave a comment

Cisco Unified Computing Advisory Board (UCAB)

Just finished attending the Cisco Unified Computing Advisory Board (UCAB) in San Jose this week.  It was a great experience to have the opportunity to interact with other production UCS customers from various lines of business, the leadership of the Server Access and Virtualization Business Unit (SAVBU) and many other key Cisco staff focused on the success of the Unified Computing platform.  I am not able to go into many details or specifics on meeting content, however I will try to give you a sense of the what and why of the advisory board.

We spent two solid days focused on customer feedback on our experiences: successes, challenges and what can be improved, as well as providing input on product growth and future directions.  The key takeaway from this part of the event was Cisco's strong commitment to understanding real-world implementations and its desire to continually improve the unified computing experience and product.

As you can imagine, there was also a large focus on educating us on the short-term growth and roadmaps, as well as discussions of longer-term thoughts, designs, etc.  Again this was framed around taking customer feedback to help shape things moving forward.  On this front, I quickly realized Cisco is not standing still and has an amazing vision for what “unified computing” will mean in the future.  The narrow thought that “Cisco is in the server business” quickly gave way to the realization that the server is merely one component of Cisco's unified computing business.  I think it is clear from the reactions and responses seen so far from other server vendors that there is a realization that the future of compute is not just about a server.  That is why you see others scrambling to have their own “unified compute” platform by quickly cobbling existing technologies together and branding it as some form of unified computing.  The overall benefit of this competition is that all compute vendors will get better and continue the push in this direction.

There is no question in my mind that Cisco is in this market as a leader and will be there long term.  I think a lot of organizations are beginning to understand this fact and “get” the benefits and cost savings that UCS brings to the table.

What is also interesting to me is the timing of Cisco's launch and growth of UCS, during the economic turmoil of 2009.  If you think about it, you could not have picked a more challenging economic time to introduce a paradigm shift in computing.  The upside to the timing is that the cost benefits of UCS stood out for us early adopters.  Cisco is continuing to expand its investments in staff, functionality and technology advancements, which is only strengthening the product.

Cool stuff!  Look for the next UCS code release, 1.3, to happen very soon in June 2010, along with an updated ESX 4.x driver for the VIC (Palo card) that works with EMC PowerPath/VE as well.

June 7, 2010 Posted by | Cisco UCS | , | Leave a comment

EMC World 2010: Boston see you there

Next week is EMC World 2010 in Boston and I am fortunate enough to be attending and presenting.  If you are there, check out the following presentations on Cisco UCS in a production healthcare environment:

Monday, May 10, 11:10 AM to 11:25 AM: Cisco Booth Theater Presentation: Cisco UCS Solving Business Challenges – Moses Cone Health System

Tuesday, May 11, 12:25 PM to 12:40 PM: Cisco Booth Theater Presentation: Cisco UCS Solving Business Challenges – Moses Cone Health System

Wednesday, May 12, 3:30 PM to 4:30 PM: General Session: Implementing Cisco Data Center 3.0: Cisco IT and Moses Cone Health System

The 2 sessions in the Cisco booth theater will be a quick overview of our UCS experience, and the full session on Wednesday will be done in conjunction with Sidney from Cisco.  That one will focus on how Cisco IT has implemented and benefited from Cisco UCS, and then I will speak about our experience in more detail.

Hope to see some of you there.

May 7, 2010 Posted by | Cisco UCS, General | , | Leave a comment

UCSM 1.2 Feature: KVM Launch Manager

Here is a cool new feature in the UCSM 1.2.1b code that is going to come in handy.  You get to the Cisco UCS Manager (UCSM) via a web browser.  Now when you hit the main web page you have the option to run the UCSM or something new called the UCS – KVM Launch Manager.

You do not need to give a system admin a login to UCSM; they can get to the KVM securely from this web page now.

So why is UCS – KVM Launch Manager a cool thing? 

We run a few Windows servers directly on UCS B-series M1 blades.  The system administrators of those boxes have to connect to them from time to time, and this is typically done with an RDC connection using the -console switch.  If there is a problem with that approach they would need to log into UCSM and connect to the KVM.  Now, if KVM access is all that a system admin needs for that server, I can have them use the UCS KVM Launch Manager and they can launch a KVM session from a secure web page using their AD login.  This is a nice new feature.

May 1, 2010 Posted by | UCS Manager | , | Leave a comment

UCS Code Upgrade: 24 Seconds Explained

There was a lot of interest in my last post about our code upgrade.  What stood out to many (as well as to us) was the 24 second disruption in network traffic on each fabric interconnect side when it was rebooted.  Meaning, when I upgraded the B side and that fabric interconnect went into its reboot, all of the Ethernet traffic that was talking on the B side was disrupted for 24 seconds (remember, the fiber channel traffic had NO issues).

The answer was in my core 6513 configuration regarding spanning tree.  I would like to thank Jeremiah at Varrow and the guys at Cisco who helped us figure this out. 

It turns out that one of the first configuration confirmation items in the code upgrade process (really, it should have been set up all along . . .) was making sure the port channels that the fabric interconnects connect to are set with spanning-tree portfast trunk.  An email was sent to get this confirmed and configured but it got missed; too bad it was not in the Cisco prerequisite document as a reminder.  What this command gives you is that if and when the trunk port link to a fabric interconnect goes away for any reason, the 6513 will not go through the normal spanning tree timers and will quickly allow the traffic to flow on the remaining path (in our case the remaining connection to the fabric interconnects).
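For anyone wanting to picture it, the change on the upstream switch looks roughly like the snippet below.  This is a hedged sketch, not our actual 6513 config: the port-channel numbers simply mirror the port groups 29 and 30 mentioned in the upgrade post, the descriptions are made up, and the exact syntax can vary by IOS version.

```
! Hedged sketch - interface numbers and descriptions are illustrative only
interface Port-channel29
 description Uplink to UCS fabric interconnect A
 switchport mode trunk
 spanning-tree portfast trunk
!
interface Port-channel30
 description Uplink to UCS fabric interconnect B
 switchport mode trunk
 spanning-tree portfast trunk
```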

We have now enabled spanning-tree portfast trunk on our port channels and should now be positioned to eliminate that pesky 24 second Ethernet disruption that impacted some of the traffic.  Details, details!

May 1, 2010 Posted by | Cisco UCS, UCS Manager | , , | Leave a comment

UCS Code Upgrade Success: Running 1.2(1b) Now!!

I have been blogging about the planned code upgrade to our production UCS environment for a while now, and we have finally cut over to 1.2(1b) on all the system components!  Success.  Here is a quick rundown.

We decided to go with the 1.2(1b) code because the main difference between it and 1.1(1l) was the support for the new Nehalem CPU that will be available in the B200 and B250 M2 blades in about a month.  We want to start running with these new CPUs this summer; more cores means more guests and lower cost. 

The documentation from Cisco on the process was pretty good, with great step-by-step instructions and notes on what to expect.  We followed it closely and did not have any issues; everything worked as expected.

We have upgraded the production UCS system!

Here is how we did it:

The first step was to perform a backup of the UCS configuration (you always want a fallback, but we did not need it).

We started with upgrading the BMC on each blade via the Firmware Management tab; this does not disrupt the servers and was done during the day in about 30 minutes.  We took it slow on the first 8 BMCs and then did a batch job for the last 8.

At 6 PM we ran through the “Prerequisite to Upgrade . . .” document a second time to confirm all components were healthy and ready for an upgrade; no issues.  Next we confirmed that all HBA multipath software was healthy and seeing all 4 paths, and that NIC teaming was healthy; no issues.

At 6:30 PM we pre-staged the new code on the FEX (IO modules) in each chassis.  This meant we clicked “Set Startup Version Only” for all 6 modules (2 per chassis across 3 chassis).  Because we checked the box for “Set Startup Version Only” there was NO disruption to any servers; nothing is rebooted at this point.

At 6:50 PM we performed the upgrade to the UCS Manager software, which is a matter of activating it via the Firmware Management tab.  No issues, and it took less than 5 minutes.  We were able to log in and perform the remaining tasks listed below when it was complete.  Note, this step does NOT disrupt any server functions; everything continues to work normally.

At 7:00 PM it was time for the stressful part, the activation of the new code on the fabric interconnects, which results in a reboot of the subordinate side of the UCS system (the B side in my case).  To prepare for this step we did a few things, because all the documentation indicated there can be “up to a minute disruption” of network connectivity during the reboot (it does NOT impact the storage I/O; the fiber channel protocol and multipathing take care of it).  I believe this disruption is related to the arp-cache on the fabric interconnects; here is what we experienced.

UCS fabric interconnect A is connected to the 6513 core Ethernet switch on port group 29, and UCS fabric interconnect B is connected on port group 30.  During normal operation the traffic is pretty balanced between the two port groups, roughly 60/40.

My assumption was that when the B side went down for the reboot, we would flush the arp-cache for port group 30 and the 6513 would quickly re-learn that all the MAC addresses now reside on port group 29.  Well, it did not actually work like that . . . when the B side rebooted, the 6513 cleared the arp-cache on port group 30 right away on its own, and it took about 24 seconds (yes, I was timing it) for the disrupted traffic to start flowing via port group 29 (the A side).  Once the B side finished its reboot in 11 minutes (the documentation indicated 10 mins.), traffic automatically began flowing through both the A and B sides again as normal.

So what was happening for those 24 seconds?  I suspect the arp-cache on the A side fabric interconnect knew all the MACs that had been talking on the B side, so it would not pass that traffic until the entries timed out and were relearned.

As I have posted previously, we run our vCenter on a UCS blade using NIC teaming.  I had confirmed vCenter was talking to the network on the A side, so after we experienced the 24 second disruption on the B side I forced my vCenter traffic to the B side before rebooting the A side.  This way we did not drop any packets to vCenter (we did this by disabling the NIC in the OS that was connected to the A side and letting NIC teaming use only the B side).

This approach worked great for vCenter; we did not lose connectivity when the A side was rebooted.  However, I should have followed this same approach with all of my ESX hosts, because most of them were talking on the A side.  VMware HA did not like the 27 second disruption and was confused afterwards for a while (however, full HA did NOT kick in).  All of the hosts came back, as well as all of the guests except for 3: 1 test server, 1 Citrix Provisioning server and 1 database server had to be restarted due to the disruption in network traffic (again, the storage I/O was NOT disrupted; multipath worked great).

Summary:

Overall it went very well and we are pleased with the results.  Our remaining task is to apply the blade BIOS updates to the rest of the blades (we did 5 of them tonight) using the Service Profile policy — Host Firmware Packages.  These will be done by putting each ESX host into Maintenance Mode and rebooting the blade.  It takes about 2 utility boots for the update to take effect, or about 10 minutes per server.  We should have this done by Wednesday.

What I liked:

—  You have control of each step, as the admin you get to decide when to reboot components.
—  You can update each item one at a time or in batches as your comfort level allows.
—  The documentation was correct and accurate. 

What can be improved:

—  Need to eliminate the 24 to 27 second Ethernet disruption, which is probably due to the arp-cache.  Cisco has added a “MAC Address Table Aging” setting in the Equipment Global Policies area; maybe this already addresses it.

April 27, 2010 Posted by | Cisco UCS, UCS Manager | | 4 Comments

Evolving Healthcare IT and how do you adapt?

To prep for the Cisco panel discussion, Ben Gibson asked what my role was at my organization.  The answer was something like "it started out as managing the network infrastructure: the cabling, LAN/WAN, servers, etc.  Then it continued to evolve into virtualization and storage to where I am now."  So I now refer to my role as the manager of technology infrastructure.  What Ben said after my explanation has really got me thinking . . . he simply said something like "oh, just like the way the industry has evolved."

Ben is right.  In my healthcare organization my role and responsibilities evolved over time as the technologies changed.  Given our IT department size, this was partly due to having a small head count and needing to do more with the staffing we already had in place.  This approach can be challenging at times, but it also lends itself to being creative and expanding your knowledge by trying and doing new things, which in the end can be downright fun at times (the opposite can also be true :)).

The benefit our organization has gained from the way we have consolidated and managed the different staffing roles as we changed technologies is that it has made us more flexible and able to move forward.  I am not saying this is easy or that it does not take work to make happen, but looking back at how we got to where we are has made me think about it.

As I have been meeting and talking with people from other organizations, I have seen a wide range of comfort, adoption and acceptance of industry change.  For example, we gained a comfort level with VMware DRS and vMotion very early on and have been letting VMware decide where a server workload should reside by allowing it to automatically move servers between hosts.  I am surprised when I hear of others who still “balance” their VM workloads by checking DRS recommendations manually and then vMotioning the workload.  Or worse, talking to a network switch vendor’s “virtualization expert” and hearing him say virtualization is still in its infancy; that comment/belief saved me time in the end and made my investigation of that vendor shorter.

To take the thought further, one of the keys to reducing costs and complexity and increasing your flexibility is to really start converging and maximizing your datacenter resources.  It becomes very difficult to realize these savings and efficiencies if your staff/groups/teams do not work together.  You need to ask yourself: does my storage team talk with my server team?  Does the network admin know what our virtualization guy is doing?  Where I have seen a storage group that works in its own bubble, the organization begins to struggle, spinning its wheels and wasting resources and dollars.

So why this topic for a blog post?  Yes, my team and organization get to use a lot of cool and exciting technology that makes our jobs fun at times, saves the organization time and money and has made us flexible and agile, but it did not just happen.  You have to change as the technology changes; if you can do that, you will be on the right track.

Disclaimer:  No workplace is perfect and my organization is far from perfect, but it is pretty damn good to be here.

Addendum:  My director read this post and reminded me “It’s certainly worthy of bringing that weakness (poor communication) to light.  If you think about it we unfortunately have some of the same symptoms.” 

That is true; communication is always a work in progress, whether in professional or personal life, and there is always room for improvement!

April 14, 2010 Posted by | Healthcare IT General | | Leave a comment

San Jose: That’s a Wrap

Well, that was a great day!  When I was invited to represent my organization at the Cisco Datacenter Launch to talk about our UCS experience, I was humbled, excited and nervous.  It is not often in someone's career that you have the opportunity to be included on a panel with such innovative leaders in the technology industry as David Lawler, Soni Jiandani, Boyd Davis and Ben Gibson.  Everyone was down to earth, personable and very comfortable to work with on the panel.  The goal was to make the event a relaxed discussion, and the point of view of the customer was truly important to the panel and to Cisco.  I was also amazed at how many people it takes to pull the details together for an event of this nature.  Lynn, Janne and Marsha were great, making sure I was prepared and helping everything go off smoothly.

This customer focus continued to be evident after the video was completed.  I was able to spend the rest of the day with many key individuals from the UCS business unit who made time for me.  We had some deep technical discussions on various topics like firmware upgrades, wish lists, directions, ease of use vs. levels of control, etc.  Everyone asked me for input on ways to improve, as well as asking how we are using the system.

To end my day on the Cisco campus, David Lawler invited me to his office to meet with him and Mario Mazzola, Senior Vice President of the Server Access and Virtualization Business Unit (SAVBU).  Mario has been a key technology person in Silicon Valley, leading the creation of the 6500 switch product, the Cisco MDS fiber channel product and now the Cisco UCS platform (along with many other accomplishments).  I think it is fair to say he is a legend in the industry (however, my impression is he is very humble and quickly acknowledges others for their contributions to the projects).  We had a conversation focused on customer views of the product and how it is Cisco's goal to continually improve the system.  Mario and David are very down to earth people, and it was clear to me Cisco is very customer focused from the top of the organization down.

Mario Mazzola, myself and David Lawler

So that’s a wrap for this trip to Cisco in San Jose for now . . .

April 6, 2010 Posted by | Cisco UCS, General | | 1 Comment