Today was a cool day. We set up a fiber channel port-channel for the connection between our 6120 and the MDS 9148 switches. Then I received my first B230 blade all tricked out!
It took me all of a minute to rip open the box and pull the cover off. Man, 32 DIMM slots on a half blade, what a pretty sight.
Yes, those are 32 – 8 GB memory DIMMS for a total of 256 GB of memory. Also, we have a pair of Intel Xeon L7555 CPUs each with 8 cores!
So our plan next week is to zone the blade to the new VMAX, load VMware ESXi 4.1 and start messing around with it.
This stuff is exciting.
Here are a few things that have jumped out so far that I like with the 1.4 code:
– Support for the new B230 blades; half size blade with 2 sockets and 32 DIMM slots (16 cores and 256 GB memory!!)
– Able to manage the C-Series servers via UCS Manager (the rack servers)
– Power capping to cover groups of chassis; this is very powerful now. Think about it, you can have 4 chassis in 1 rack all sharing 2 or 4 – 220 circuits. Now you can cap, monitor and manage the amount of power by groups of chassis not just per blade or chassis.
– Software packaging for new server hardware that does NOT require the IO fabric upgrade to use the new servers. Nice!
– Bigger and Badder! Support for up to 1024 VLANS and up to 20 UCS chassis (160 servers from 1 management point!!).
– Fiber Channel connectivity options, now can do port channeling and FC trunking as well as some limited direct connection of FC based storage (no zoning ability . . yet).
OK the list goes on and on, they have packed a lot into this release.
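The headline numbers in the list above are easy to sanity-check. Here is a minimal Python sketch of the arithmetic; the constant and function names are my own, and the 8-blades-per-chassis figure is an assumption derived from the "20 chassis, 160 servers" numbers rather than something stated outright.

```python
# Figures quoted in the post: 32 DIMM slots, 8 GB DIMMs, 2 sockets,
# 8 cores per socket, and up to 20 chassis per management domain.
DIMM_SLOTS = 32
DIMM_SIZE_GB = 8
SOCKETS = 2
CORES_PER_SOCKET = 8
MAX_CHASSIS = 20
BLADES_PER_CHASSIS = 8  # assumption: 160 servers / 20 chassis


def blade_memory_gb(slots=DIMM_SLOTS, dimm_gb=DIMM_SIZE_GB):
    """Total memory of one fully populated blade, in GB."""
    return slots * dimm_gb


def blade_cores(sockets=SOCKETS, cores_per_socket=CORES_PER_SOCKET):
    """Total physical cores of one blade."""
    return sockets * cores_per_socket


def domain_servers(chassis=MAX_CHASSIS, blades=BLADES_PER_CHASSIS):
    """Number of half-width blades behind one management point."""
    return chassis * blades


if __name__ == "__main__":
    print("Memory per B230 (GB):", blade_memory_gb())
    print("Cores per B230:", blade_cores())
    print("Servers per UCS domain:", domain_servers())
```

Swap in 16 GB DIMMs or a different chassis count to see how the totals move.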
Checking out the new items in UCSM, I had to grab a few screen shots of the following:
Power! You ever wonder how much power a server or chassis is using? Now you know, check this out! I am loving this.
For those UCS users out there, it has not always been very clear what impact various changes to a Service Profile might have on the workload. They have improved with each release, but this is some great detail now:
Cool stuff in Cisco UCS 1.4 code, I hope to have more time to share with everyone as we continue to maximize our investment. Time to go home . . .
I am geeked! We just completed the code
upgrade on our production Cisco UCS environment and it was
We have been in production on Cisco UCS for 1
year and 22 days now and have run on 1.1, 1.2, 1.3 and now 1.4
code. So today was our 3rd code upgrade process on a
production environment and each time things have gotten better and
cleaner. Why am I so excited? Think about it . . . with UCS
there are 2 Fabric Interconnects with some chassis hanging off of
them with a bunch of servers all using a single point of
management. Everything is connected with redundancy and if
all of that redundancy is operational and live you truly can reboot
half your IO fabric and not drop a ping or storage connection.
In a storage world this is standard and expected but in a
server blade world you would think to accomplish the same level of
high availability and uptime provided by a SAN, there
would have to be a lot of complexity. Enter Cisco UCS! An
hour ago we upgraded and rebooted half our IO infrastructure that
serves over 208 production VM Guest Servers running on 21 VMware
ESX hosts and another 8 Windows server (all running active SQL or
Oracle databases) blades without dropping a packet. Then I
did the same thing to the other IO infrastructure path with NO
ISSUES. This is just badass. I suspect in a year this
type of redundancy and HA level in an x86 server
environment will be an expectation and not an exception.
UCS Code Upgrade Experiences:
In March 2010 we
performed the first upgrade while in production to 1.2 code (you
can check out my blog post for all the details). The major
impact we experienced with this one was due to a human issue; we
forgot to enable spanning tree port fast for the EtherChannels
connecting our Fabric Interconnects. Our fault, issue fixed,
move on. In December 2010 we implemented 1.3 code for a few reasons
mainly related to ESX 4.1 and Nexus 1K. Our only issue here
was with 1 Windows 2003 64-bit server running on a B200 blade with
OS NIC Teaming which failed to work correctly. Again,
not a UCS code issue but a server OS teaming issue. We had 3
servers using NIC Teaming in the OS, so we decided to change these
servers to hardware failover mode provided in UCS instead of in the
OS. Changes made, ready to move on. It just so happened on
the same day we did the 1.3 upgrade Cisco released 1.4 code just in
time for Christmas (thanks SAVBU). This time we had all our
bases covered and each step worked as expected; no spanning tree
issues, no OS NIC Teaming problems, it was smooth! There was
some risk with moving to the new code so fast, but we have several
projects that are needing the new B230 blades ASAP. There are
several UCS users and partners that have already been going through
1.4 testing and things have been looking very good. Thanks to
all who provided me with feedback over the last week.
Features and Functions:
Now we get to dig into all the
new cool and functional features in the new code. I am
impressed already. I will put together a separate post with
my first impressions. I do want to point out one key thing that I
referenced above; the need to upgrade the infrastructure to use new
hardware (B230 blades). Now that I am on 1.4 code this
requirement is gone. Yep, with 1.4 code, they have made
changes that will NOT require an upgrade of the IO infrastructure
(Fabric Interconnects and UCS Manager) to use new hardware like a
B230. So yes, things are sweet with Cisco UCS and it just got
Wow, we are at the 1 year mark of having Cisco UCS in our environment which is now our standard, what a year it has been. I was fortunate enough to present to a lot of people at a few conferences, sit on some expert panel discussions as well as the Unified Compute Advisory Board and talk to a lot of new customers during their investigation periods into UCS. It has been satisfying to hear the majority of reference customers I have talked with decided to go with Cisco UCS. There are even a few that are blogging about it!
I figured it is time to update everyone on the current configuration build we are using. I think back to when we started with VMware on 3.5 and how much more complex it all seems now but with that complexity we all have gained greater control, cost savings and agility.
Cisco UCS B200 M1 or M2 blade with 2 CPU, 96 GB memory, Cisco VIC (Palo) interface card.
Cisco Nexus 1000v distributed virtual switch.
EMC PowerPath/VE for fiber channel multi-pathing.
VMware 4.1 Enterprise Plus (not yet using 4.1i, but soon) with Enhanced vMotion enabled.
So what has changed for us since last December when we went into production on UCS? Well, Cisco, VMware and EMC keep creating new technology, and it all still fits and works together.
The release of the M2 blade brought with it the Intel Westmere CPU and 6 cores. For about the same price point we were able to add 50% more processing power in a half size blade. Then by enabling Enhanced vMotion in our VMware cluster these new M2 blades work seamlessly with the M1 blades.
Cisco VIC (Palo) mezzanine interface card was released and has provided us with a more flexible means to implement converged networking at the host level. There are a few things you can do with the Virtual Interface Card but one of the main advantages we have incorporated is carving out additional Ethernet interfaces on our ESX hosts. So what, you might say? For example, we created 2 NICs for ESX management which reside outside of the virtual distributed switch that services our guest traffic. This simplifies our build process and allows us to control and manage the host if there is an issue with the vDistributed Switch.
Cisco Nexus 1000v has been around for a little while but we have now implemented it to really bring the networking side of the virtual environment out to the control of our network engineers. As our environment has grown the desire and need to have visibility into our guest server traffic has increased. The N1KV has already been helpful in troubleshooting a few issues and we are likely to be implementing some QoS in the near future. Note, when you pair the QoS functions within UCS, the VIC and Nexus 1000v you have a very powerful tool for implementing your key compute functions in a virtual environment. You can guarantee your service levels and have confidence in the implementation.
EMC PowerPath/VE development has continued and our driver issue with the VIC has been resolved for a while. The coolest new thing here is on the ESXi front, PP/VE is now supported with boot from SAN (that will be our next step moving forward).
VMware ESX 4.1 & 4.1i keeps us current on the newest tools and optimizations.
As you know, IT and technology are very dynamic, and we are already planning on changing things within 60 days by going to an ESXi build with boot from SAN and implementing the new UCS 1.4 code so we can use many of the new UCS features, which include the new B230 blades with 16 cores and 256 GB memory all in a half size blade. Yes, all this within 60 days. I can’t wait to see the workloads the B230 will handle. Oh, and we will also throw in a new EMC VMAX to push the performance level even higher.
IT in healthcare has a growing demand for performance, agility and uptime and the above technologies are what will allow organizations to handle the changes. Hang on tight it is going to be a fun ride.
Here are a few pictures for you to see at the below link of our new Cisco rack mount server, a 2U C210 M2 Server. It has 2 Westmere 6 core CPUs, 48 GB memory, LSI MegaRAID 9261 controller, 2 – 146 GB 10K RAID 1, 4 – 146 GB 10K RAID 5 and 1 hot spare disk.
CIMC: Cisco Integrated Management Controller
We are in the process of installing Windows 2008 64 bit R2 for the OS. I personally have not been doing the install but I have seen the CIMC: Cisco Integrated Management Controller (version 1.1.1). You can think of this like the iLO from HP or DRAC from Dell.
I wish there was more information in this tool, more like what I am used to in the full UCS Manager. For example, we needed to load the W2K8 drivers for the RAID controller and could not remember the model number. We looked in the inventory of the CIMC and it only listed the disks in the storage section. I had to go back to my order information to determine we purchased it with the LSI MegaRAID 9261.
Checking out more of the CIMC, there is a good amount of information and function in here. You can mount virtual media, remote to the console, configure LDAP for authentication, define alerts, update firmware, etc. So there is what you would expect to find in a component like this for a rack server.
Thinking down the road, I can see value in having the ability to manage my C-Series servers via the UCS Manager tool. I would hope that will bring with it greater info on hardware details, service profiles and hardware abstraction as well?
I have been waiting for this product announcement and I am very excited to see it is now on the Cisco web site. Once you start to check out the spec on this new server I think you too will think that it is amazingly cool and a sweet addition to the Cisco unified computing product line.
Our current production “state of the art” blade configuration has been a B200 M2 blade with 96 GB memory and 12 cores, which works out to 768 GB of memory and 96 cores in one 6U chassis.
The new B230 blows the current capacity of a 6U chassis out of the water. Check this out, with the initial release (they will add support for the 16 GB DIMM later) you can get 2048 GB of memory and 128 cores in one chassis! That is badass in my book.
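To make the chassis comparison concrete, here is a small Python sketch of the math. The class and function names are hypothetical, the 8-blades-per-chassis value is inferred from the totals quoted above (768 GB / 96 GB per blade), and the per-blade B230 figures are simply the chassis totals divided by 8.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Blade:
    """Per-blade capacity; names and fields are illustrative only."""
    name: str
    memory_gb: int
    cores: int


BLADES_PER_CHASSIS = 8  # assumption: inferred from 768 GB / 96 GB per blade


def chassis_totals(blade, blades=BLADES_PER_CHASSIS):
    """Return (memory_gb, cores) for a chassis filled with one blade type."""
    return blade.memory_gb * blades, blade.cores * blades


b200_m2 = Blade("B200 M2", memory_gb=96, cores=12)
b230_m1 = Blade("B230 M1", memory_gb=256, cores=16)

if __name__ == "__main__":
    for blade in (b200_m2, b230_m1):
        mem, cores = chassis_totals(blade)
        print(f"{blade.name}: {mem} GB and {cores} cores per chassis")
```

Run as a script, it prints one line per blade type, so it is easy to add other configurations for comparison.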
Sean McGee, Cisco UCS engineer, has posted a blog with a great table breakdown comparison with 3 competitor blade products, you should check it out.
The link to the official Cisco page on the product:
A few things that stand out to me:
– 32 DIMM slots in a half size blade (more memory footprint)
– Intel Xeon 6500 or 7500 series processors (more cores available in a 2 socket box)
– Optional 2 SSD disk drives (the size of the SSD allows for more room for DIMM slots, nice)
– Many options for Interface cards (including the Cisco VIC for virtual interfaces)
This thing was built with virtualization and large workloads in mind. I will definitely use the B230 M1 for new ESX hosts and I can see purchasing a few for large SQL workloads that require a physical server (of course I will use the hardware abstraction functionality in the Service Profile to maximize the flexibility).
The physical design of this blade is impressive as well. They have used SSD drives to reduce the physical space needed for local disk which gives you more room for DIMM slots. Personally, I like this move mainly because of more memory but also because I think it will push more people away from local disk and to boot from SAN. It is when you are able to utilize boot from SAN the power of Cisco UCS can shine, using the hardware abstraction in service profiles.
I have seen a few twitter questions about what code version will be required for this new blade. Based on a few things listed in the Cisco datasheet, I believe it will be code version 1.4 which should be available next quarter.
Another question I have is when can it be ordered and shipped. Oh yeah, pricing will be important too.
I have been busy lately, but this weekend I was logged into YouTube and was suggested some new videos that had been posted called Cisco UCS Whiteboard Series. I looked closer and realized Jeremiah from Varrow has put together what turns out to be a great series: five videos describing the pros and cons of 4 common approaches to implementing compute in a data center.
What I like about the videos is Jeremiah’s common sense approach to breaking down the evolution that has occurred moving from the stand alone rack server, to traditional blade systems, unified fabric/converged network to Cisco UCS. I recommend this series to anyone looking for more information about this topic.
Part 1: Intro: http://youtu.be/KDpwnojbBwY
Part 2: Traditional Srvs: http://youtu.be/3i8rcjBFRgE
Part 3: Traditional Blades: http://youtu.be/5yWKMQwr99s
Part 4: Traditional Srv w/ Unified Fab: http://youtu.be/LhEZl_PwDFM
Part 5: Cisco UCS: http://youtu.be/amLXLWn2qOQ
A common question I get when talking with others about our Cisco UCS production environment is if we have had any issues that required us to deal with Cisco TAC. Like with anything we have had a few things that required a call. By the way, the phone number is the same for any Cisco product. Here are a few examples for you.
One of our first calls had to do with a failed 8 GB DIMM in one B200 M1 server blade. We noticed a warning light on the blade and went to the UCS Manager to investigate. We were able to quickly drill down to the affected blade’s inventory and go to the memory tab. This screen provided the details of the failed DIMM’s slot location and confirmed its failed status. Since the workload running on this blade was VMware ESX we put it into maintenance mode, powered down the blade and replaced the DIMM with a spare. It was time to open a ticket with TAC.
The TAC engineer took down our information and sent out a replacement DIMM within 4 hours and we were done with the ticket. I asked our server person what he thought of dealing with TAC and he did not expect it to be that easy. Typically in the past with other server vendors we would have had to run a diagnostic tool to determine which DIMM and then open a trouble ticket. We would have to down the server, re-seat the DIMM, and wait for it to fail again. Once it failed again then we would get a replacement. So this call process with Cisco seemed to be smoother.
Another trouble ticket was related to a VMware ESX host that, post a reboot, would not see the boot partition. After some troubleshooting, it clearly was an ESX OS issue and our VMware admin was ready to re-image the server. However, we thought this would be a good test for Cisco TAC so we opened a ticket. We were surprised when TAC gave the case to an ESX server person at Cisco who within 20 minutes had resolved the issue and the server was back in production. So our expectations were exceeded again.
The one trouble ticket that took some time was when we wanted to install Windows 2003 standard 64 bit bare metal on a blade with the Emulex interface card. This is easy to do with Windows 2008, however the challenge was getting the right drivers on a media type that the Windows 2003 installation process could recognize. It wanted to see the drivers on either a CD or floppy disk which you provided by emulating the media. I personally did not work this ticket but it took over 3 days to get everything completed. In the end we now have a process down and 2 servers in production.
Overall, Cisco has exceeded our expectations when it comes to dealing with trouble tickets around the UCS products successfully. It has been clear to us that Cisco has put the resources into support and has the right folks in place to deal with a variety of potential issues customers may run into.
Just finished attending the Cisco Unified Computing Advisory Board (UCAB) in San Jose this week. Great experience to have the opportunity to interact with other production UCS customers from various lines of business, the leadership of the Server Access and Virtualization Business Unit (SAVBU) and many other key Cisco staff focused on the success of the Unified Computing platform. I am not able to go into many details or specifics on meeting content, however I will try and give you a sense of the what and why for the advisory board.
We spent two solid days focused on customer feedback on our experiences: successes, challenges and what can be improved, as well as getting feedback on product growth and future directions. The key take away from this focus of the event was Cisco’s strong commitment to understanding the real world implementations and the desire to continually improve the unified computing experience and product.
As you can imagine there was also a large focus on educating us on the short-term growth and roadmaps as well as discussions on longer-term thoughts, designs, etc. Again this was framed around taking customer feedback to help shape things moving forward. On this front, I quickly realized Cisco is not standing still and has an amazing vision for what “unified computing” will mean in the future. The narrow thought of “Cisco is in the server business” quickly gave way as it became clear that the server is merely a component of Cisco driving the unified computing business. I think it is clear from the reaction and responses seen so far from other server vendors that there is a realization the future of compute is not just about a server. That is why you see others scrambling to have their own “unified compute” platform by quickly cobbling existing technologies together and branding some form of unified computing. The overall benefit of this competition is that all compute vendors will get better and continue the push and move in this direction.
There is no question in my mind that Cisco is in this market as a leader and will be there long term. I think a lot of organizations are beginning to understand this fact and “get” the benefits and cost savings that UCS brings to the table.
What is also interesting to me is the timing of when Cisco executed the launch and growth of UCS, during the economic turmoil of 2009. If you think about it, you could not have picked a more challenging economic time to introduce a paradigm shift in computing. The up side to the timing is the cost benefits of UCS stood out for us early adopters. Cisco is continuing to expand their investments in staff, functionality and advancements in technology, which is only strengthening the product.
Cool stuff! Look for the next UCS code release 1.3 to happen very soon in June 2010 and an updated ESX 4.x driver for the VIC (Palo card) that works with EMC PowerPath/VE as well.
There was a lot of interest in my last post about our code upgrade. What stood out to many (as well as us) was the 24 second disruption in network traffic on each fabric interconnect side when it was rebooted. Meaning I upgrade the B side and when that fabric interconnect goes into its reboot all of the Ethernet traffic (remember the fiber channel traffic had NO issues) that was talking on the B side is disrupted for 24 seconds.
The answer was in my core 6513 configuration regarding spanning tree. I would like to thank Jeremiah at Varrow and the guys at Cisco who helped us figure this out.
Turns out that one of the first configuration confirmation items in the code upgrade process (really it should have been set up all along . . .) was making sure the port channels that the fabric interconnects are connected to are set with spanning-tree portfast trunk. An email was sent to get this confirmed and configured but it got missed; too bad it was not in the Cisco Pre-Requisite document as a reminder. What this command gives you is that if and when the trunk port link to the fabric interconnect goes away for any reason, the 6513 will not go through the normal spanning tree timers and will quickly allow the traffic to flow on the remaining path (in our case the remaining connection to the fabric interconnects).
We have now enabled spanning-tree portfast trunk on our port channels and should be positioned now to eliminate that pesky 24 second Ethernet disruption that impacted some of the traffic. Details, details!
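For anyone who wants to script this check into their own pre-upgrade list, here is a rough Python sketch that renders the interface lines for a set of port channels. The function name and the port-channel numbers (10 and 11) are made up for illustration, and the exact command syntax can differ between IOS releases, so verify it against your own switch documentation before pasting anything.

```python
def portfast_trunk_config(port_channel_ids):
    """Render IOS-style lines that enable portfast trunk on each port channel.

    The port channel numbers are supplied by the caller; nothing here talks
    to a switch, it only builds text to review and paste.
    """
    lines = []
    for pc in port_channel_ids:
        lines.append(f"interface Port-channel{pc}")
        lines.append(" spanning-tree portfast trunk")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical uplinks: Po10 to fabric interconnect A, Po11 to B.
    print(portfast_trunk_config([10, 11]))
```

Keeping the output as plain text means the network team can review it before it goes anywhere near the core.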