Elephant in the DRS room
Posted Friday, July 10, 2009 in Technology 6 comments
This post was triggered by the latest VMware article on DRS performance. I have been thinking and talking about this subject ever since I first saw a DRS demo at VMworld in September 2008 in Las Vegas.
DRS (Distributed Resource Scheduler) is an awesome piece of technology. Dynamic load balancing in distributed systems is a non-trivial task and I applaud the DRS team. A lot of the discussion about DRS is centered on scalability: things like how many physical servers and how many guest VMs (virtual machines) can be managed under DRS. However, I would like to draw attention to the elephant in the room that nobody seems to be noticing. Elephant, thy name is storage.
Let's take a closer look at the configuration described in the article. (I am talking only about the vSphere cluster with the servers and storage, not the load generators and other supporting elements):
ESX Hosts (4) : HP DL380 4 Dual socket, quad core Intel Xeon 5450 3.0GHz; 32GB memory; dual port QLogic QLE2462 HBA.
After 15 minutes on the HP website I could not get at the exact configuration described here so I had to spend about 45 minutes in instant chat with HP sales and they came up with a quote of $6,691 discounted to $5,600 if I buy four of them on my credit card immediately. For the sake of simplicity let's assume $6,000 per server including the Fibre Channel (FC) controllers.
Think about it: $6,000 buys vast amounts of computing power today. What's even more important is that I can get exactly the same power from more than one source. I am sure that my Dell sales rep would be happy to beat the HP quote by offering an equivalently configured Dell server that is 10-15% cheaper. And I know that the Dell servers will run exactly the same system and application software. In fact, I can mix and match Dell, HP and my favorite whitebox servers in the same configuration and DRS will not (and should not!) know the difference.
This is a dramatic achievement of the last 10-15 years - data center servers are true commodities and the prices reflect that.
Now let's look at the storage used in this benchmark:
Storage (1) : CX 4-960 with 188 15K rpm FC disks.
First, notice that the disk size is not specified, so let's make some educated guesses. Getting a quick quote for a high-end storage system is normally a much more involved exercise, so I am still waiting for a few sales reps to get back to me. Almighty Google found a reference point at the Reliant Technology website: a Clariion system with 48 15K 300GB FC drives for the list price of $157,500. So I figure that getting the 188 drives specified in the DRS article would take four of these, yielding a total price tag of more than $600,000! Now, I understand that hardly anybody pays list prices for such systems - but more than half a million dollars!? No wonder the DRS benchmark is listed as a joint EMC/VMware project. It would be very hard – even for VMware – to implement this project as a pure research exercise.
Something is seriously wrong with this picture! Here we have four enormously powerful servers at $6,000 each ($24,000 total), connected to more than half a million dollars worth of disks.
Does anybody building data center infrastructure think it makes any sense to spend 25 times as much on storage as on servers?
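For anyone who wants to check the arithmetic, here is a rough sketch in Python. It uses only the figures quoted above: the rounded $6,000 per server and the $157,500 list price for a 48-drive system. The variable and function names are mine, and the assumption that capacity can only be bought in whole 48-drive systems is a simplification, not a real EMC price model.

```python
import math

# Figures taken from the post; everything else is an illustrative assumption.
SERVER_PRICE = 6_000         # rounded per-server price, including FC controllers
SERVER_COUNT = 4             # ESX hosts in the benchmark
SYSTEM_LIST_PRICE = 157_500  # list price of the 48-drive Clariion reference point
DRIVES_PER_SYSTEM = 48
DRIVES_NEEDED = 188          # 15K rpm FC disks in the benchmark

def storage_estimate(drives_needed=DRIVES_NEEDED):
    """Return (systems needed, total list price), assuming whole systems only."""
    systems = math.ceil(drives_needed / DRIVES_PER_SYSTEM)
    return systems, systems * SYSTEM_LIST_PRICE

systems, storage_cost = storage_estimate()
server_cost = SERVER_PRICE * SERVER_COUNT

# Ratio with the exact list-price estimate, and with the rounded $600K figure.
ratio_exact = storage_cost / server_cost
ratio_rounded = 600_000 / server_cost

print(f"{systems} systems, ${storage_cost:,} storage vs ${server_cost:,} servers")
print(f"storage/server ratio: {ratio_exact:.2f} (exact), {ratio_rounded:.0f} (rounded)")
```

Rounding the storage estimate down to $600,000 is where the "25 times" figure comes from; the unrounded list-price estimate is slightly higher.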
The test cases are configured to reach specific levels of CPU load and, as far as I know, DRS uses CPU and memory consumption as inputs to its policy. We are obviously not dealing with a disk-bound configuration. We have a few inexpensive commodity servers connected to a MAINFRAME-style storage system.
So DRS has demonstrated that it can achieve "15 - 47% gains in aggregate performance" in a CPU bound configuration when $24K worth of servers are driving $600K worth of disks. Is this interesting technology? Yes. Does it make economic sense? I'm not so sure.
What if you could reduce storage costs by a mere 10%, which is trivially easy to do? With the money saved, you could buy 10 additional servers for an additional 250% compute performance. In that config, you don't need DRS because you're so over-provisioned on server power, and I'd bet your apps will run even faster than the benchmark.
Now what if you could cut the storage costs by 50%, which is also trivial to accomplish? With the money saved, you could add a headcount to IT (or save your own job!) with lots of money left over for a spectacular offsite party for the whole team.
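The two what-if scenarios are simple enough to sketch in a few lines of Python. This assumes the rough numbers used in this post ($600,000 of storage, $6,000 per server, four servers in the benchmark) and assumes, very loosely, that compute capacity scales linearly with server count. The function name and parameters are made up for illustration.

```python
STORAGE_COST = 600_000  # rough storage estimate from this post
SERVER_PRICE = 6_000    # rough per-server price from this post
BASE_SERVERS = 4        # ESX hosts in the benchmark

def savings_scenario(cut_percent):
    """Return (dollars saved, extra servers that buys, extra compute in percent).

    Assumes compute scales linearly with the number of servers, which is
    a simplification for the sake of the argument.
    """
    saved = STORAGE_COST * cut_percent // 100
    extra_servers = saved // SERVER_PRICE
    extra_compute_pct = 100 * extra_servers / BASE_SERVERS
    return saved, extra_servers, extra_compute_pct

for cut in (10, 50):
    saved, extra, pct = savings_scenario(cut)
    print(f"{cut}% storage cut: ${saved:,} saved, "
          f"{extra} extra servers, +{pct:.0f}% compute")
```

A 10% cut frees up $60,000, which is the 10 additional servers and the additional 250% mentioned above. A 50% cut frees up $300,000.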
Before anyone complains that I'm being unfair here, let me acknowledge that my numbers are rough. But I could be off by a lot and the point is still valid. It might also be argued that the storage configuration here was a "benchmark special" and would never be used in a real environment. To which I would respond, uh yeah, that's correct. But then we're begging the question about what the whole point of the benchmark was. It's safe to assume that there was a good reason why VMware/EMC configured that many expensive spindles for the benchmark.
Here lies the problem. DRS - or manual load balancing for that matter - can be effective only if you can reliably predict the performance impact of migrating virtual machines from one physical server to another. VMware has done an excellent job in providing the tools for this task for cases where the performance is determined ONLY by CPU and memory resources. All the load balancing examples you read about are valid only when the storage I/O channel is grossly over-provisioned.
Note, it is not about the storage capacity, it is about the channel capacity - both in terms of bandwidth and latency. I suspect that in order to achieve the same level of performance described in the article in a carefully managed static configuration, one would need less than half of the total I/O channel capacity (and less than half of the total storage cost). But there is no way to make such a reduction in a dynamic configuration, so we have to over-provision.
To be fair, VMware is the absolute leader among all the virtualization vendors. DRS is the state of the art today, and no other vendor can come close. DRS limitations come from the limitations of the existing storage stack in the virtual environment.
The only way to provide predictable performance and quality of service in virtual servers today is to throw too much money into storage hardware - either in terms of a large number of expensive high-performance spindles, or a complex (i.e. expensive) array with lots of internal processing power such as a V-MAX class device. In terms of total cost the results would be very similar – your storage is about an order of magnitude more expensive than your servers.
It shouldn't have to be that way. It is time for storage for virtual servers to go the way of the servers themselves – from proprietary mainframes to commodity. The key to that transition is storage software specifically designed for virtual servers. Then VMware will be able to repeat the DRS benchmark on a system where the costs of storage and servers are much more balanced.





Comments
gopal 2:28pm PST on July 28th, 2009
for pricing try
http://storagemojo.com/storagemojos-pricing-guide/
VirtVeteran 11:26pm PST on July 31st, 2009
wait for the IO DRS in the coming years, which will solve the exact problem you have mentioned..
TF 4:11pm PST on August 5th, 2009
So I see the problem statement—somewhat lengthy here, but boiled down to: virtualization reduced server costs, why not storage—but what are the problems with the alternatives that VMWare partners with? Whether iSCSI via Left Hand or maybe those who do thin-provisioning like Compellent?
Alex Miroshnichenko 4:22pm PST on August 5th, 2009
“IO DRS” as you call it won’t happen without radical changes to the storage I/O software stack. I think that the keywords in your comment are “coming years”.
I do hope that we at Virsto will shorten the time horizon..
Alex Miroshnichenko 4:26pm PST on August 5th, 2009
Well, there are obviously alternatives to EMC among the storage hardware vendors, but I think that the right way to solve the problem is to do it in a vendor independent fashion, i.e. in the VMFS (or architecturally equivalent) software layer.
Scott Wilson 11:45am PST on January 30th, 2010
You don’t seem to mention other savings….. such as space saving costs. With blades you can have a ton of space and have all data on external storage. Power costs….. less downtime, ease of management.
It all adds up!