Here is the story of a 10,600-instance HPC cluster (each instance a multi-core server) created in two hours with CycleCloud on Amazon EC2, with one Chef 11 server and one purpose in mind: to accelerate life science research on a cancer target for a Big 10 pharmaceutical company.
Our story begins…
When we got a call from our friends at Opscode about scale-testing Chef, we had just the workload in mind. As it happened, one of our life science clients was running a very large-scale computation against a cancer target. And let us tell you: knowing that hardcore science is being done on the infrastructure below is a very satisfying thing:
That’s right, 10,598 server instances running real science! But we’re getting ahead of ourselves…
Unfortunately, we’re a bit limited in what parts of the science we can talk about, other than to say we ran a large-scale computational chemistry job to simulate millions of compounds that may interact with a protein associated with a form of cancer. We estimated this would take about 341,700 compute-hours. This is very cool science, the kind that would take months to years on available internal capacity! More on that later…
Thankfully, our software has been doing a lot of utility supercomputing for clients, and as we mentioned last week, because of this we’re hiring.
So to tackle this problem, we decided to build software to create a CycleCloud utility supercomputer from 10,600 cloud instances, each of which was a multi-core machine! This makes it the largest cloud HPC environment by server count that we know of, or that has been made public to date (the previous utility supercomputing record was our 6,732-instance cluster for Schrödinger in 2012).
If this cluster were a physical environment, analysts said it would occupy 12,000 sq ft of data center space and cost $44 million. Instead, we created it in two hours with these 10,600 hosts, used it for nine more at a peak cost of $549.72 per hour, and then turned it off, for a total cost of $4,362.
In creating this environment, we’re also happy to tell you we used a single open-source Chef 11 server on a CC2-class machine! It took a mere two hours to get this capacity from Amazon. We proceeded to run the 341,700 hours of computational chemistry, or 39 compute-years, against this protein target, and then shut it all down.
So, 10,600 servers, 39 years of compute in 11 hours, on the equivalent of $44 Million in infrastructure, for only $4,362!
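As a quick sanity check on these headline numbers, here’s a minimal back-of-envelope sketch in Python. The 100-core internal-cluster size is a hypothetical assumption for illustration only; every other input comes straight from the figures above:

```python
# Back-of-envelope check of the cluster's headline numbers.
# All inputs except INTERNAL_CORES are taken from the post;
# INTERNAL_CORES is a hypothetical internal-capacity assumption.

TOTAL_COMPUTE_HOURS = 341_700   # estimated computational chemistry workload
INSTANCES = 10_600              # cloud instances in the cluster
PEAK_COST_PER_HOUR = 549.72     # peak hourly cost for the whole cluster, USD
TOTAL_COST = 4_362              # total bill for the run, USD
INTERNAL_CORES = 100            # hypothetical internal cluster size (assumption)

compute_years = TOTAL_COMPUTE_HOURS / (24 * 365)           # ~39 compute-years
peak_per_instance = PEAK_COST_PER_HOUR / INSTANCES         # ~$0.052/instance-hour
internal_days = TOTAL_COMPUTE_HOURS / INTERNAL_CORES / 24  # ~142 days in-house

print(f"{compute_years:.1f} compute-years")
print(f"${peak_per_instance:.4f} per instance-hour at peak")
print(f"~{internal_days:.0f} days on a {INTERNAL_CORES}-core internal cluster")
```

Running the numbers this way shows why the economics work: at peak, each multi-core instance cost only about a nickel per hour, and even the peak price sustained for the full 11 hours would have stayed well above the actual $4,362 bill.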
Simply put, Chef 11 was up to the task.
Now, we know that the latest version of Chef, rewritten around an Erlang and PostgreSQL core for scale, is supposed to be faster and better than the Ruby and CouchDB version, but we wanted to put it through its paces ourselves.
And it passed! Boy, did it. It was very cool when we ran knife and saw:
Here’s the view from our CycleServer plug-ins, showing a heck of a lot of servers that had successfully converged:
Lastly, that’s a heck of a lot of servers running science. And you can see from our Ganglia view that Cycle’s software has the cluster red hot, using 99% of the CPU:
So there we have it: we just handled 10,600 servers. Our software built the environment, secured it, scheduled data across it, scaled it, and tracked everything for audit and reporting purposes, while Chef 11 handled configuration for all of the nodes. And now we’re ready to add zeros, and so is our software.
If you’ve got scientific, engineering, or finance questions that need “more zeros”, we’d love to hear from you!