Maximizing data throughput: Multi-stream data transfer into Amazon EC2
It is common for cloud computing articles to talk at length about the abundant hardware resources the cloud can offer the modern researcher or analyst, but little is typically said about the back-end data store. Before any research in the cloud can take place, data must be staged where your cloud-based compute resources can reach it, and that staging step becomes non-trivial when your data sets are large.
The Amazon EC2 cloud provides large quantities of hardware suitable for high-speed, high-throughput scientific computing. Coupled with AWS storage and Amazon S3, it makes a formidable platform for anyone looking to do large-scale scientific computing on quantities of file-based data. In this post we’ll explore data ingress to EC2: download speeds out of EC2 are typically much higher than upload speeds into it, from both consumer-grade and enterprise-level Internet connections, and in practical use of AWS services we see far more data uploaded than downloaded. Large data sets get transferred into EC2, working data stays on cloud-local storage, and summarized, compact results are brought back from the cloud.
Let’s look at a few common cases for moving a large data set into AWS-hosted storage and explore the transfer rates, benefits and drawbacks of each approach.
Case #1: Consumer Grade Internet Connection
It is trivial to saturate the upstream pipe of a consumer-grade cable or DSL Internet connection using only a single transfer stream into EC2. Even with strong encryption on the ingress stream, the CPU demands are insignificant for modern personal computing hardware. Given a 512 kilobit upload speed, it is possible to upload 5.1 gigabytes in one day with no compression.
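As a concrete sketch of that single encrypted stream (the host name and paths below are placeholders, not details from the original transfers), one rsync over ssh is all it takes:

```shell
# A single encrypted transfer stream into EC2. The -e ssh option tunnels the
# rsync stream over ssh, encrypting data in flight; -a preserves file
# attributes and -P enables progress display and resumable transfers.
SRC="./dataset/"                               # local data to stage (placeholder)
DEST="ec2-user@ec2-host.example.com:/data/"    # EC2 endpoint (placeholder)

CMD="rsync -aP -e ssh $SRC $DEST"
echo "$CMD"    # printed here for illustration; run the command to start the upload
```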
Case #2: Mid-Range Business-Class Internet Connection
Among our clients we see average business-class Internet connections with six to eight megabits of upload capacity. Even at the low end of that range, a six megabit upload connection can still be saturated with a single, encrypted rsync into EC2. With no compression, it is possible to transmit 64 gigabytes a day from a mid-range, business-class Internet connection.
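The 64 gigabyte figure follows directly from the line rate; a quick back-of-the-envelope check:

```shell
# Back-of-the-envelope: a 6 megabit upload sustained for 24 hours.
MBITS=6
BYTES_PER_SEC=$(( MBITS * 1000000 / 8 ))              # 750,000 bytes/s
GB_PER_DAY=$(( BYTES_PER_SEC * 86400 / 1000000000 ))
echo "${GB_PER_DAY} GB/day"                           # prints "64 GB/day"
```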
Case #3: Enterprise-Level Internet Connection
With our larger clients it is not uncommon to see Internet connections with sustained data-out capabilities of 100 megabits or more. In this class of connection, WAN factors such as frame size, packet ordering and retransmissions start to come into play and require consideration to optimize the ingress rate. With little to no optimization, we have seen upper bounds of around 18 megabits on a single Internet-bound network stream. Even without compression or multiple streams, a connection sustaining this outbound rate can transfer nearly 200 gigabytes in a 24-hour period. The single-stream transfer could, of course, be optimized further, but enterprise network teams often have a philosophical objection to a single application consuming more than 50% of an Internet connection shared by potentially tens of thousands of people.
Special Case #1: Parallel S3 Streaming
If the inbound files are relatively homogeneous and reasonably large, we have seen instances where a parallel upload into S3 can be advantageous. S3 is globally distributed with massively scalable ingress. When file sizes are similar, it is possible to saturate an outgoing network pipe without worrying about one stream of data lagging behind the others because it is larger. It should be noted that once data is in S3, it can be presented via s3backer to a large number of EC2 instances at high speeds using the same principles of S3’s massive scalability.
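One way to sketch a parallel S3 upload (the bucket name is hypothetical, and this assumes the AWS command-line tools are installed and configured; the transfers described above may well have used different tooling):

```shell
# Build one `aws s3 cp` command per file; feeding them to `xargs -P`
# runs N concurrent upload streams into S3.
BUCKET="s3://my-ingest-bucket"    # hypothetical bucket name
STREAMS=4

list_uploads() {
  # Emit one upload command per regular file under the given directory.
  find "$1" -type f | while read -r f; do
    echo "aws s3 cp $f $BUCKET/"
  done
}

# To actually run the uploads, N at a time:
#   list_uploads ./dataset | xargs -n 1 -P "$STREAMS" -I{} sh -c '{}'
```

Because the files are similarly sized, each of the N streams finishes at roughly the same time, which is what keeps the outgoing pipe saturated.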
Special Case #2: Parallel Throttled rsync
There are cases where a single-stream approach out of an enterprise-class connection is insufficient for the data transfer needs. On a massive, shared network connection two undesirable conditions can be observed:
- A single upload will sometimes fall very short of the available network bandwidth of a particular site due to the network topology; and
- Multiple uploads will sometimes overwhelm a shared network connection causing complaints from other users and potential business disruption.
We have seen very good results packaging rsync transfers into multiple parallel streams, with each stream limited to a rigid amount of bandwidth, in situations where sustained throughput needs to be consistent and relatively peak-free but overall consumption must not exceed what the network administrators are comfortable with. This approach allows us to push the limits of a shared network connection without saturating it. We often use as much as 90% of the available network bandwidth, meaning that we can perform a significant amount of data transfer into the cloud in a minimal amount of time, with no perceivable disruption to the other users and applications on the network.
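A minimal sketch of the capped-stream approach, assuming a 100 megabit line, a data set pre-split into five chunks, and placeholder host and paths; rsync's --bwlimit option takes a per-stream cap in kilobytes per second:

```shell
# Cap each parallel rsync so the aggregate stays at ~90% of the line rate.
LINE_KBPS=12500                              # 100 megabit line in KBytes/s
BUDGET_KBPS=$(( LINE_KBPS * 90 / 100 ))      # stay at 90% of the line
STREAMS=5
PER_STREAM=$(( BUDGET_KBPS / STREAMS ))      # --bwlimit value per stream

launch_streams() {
  # One capped, encrypted stream per pre-split chunk of the data set.
  for chunk in chunk0 chunk1 chunk2 chunk3 chunk4; do
    rsync -a --bwlimit="$PER_STREAM" -e ssh \
      "./dataset/$chunk/" "ec2-user@ec2-host.example.com:/data/$chunk/" &
  done
  wait    # block until every stream completes
}

echo "per-stream cap: ${PER_STREAM} KB/s"    # prints "per-stream cap: 2250 KB/s"
```

Because each stream is rigidly capped, the aggregate rate never spikes past the agreed budget even if one stream briefly has the pipe to itself.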
The EC2 ingress rates are quite capable of handling large data transfers from connections with significant upstream bandwidth. Parallel transfers with capped streams can take maximum advantage of enterprise-class outbound connections without imposing any serious latency issues on other users within your organization.
If the data is compressible it is possible, using tools such as rsync, to greatly decrease the number of bits transmitted over the wire. In our experiments we were able to increase upload rates by a factor of 3 on heavily compressible ASCII text data used for genetic computations. We have pushed data set uploads of 250 gigabytes to EC2 in 22 hours by using a single encrypted and compressed rsync stream on an 8 megabit upload link.
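With rsync, the wire-level compression is a single flag; the sketch below (host and paths are placeholders) also sanity-checks the 250 gigabyte figure against the raw line rate and the roughly 3x compression observed:

```shell
# Compressed, encrypted single-stream upload: -z compresses data on the wire,
# -e ssh encrypts it, -a preserves attributes. Host/paths are placeholders.
compressed_upload() {
  rsync -azP -e ssh ./ascii-dataset/ ec2-user@ec2-host.example.com:/data/
}

# Sanity check on the 250 GB / 22 hour figure: raw capacity of an 8 megabit
# link for 22 hours, times ~3x compression on highly compressible ASCII data.
RAW_GB=$(( 8 * 1000000 / 8 * 22 * 3600 / 1000000000 ))   # ~79 GB raw in 22 h
echo "$(( RAW_GB * 3 )) GB effective"                     # prints "237 GB effective"
```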
It is unlikely that the ingress performance into Amazon for even a single EC2 instance will ever come into play unless you have an outbound Internet connection with sustained rates of a gigabit per second or more. Nevertheless, we recommend running at least an m1.large machine for data reception on the EC2 side of the transfer to handle the encryption demands. We have seen inbound transfer rates of encrypted data surpass the capabilities of the basic m1.small instance type.
When performing HTTPS-based transfers, transfer rates can be severely impacted by traversing corporate HTTP proxy and DNS filtering infrastructure. These HTTP security architectures tend to be designed for control over interactive web browsing rather than for sustained data transfer performance. If at all possible, keep these pieces of infrastructure out of your data transfer path.
Finally: if the data you need to move exceeds what your Internet connection can carry, fear not. There is a solution. Simply mail Amazon a USB disk drive or drive array with your data and they will promptly load it into an S3 bucket for you. As amazingly low-tech as this approach sounds, for multi-terabyte data transfers it can be surprisingly quick. After all, you should never underestimate the bandwidth of a truck full of drives...