Data analytics is an important part of business intelligence and Amazon EMR is one way AWS is making analytics easier to deploy. Amazon EMR gives users a wide range of capabilities for avoiding the hassles of managing analytics workloads, such as deploying short-term analytics clusters in just a few minutes or setting up permanent clusters for constantly running jobs.
In this article we go through some general considerations about running Amazon EMR directly on either Amazon EBS or Amazon S3. We’ll also see how NetApp® Cloud Volumes ONTAP, with the help of the NetApp In-Place Analytics Module, allows existing data workloads residing in NAS storage to be analyzed with all major big data frameworks, such as Hadoop, Presto, and HBase. Adding this module to the EMR cluster gives you the advantages of EMR without having to consume additional storage, replicate existing data sets, or make an extremely large number of API calls, which saves significant cost and operational effort.
What Is Amazon EMR?
Amazon EMR is the AWS platform for petabyte-scale Big Data workload analysis. EMR uses Amazon EC2 instances to quickly deploy the computing cluster that runs the analytic jobs with open-source tools such as Apache Hive, Spark, HBase, or Hadoop. On-prem storage, Amazon EBS, or Amazon S3 can be the underlying storage to serve as the repository for the data lake.
Running Amazon EMR on Amazon EBS
There are some points to consider when running Amazon EMR analytics on data stored in Amazon Elastic Block Store (Amazon EBS) volumes.
The first important aspect is related to management. Currently, AWS supports a maximum 16 TB volume size for each EBS volume. Depending on the size of the data set, you might need to create several volume mount-points for your cluster nodes in order to span the total data set size for persistent storage.
If your EMR workload is going to run against a data lake of a well-defined size, then this represents just a small admin overhead: you know exactly how much storage you need, and that’s not going to change. However, if the EMR cluster is going to run on a more dynamic data set, one which may grow over time, then expanding storage on a running cluster can become a problem. You need to either add more cluster nodes, remove expendable data, or launch a whole new cluster with all the added capacity.
One response to this under-provisioning is to use the elastic volumes feature of EBS, which lets you increase a volume’s size, adjust its performance, or change its volume type while it is in use. To take advantage of it, you need to implement a bootstrap action on your EMR cluster that monitors disk space and increases the volume size when it detects a volume reaching 90% used capacity. So, the main point is that sizing needs some admin attention. The volume size limit of 16 TB also means there is a cap to keep in mind; since analytics data sets are often extremely large, this is a real concern.
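The decision logic behind such a bootstrap action can be sketched roughly as follows. This is a minimal illustration, not an AWS-provided script: the function names, the 1.5x growth factor, and the use of GiB as the unit are assumptions made for the example, while the 90% threshold and the 16 TB per-volume cap come from the discussion above.

```python
import math
import shutil

# Per-volume cap discussed above, expressed in GiB for this sketch (assumed unit).
MAX_EBS_VOLUME_GB = 16 * 1024


def usage_fraction(path="/"):
    """Return the used fraction (0.0 to 1.0) of the file system holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def plan_resize(current_gb, used_fraction, threshold=0.90, growth=1.5):
    """Return the target volume size in GB (hypothetical helper).

    Below the threshold the current size is returned unchanged.
    At or above it, the size grows by `growth`, capped at the volume maximum.
    """
    if used_fraction < threshold:
        return current_gb
    return min(math.ceil(current_gb * growth), MAX_EBS_VOLUME_GB)


if __name__ == "__main__":
    fraction = usage_fraction("/")
    print(f"root file system is {fraction:.0%} full")
    print("target size for a 1000 GB volume:", plan_resize(1000, fraction))
```

In a real cluster, the planned size would then have to be applied through the EC2 ModifyVolume API, and the file system on the node extended afterwards; both steps are left out of this sketch. Note that once a volume reaches the cap, `plan_resize` returns the same size it was given, which is the signal that more volumes or more nodes are needed instead.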
Another important consideration is cost. Running an EMR cluster that requires persistent storage means paying for the EMR service, plus Amazon EC2 instances for compute, plus Amazon EBS volumes for storage. Storing the data set on EBS using HDFS (Hadoop Distributed File System) means that you need to attach the EBS volumes to the nodes’ local file systems and then account for the HDFS replication factor, which defaults to three in clusters of 10 or more nodes. That means that if your data set is 100 TB, you need to provision 300 TB of raw storage. Your storage costs are essentially going to triple.
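The sizing arithmetic above can be worked through in a few lines. This is a rough sketch under stated assumptions: the function names are hypothetical, the 16 TB per-volume cap is the figure cited earlier in this article, and the replication defaults for clusters smaller than 10 nodes (two for 4 to 9 nodes, one for fewer) are an assumption based on typical EMR defaults that you should verify for your release.

```python
import math

EBS_MAX_VOLUME_TB = 16  # per-volume cap cited earlier in the article


def default_replication(core_nodes):
    """Assumed default HDFS replication factor for a given number of core nodes."""
    if core_nodes >= 10:
        return 3  # the case described in the article
    if core_nodes >= 4:
        return 2  # assumption: verify against your EMR release
    return 1      # assumption: verify against your EMR release


def raw_capacity_tb(dataset_tb, replication):
    """Raw storage needed to hold `dataset_tb` at the given replication factor."""
    return dataset_tb * replication


def min_volumes(raw_tb):
    """Smallest number of maximum-size EBS volumes that can hold `raw_tb`."""
    return math.ceil(raw_tb / EBS_MAX_VOLUME_TB)


if __name__ == "__main__":
    replication = default_replication(core_nodes=10)
    raw = raw_capacity_tb(100, replication)
    print(f"replication={replication}, raw={raw} TB, volumes>={min_volumes(raw)}")
```

For the 100 TB example on a 10-node cluster, this gives 300 TB of raw storage, which needs at least 19 volumes at the maximum size. The estimate ignores the headroom HDFS needs for temporary and intermediate data, so real clusters would need more.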
Running Amazon EMR in Amazon S3 Buckets
The story is a little bit different with Amazon S3. S3 buckets do not have sizing limitations, so you can store a virtually unlimited number of objects in them. You can even do some fairly unusual things with this service, such as mounting S3 as a drive for a file system. Management issues such as the ones described above with EBS simply don’t happen here. You would only need to create new computing nodes if the existing ones start falling behind the storage consumption rate.
EMR uses a file system called EMRFS to interact directly with S3. All the data stored in S3 is read by Hadoop in parallel HTTPS streams. What’s the catch? While management issues are eliminated and storage is less costly, Amazon S3 charges for these HTTP calls. Every GET, SELECT, PUT, POST, and other operation comes with a charge. Since data analytics is often based on data sets of considerable size, this affordable workaround may wind up not being very cost-effective in the end. Do the math and check whether this would be your best scenario. If your costs will increase, it’s better to find an alternative model.
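One way to "do the math" is a quick back-of-the-envelope estimate of request charges. The sketch below is only an illustration: the function names are hypothetical, and the per-1,000-request prices are placeholder values that must be replaced with the current figures from the Amazon S3 pricing page for your region and storage class. It also assumes a fixed number of requests per job run; in practice a single large object can be read with many ranged GET requests, and listing operations add further requests.

```python
def request_cost(num_requests, price_per_1000):
    """Charge for `num_requests` requests at a price quoted per 1,000 requests."""
    return num_requests / 1000 * price_per_1000


def monthly_request_cost(reads_per_run, writes_per_run,
                         get_price_per_1000, put_price_per_1000,
                         runs_per_month=1):
    """Estimated monthly request charges for a recurring analytics job."""
    per_run = (request_cost(reads_per_run, get_price_per_1000)
               + request_cost(writes_per_run, put_price_per_1000))
    return per_run * runs_per_month


if __name__ == "__main__":
    # Placeholder prices in USD per 1,000 requests: illustrative only.
    GET_PRICE = 0.0004
    PUT_PRICE = 0.005
    estimate = monthly_request_cost(
        reads_per_run=10_000_000,
        writes_per_run=1_000_000,
        get_price_per_1000=GET_PRICE,
        put_price_per_1000=PUT_PRICE,
        runs_per_month=30,
    )
    print(f"estimated request charges: ${estimate:,.2f} per month")
```

Comparing this figure with the storage savings over EBS shows whether S3 is still the cheaper option for a given job frequency and object count.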
Getting More for EMR with Cloud Volumes ONTAP
If you have an existing Cloud Volumes ONTAP cluster on AWS, we have good news for you: Running EMR analytics on your stored NAS data in Cloud Volumes ONTAP is now possible with just a few configuration steps.
This is all possible thanks to the NetApp In-Place Analytics Module. This plug-in allows Big Data frameworks, such as Amazon EMR, to run analytics on existing NFS storage. The plug-in works with all major Hadoop distributions, giving you the ability to run Apache Hadoop, Tachyon, Apache HBase, Pig, Hive, or Apache Spark on AWS. It’s also easily installed in just a few steps.
The In-Place Analytics Module decouples analytics from the storage layer, allowing Hadoop to analyze the underlying NFS data directly. This makes it easy to run jobs on already-existing internal workloads without having to create a new storage silo for HDFS. It also supports high-speed 10 and 40 GbE networking, which provides high aggregate bandwidth.
Using this plug-in along with the known Cloud Volumes ONTAP functionalities can bring you numerous important advantages:
- A single storage back end to service both enterprise and analytics data.
- The ability to analyze file-based data sources such as emails, log files, text files, and source code repositories.
- NetApp FlexClone® data cloning technology allows you to instantly deploy volume clones on which you can run variations of analytics while keeping your main volumes dedicated to production workloads.
- Robust data reliability with NetApp storage snapshots, SnapMirror® data replication, and AWS high availability pair deployments.
- Reduced storage costs: Instead of having three HDFS copies, you can take advantage of the Cloud Volumes ONTAP high availability configuration, which uses just two storage nodes that replicate all the data between each other and that can automatically fail over if one should go down. You can save even more on cloud data storage costs through the signature ONTAP storage efficiency features.
- Tiering cold data automatically between low-cost Amazon S3 object storage and Amazon EBS disks as needed.
- No API costs when running EMR on data hosted by Cloud Volumes ONTAP.
You can also consider uploading your entire data lake to the public cloud and hosting it in Cloud Volumes ONTAP if it’s not there already. The data management features of ONTAP can make the storage costs of cloud computing and big data even lower than hosting data directly on Amazon EBS, which you can see demonstrated with the NetApp TCO Calculator, and can also save a lot of time. By moving all your data to the cloud with Cloud Volumes ONTAP, you eliminate the need to move the data between your repository and EMR whenever you run analytics jobs. All the data is in the same place and exactly where you need it, when you need it.
A Big Data Platform Without the Big Cost
As you can see, the NetApp In-Place Analytics Module, combined with EMR, makes it easy and cost-effective to run analytics on your existing NAS repositories, saving you the investment in additional storage and avoiding the complicated management tasks involved in analyzing Big Data on AWS.