Dec 24, 2018 5:11:48 AM
NetApp has been working with Hortonworks for a number of years to deliver advanced solutions for big data analytics. NetApp offers industry-leading storage and data management, which complements the Hortonworks Data Platform (HDP) built on Hadoop.
In July 2018, Hortonworks announced general availability of HDP 3.0, the first major release since 2013. As you might expect, the big news with HDP 3.0 includes enriched cloud support, as well as support for containers, GPUs, and other enhancements found in Apache Hadoop 3.1. You can now deploy HDP 3.0 in the cloud, as you would on your premises, with automated provisioning on your choice of cloud providers.
NetApp® ONTAP® storage is certified with HDP 3.0 for on-premises systems and in the cloud. For on-premises systems, HDP 3.0 is certified with ONTAP storage running HDFS over a block storage protocol, such as Fibre Channel, iSCSI, or FCoE. In the cloud, HDP 3.0 is certified with Cloud Volumes ONTAP and Cloud Volumes Service running HDFS over NFS. NetApp also offers a native NFS connector, called the NetApp In-Place Analytics Module (NIPAM), which is currently certified on earlier versions of HDP with ONTAP on-premises storage. Certification for version 3.0 is expected in the coming months for both on-premises and cloud storage deployments. We’ll discuss that topic in a future blog.
Cloud Volumes Service is a native cloud service that offers high-performance file storage in the cloud, with NFS and SMB connectivity. It includes rich data management, such as efficient snapshot copies and clones, as well as an integrated backup service. You can deploy Cloud Volumes Service in seconds with three different performance tiers on AWS and Google. Azure offers a similar native Azure service called Azure NetApp Files.
The certified configuration with HDP 3.0 allows our customers to take advantage of the performance and reliability of Cloud Volumes Service in the cloud while benefiting from the data management capabilities built into HDFS.
Early Performance Tests
With functional certification and qualification complete, we wanted to get an idea of how the certified configuration performed.
Although Hortonworks doesn’t require performance testing for certification, we gathered some test data for this configuration without any significant amount of tuning to see how it would perform. We chose to compare the performance numbers against Amazon S3, which is commonly used for data analytics in the cloud. We found the preliminary performance numbers compelling.
Amazon S3 storage is one of the more affordable cloud storage options, but depending on how it is used, it can drive higher costs in other areas, such as compute and object API calls. Lower cost is good, but sometimes spending a little more money in one area may lower costs in another. When it’s deployed as a performance layer of a larger data lake, Cloud Volumes Service can reduce costs for compute and object API calls. You can read more about this topic in the blog titled Analytics in the Cloud: Breaking the Cost vs. Performance Tradeoff.
To measure the performance of Cloud Volumes Service with HDP 3.0 using HDFS over NFS, we chose two benchmarks: TeraSort and LLAP. Benchmarks aren’t a perfect tool, but they are useful for testing limits and setting basic expectations for the environment. There is no replacement, however, for testing with your actual data and applications.
For the TeraSort testing, we used a 1TB dataset generated with TeraGen. We found that NetApp Cloud Volumes Service performed 22% faster than competing object storage in the cloud.
Faster storage performance can mean less compute time spent waiting for I/O to respond, and therefore lower cost. More importantly, it means that you can get your results faster.
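To get a feel for what a "22% faster" result means for job runtime, here is a minimal sketch. The post does not say how the percentage was computed, so the sketch shows two common readings of the figure. The function names and the 100-minute baseline are hypothetical and are not measured results.

```python
def runtime_if_time_reduced(baseline_minutes: float, pct_faster: float) -> float:
    """Reading A: the job finishes in pct_faster percent less time."""
    return baseline_minutes * (1 - pct_faster / 100)


def runtime_if_throughput_increased(baseline_minutes: float, pct_faster: float) -> float:
    """Reading B: throughput is pct_faster percent higher for the same work."""
    return baseline_minutes / (1 + pct_faster / 100)


if __name__ == "__main__":
    baseline = 100.0  # hypothetical object-storage runtime in minutes, not a measurement
    for label, estimate in (
        ("time reduced by 22%", runtime_if_time_reduced),
        ("throughput raised by 22%", runtime_if_throughput_increased),
    ):
        print(f"{label}: {estimate(baseline, 22):.1f} minutes")
```

With the hypothetical 100-minute baseline, the first reading gives 78 minutes and the second gives roughly 82 minutes. Either way, compute nodes that bill by the hour spend less time on the same job.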
We also tested the same configuration using the LLAP benchmark and found that, on average, Cloud Volumes Service was 16% faster than the leading object storage. Peak throughput was 4323.55 MBps, which is a pretty solid number for this kind of workload in the cloud.
We will continue testing with Hortonworks, and we intend to offer additional guidance on recommended configurations for optimum performance and cost.
The Hortonworks HDP 3.0 certification for NetApp Cloud Volumes Service using HDFS over NFS means that you can deploy this solution for your big data projects with confidence. Even with limited performance testing, this solution demonstrates excellent performance for your data analytics projects.
Environment Setup Details
- 11 Cloud Volumes Service volumes (1 on each DataNode), S3 bucket on AWS
- 11 worker nodes (48 vCPUs, 374GB), 3 master nodes (36 vCPUs, 68.5GB)
- dfs.blocksize=512MB, dfs.replication=2
- yarn.nodemanager.resource.memory-mb=348160, yarn.scheduler.maximum-allocation-mb=348160, yarn.scheduler.minimum-allocation-mb=8192, yarn.scheduler.minimum-allocation-vcores=1, yarn.scheduler.maximum-allocation-vcores=38
- mapreduce.map.memory.mb=73728, mapreduce.reduce.memory.mb=147456, yarn.app.mapreduce.am.resource.mb=73728
- tez.am.resource.memory.mb=8192, tez.task.resource.memory.mb=73728, tez.runtime.io.sort.mb=6144
- LLAP=90% of YARN queue, hive.server2.tez.sessions.per.default.queue=1, hive.llap.daemon.yarn.container.mb=307200, num_llap_nodes_for_llap_daemons=9, hive.tez.container.size=6144, hive.llap.daemon.num.executors=38, llap_headroom_space=6144, llap_heap_size=204800, hive.llap.io.memory.size=67584
- fs.s3a.fast.upload=true, fs.s3a.fast.upload.buffer=disk, fs.s3a.multipart.size=100M, fs.s3a.block.size=536870912, fs.s3a.connection.maximum=30, fs.s3a.fast.upload.active.blocks=8, fs.s3a.max.total.tasks=5, fs.s3a.multipart.threshold=2147483647, fs.s3a.threads.keepalivetime=60, fs.s3a.threads.max=106
- For the dataset and LLAP queries, we used the tpcds git repo found here: https://github.com/hortonworks/hive-testbench
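
The memory settings above can be sanity-checked with a little arithmetic. The following is a rough sketch that uses only the values from the list. The helper names are hypothetical. The rule that the LLAP heap, I/O cache, and headroom should fit inside the daemon container is a common sizing guideline that we treat here as an assumption, not as something the certification requires.

```python
# Values copied from the environment list above (all in MB).
NODE_MEMORY_MB = 348160      # yarn.nodemanager.resource.memory-mb
MAP_MB = 73728               # mapreduce.map.memory.mb
REDUCE_MB = 147456           # mapreduce.reduce.memory.mb
LLAP_CONTAINER_MB = 307200   # hive.llap.daemon.yarn.container.mb
LLAP_HEAP_MB = 204800        # llap_heap_size
LLAP_IO_CACHE_MB = 67584     # hive.llap.io.memory.size
LLAP_HEADROOM_MB = 6144      # llap_headroom_space


def containers_per_node(node_mb: int, container_mb: int) -> int:
    """How many containers of a given size fit in one NodeManager by memory alone."""
    return node_mb // container_mb


def llap_slack_mb() -> int:
    """Memory left in the LLAP daemon container after heap, I/O cache, and headroom."""
    return LLAP_CONTAINER_MB - (LLAP_HEAP_MB + LLAP_IO_CACHE_MB + LLAP_HEADROOM_MB)


if __name__ == "__main__":
    print("map containers per node:", containers_per_node(NODE_MEMORY_MB, MAP_MB))
    print("reduce containers per node:", containers_per_node(NODE_MEMORY_MB, REDUCE_MB))
    print("LLAP daemon fits in NodeManager:", LLAP_CONTAINER_MB <= NODE_MEMORY_MB)
    print("LLAP slack (MB):", llap_slack_mb())
```

By memory alone, each worker node can hold four map containers or two reduce containers at these sizes. The actual number of concurrent tasks also depends on vcores and on what else is running in the queue.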