Best Practices for Using Azure HDInsight for Big Data and Analytics
Azure HDInsight is a secure, managed Apache Hadoop and Spark platform that lets you migrate your big data workloads to Azure and run popular open-source frameworks including Apache Hadoop, Kafka, and Spark, and build data lakes in Azure.
This article will look at some of the best practices for using Azure HDInsight to help you utilize this service to the fullest.
- Metastore Best Practices
- Scaling Best Practices
- Architecture Best Practices
- Infrastructure Best Practices
- Migration Best Practices
- Performance Best Practices
- Storage Best Practices
- Security and DevOps Best Practices
What Is Azure HDInsight?
Azure HDInsight is a managed, cloud-based, open-source analytics service from Microsoft that gives customers broad analytics capabilities for big data. It helps organizations process large quantities of streaming or historical data. HDInsight is a cost-effective, enterprise-grade, Apache Hadoop-based distribution running on Azure.
Azure HDInsight Metastore Best Practices
The Apache Hive Metastore is an important aspect of the Apache Hadoop architecture since it serves as a central schema repository for other big data access resources including Apache Spark, Interactive Query (LLAP), Presto, and Apache Pig. It's worth noting that HDInsight's Hive metastore is an Azure SQL Database.
HDInsight offers two metastore options: default metastores and custom metastores.
- Default metastores are created free of charge for every cluster type, but a default metastore cannot be shared with other clusters.
- Custom metastores are recommended for production clusters; clusters can then be created and deleted without losing metadata. Use a custom metastore to isolate compute from metadata, and back it up periodically.
When you deploy HDInsight, a default Hive metastore is available, but it is transient: it is deleted as soon as the cluster is deleted. To keep the Hive metastore when the cluster is deleted, store it in your own Azure SQL Database as a custom metastore.
You can monitor your metastore's performance using tools such as the Azure portal and Azure Log Analytics. Make sure that your metastore and your HDInsight cluster are located in the same region.
Azure HDInsight Scaling Best Practices
Some services must be stopped and restarted manually when scaling down a cluster, because scaling a cluster while jobs are running might cause those jobs to fail.
HDInsight supports elasticity, which lets you scale the number of worker nodes in your clusters up and down. For example, you can shrink a cluster over the weekend and expand it at times of high demand. Likewise, it's a good idea to scale up the cluster before periodic batch processing so that it has enough resources, and to scale it back down to fewer worker nodes once processing has finished and utilization has dropped.
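The weekday and weekend pattern described above can be sketched as a small shell script. This is a minimal sketch, not a definitive implementation: the resource group `my-rg`, the cluster name `my-cluster`, and the worker counts are hypothetical, and the script only prints the `az hdinsight resize` command so that you can review it before running it yourself.

```shell
#!/bin/sh
# Sketch: choose a worker-node count by day of week and print the resize command.
# The counts, resource group, and cluster name are illustrative assumptions.

# target_workers DAY_OF_WEEK  (1 = Monday ... 7 = Sunday, as printed by `date +%u`)
target_workers() {
  case "$1" in
    6|7) echo 2 ;;   # weekend: shrink the cluster
    *)   echo 8 ;;   # weekdays: expand for higher demand
  esac
}

# resize_command RESOURCE_GROUP CLUSTER_NAME WORKER_COUNT
# Prints (does not execute) the Azure CLI call that would resize the cluster.
resize_command() {
  printf 'az hdinsight resize --resource-group %s --name %s --workernode-count %s\n' "$1" "$2" "$3"
}

today=$(date +%u)
resize_command my-rg my-cluster "$(target_workers "$today")"
```

A scheduler such as cron could run a script like this at a time when no jobs are active, which fits the earlier caution about scaling while jobs are running.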
Azure HDInsight Architecture Best Practices
Here’s the list of Azure HDInsight Architecture best practices:
- When migrating an on-premises Hadoop cluster to Azure HDInsight, use multiple workload-specific clusters rather than a single cluster. Keep in mind, however, that running many clusters over the long term will drive up your costs unnecessarily.
- Use transient, on-demand clusters that are deleted as soon as the workload is complete. This saves on resource costs because HDInsight clusters may not be in constant use. Deleting a cluster does not delete the associated storage accounts or external metastores, so you can use them later to recreate the cluster if needed.
- Because storage on HDInsight clusters is not co-located with compute and can reside in Azure Storage, Azure Data Lake Storage, or both, it's best to separate data storage from data processing. Decoupling storage from compute reduces storage cost, enables transient clusters, and lets you share data and scale storage and compute independently.
Azure HDInsight Infrastructure Best Practices
Here are the Azure HDInsight Infrastructure best practices:
- Capacity planning: The key choices for optimizing your HDInsight deployment are the region, storage location, VM size, VM type, and number of nodes.
- Script actions: Take advantage of script actions to customize your HDInsight clusters. Make sure that each script is stored at a URI that is accessible from the HDInsight cluster. Check the availability of Hadoop components in HDInsight and the supported HDInsight versions.
- Use Bootstrap: Another good idea is to use Bootstrap to customize HDInsight configurations. This lets you change config files such as core-site.xml, hive-site.xml, and oozie-env.xml.
- Edge nodes: Edge nodes can be used to access the cluster and to test and host client applications. Azure HDInsight gives you the elasticity to scale clusters up and down as needed. By placing the cluster in the same geographic region as the data, you can minimize read and write latency.
Azure HDInsight Migration Best Practices
Here's the list of Azure HDInsight Migration best practices:
- Migration Using Scripts: The Hive metastore can be migrated using scripts or using DB replication. If you're migrating the Hive metastore with scripts, generate Hive DDLs from the on-premises Hive metastore, edit the generated DDL to replace HDFS URLs with WASB/ADLS/ABFS URLs, and then run the modified DDL on the metastore. The metastore version must be compatible between the on-premises environment and the cloud.
- Migration Using DB Replication: If you're using DB replication to migrate your Hive metastore, you can use the Hive MetaTool to replace HDFS URLs with WASB/ADLS/ABFS URLs. For example:
./hive --service metatool -updateLocation <new-location> <old-location>
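For the script-based route, the URL replacement step can be sketched with `sed`. This is an illustrative sketch only: the NameNode address `hdfs://nn1:8020` and the ABFS target `abfs://data@myaccount.dfs.core.windows.net` are hypothetical placeholders, and the sample line stands in for DDL generated from your own metastore.

```shell
#!/bin/sh
# Sketch: rewrite the HDFS filesystem root in generated Hive DDL.
# OLD_ROOT and NEW_ROOT are hypothetical; substitute your own values.
OLD_ROOT='hdfs://nn1:8020'
NEW_ROOT='abfs://data@myaccount.dfs.core.windows.net'

# rewrite_ddl OLD NEW  (reads DDL on stdin, writes rewritten DDL on stdout)
# '|' is the sed delimiter because the URLs contain '/'. OLD is treated as a
# regular expression, so a root containing regex characters would need escaping.
rewrite_ddl() {
  sed "s|$1|$2|g"
}

# Example input: one LOCATION clause as it might appear in exported DDL.
echo "LOCATION 'hdfs://nn1:8020/warehouse/sales'" | rewrite_ddl "$OLD_ROOT" "$NEW_ROOT"
```

Reviewing the rewritten DDL before running it against the metastore is a cheap way to catch a mistyped container or account name.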
When migrating on-premises data to Azure, you have two options: migrating offline or migrating over TLS. The best choice for you will likely come down to the amount of data that you have to migrate.
- Migrating over TLS: To migrate over TLS, you can use Azure Storage Explorer, AzCopy, Azure PowerShell, or the Azure CLI to transfer data to Azure storage over the network. Alternatively, use native tools such as Apache Hadoop DistCp, Azure Data Factory, or AzureCp.
- Migrating offline: To ship data offline, you can use Data Box, Data Box Disk, and Data Box Heavy devices to transfer large amounts of data to Azure.
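One way to reason about the choice between the two options is to estimate how long a network transfer would take. The sketch below is simple arithmetic under stated assumptions: decimal units (1 TB = 10^12 bytes), a sustained link speed given in Mbps, and no protocol overhead. The 100 TB and 1000 Mbps figures are made up for illustration.

```shell
#!/bin/sh
# transfer_days SIZE_TB LINK_MBPS
# Prints the estimated transfer time in days, to one decimal place.
transfer_days() {
  awk -v tb="$1" -v mbps="$2" 'BEGIN {
    bits = tb * 8 * 1e12            # data size in bits
    seconds = bits / (mbps * 1e6)   # time at the sustained link speed
    printf "%.1f\n", seconds / 86400
  }'
}

transfer_days 100 1000   # prints 9.3
```

If an estimate like this runs to weeks, shipping the data offline is likely the better fit; if it comes out at hours or a few days, transferring over TLS is probably simpler.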
Azure HDInsight Performance Best Practices
Here are the Azure HDInsight performance considerations:
- Increase parallelism: When transferring data with Apache Hadoop DistCp, you can shorten the transfer time by giving DistCp more mappers. To minimize the impact of failures, use multiple DistCp jobs: if one fails, you only need to restart that job rather than all of them. If you have a few large files, consider splitting them into 256-MB chunks. You can also increase the number of threads running at any given time.
- Monitor Performance: Azure HDInsight gives you helpful insight into how your cluster is doing. Use it to retrieve metrics about CPU, memory, and network usage. You can configure Azure Monitor alerts to trigger when the value of a metric or the results of a query meet a predefined condition. Alerts can notify you via email, SMS, push, voice, an Azure Function, a webhook, or an ITSM tool.
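The DistCp advice above can be sketched as one copy job per top-level directory, each with a raised mapper count. This is a sketch under assumptions: the source and destination roots, the directory names, and the mapper count of 100 are hypothetical, and the script only prints the `hadoop distcp` commands (where `-m` sets the maximum number of simultaneous copies) rather than running them.

```shell
#!/bin/sh
# Hypothetical source and destination roots and mapper count.
SRC_ROOT='hdfs://nn1:8020/data'
DST_ROOT='abfs://data@myaccount.dfs.core.windows.net/data'
MAPPERS=100

# distcp_commands SRC_ROOT DST_ROOT MAPPERS DIR...
# Prints one DistCp command per directory so a failed job can be rerun alone.
distcp_commands() {
  src=$1; dst=$2; maps=$3
  shift 3
  for dir in "$@"; do
    printf 'hadoop distcp -m %s %s/%s %s/%s\n' "$maps" "$src" "$dir" "$dst" "$dir"
  done
}

distcp_commands "$SRC_ROOT" "$DST_ROOT" "$MAPPERS" sales logs clickstream
```

Splitting by directory is only one possible partitioning; any split that produces independent jobs gives the same restart benefit.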
Azure HDInsight Storage Best Practices
Selecting the right storage system for your HDInsight clusters is important because every workload has different business requirements, and those have to be reflected at the storage layer. Azure storage options include Azure Storage, Azure Blob Storage, and Azure Data Lake Store (ADLS).
The following are the Azure HDInsight Storage best practices:
- Storage Throttling: A cluster can hit performance bottlenecks caused by blocking input/output (I/O) operations when running tasks attempt more I/O than the storage service can handle. The blocked requests form a queue that is processed only once the current I/Os complete. The cause is capacity throttling, a cap imposed by the storage service per the service level agreement (SLA). You can avoid throttling by reducing your cluster size, configuring the self-throttling settings, or increasing the bandwidth allocated to your storage account.
- Decoupled compute and storage: In HDInsight, storage is isolated from compute resources. This means that even if you turn off the compute portion of your cluster, the data in the cluster would still remain intact and accessible.
- Choosing the right storage type: By default, HDInsight uses Azure Storage. You can choose one or more Azure Blob Storage accounts to store data, but note that each must be of type Standard_LRS, since Premium_LRS is not supported.
- Azure Data Lake Store: ADLS is another option for data storage. It is a distributed file system optimized for running parallel processing jobs. There are no storage size limits, either on file size or on account storage.
- Use multiple accounts: Try not to restrict your HDInsight cluster to a single storage account. It's recommended to have more than one storage account per HDInsight cluster, with one container per storage account, because each storage account offers additional networking bandwidth that helps compute nodes complete their jobs as quickly as possible.
For a 48-node cluster, the recommended number of storage accounts is 4-8.
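The 48-node figure works out to roughly one storage account per 6 to 12 nodes. As a rough illustration, the sketch below scales that ratio to other cluster sizes. Scaling the guideline linearly is an assumption made here for illustration, not a documented rule, so treat the output as a starting point for testing rather than a recommendation.

```shell
#!/bin/sh
# account_range NODES
# Scales the 48-node guideline (4 to 8 accounts) linearly: one account per
# 12 nodes at the low end, one per 6 nodes at the high end, rounding up.
# Linear scaling is an assumption, not a documented rule.
account_range() {
  low=$(( ($1 + 11) / 12 ))
  high=$(( ($1 + 5) / 6 ))
  echo "${low}-${high}"
}

account_range 48   # prints 4-8, matching the guideline
account_range 96   # prints 8-16 under the linear assumption
```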
Azure HDInsight Security and DevOps Best Practices
Use Enterprise Security Package (ESP) to protect and maintain the cluster. ESP provides Active Directory-based authentication, multi-user support, and role-based access control. It is available for several cluster types, including Apache Hadoop, Apache Spark, Apache HBase, Apache Kafka, and Interactive Query (Hive LLAP).
These steps are important for ensuring security in your HDInsight deployment:
- Azure Monitor: Take advantage of monitoring and alerting using Azure Monitor.
- Stay on top of upgrades: Upgrade to the most recent HDInsight version, apply the latest security updates and OS patches, and reboot nodes as needed.
- Enforce end-to-end enterprise security: Use auditing, encryption, authentication, and authorization, along with a private and protected data pipeline.
- Azure Storage keys should also be secured using encryption techniques. A Shared Access Signature (SAS) is a good way to restrict access to your Azure storage resources. Data written to Azure Storage is automatically encrypted using Storage Service Encryption (SSE) and replicated.
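As an illustration of restricting access with a SAS, the sketch below prints an Azure CLI command that would request a read-and-list-only, HTTPS-only, time-limited SAS token for a single container. The account name, container name, and expiry are hypothetical, authentication options are left out, and the script prints the command rather than running it.

```shell
#!/bin/sh
# sas_command ACCOUNT CONTAINER EXPIRY_UTC
# Prints an `az storage container generate-sas` call for a read/list-only token.
# Authentication flags are omitted here; add the ones your environment requires.
sas_command() {
  printf 'az storage container generate-sas --account-name %s --name %s --permissions rl --expiry %s --https-only\n' "$1" "$2" "$3"
}

# Hypothetical account, container, and expiry timestamp (UTC).
sas_command myaccount data 2030-01-01T00:00Z
```

Keeping the permissions and the expiry as narrow as the workload allows limits the damage if a token leaks.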
Upgrade to the latest HDInsight version regularly. To do this, follow the steps outlined below:
- Build a new test HDInsight cluster with the most current HDInsight version.
- Test the new cluster to ensure that your jobs and workloads run properly.
- Change jobs, applications, or workloads as needed.
- Back up any temporary data stored locally on the cluster nodes.
- Delete the existing cluster.
- Using the same default data and metastore, build a cluster on the latest HDInsight version in the same virtual network subnet as the previous one.
- Import any backups of temporary data.
- Lastly, start jobs or continue processing with the new cluster.
Better Than Best Practice: Enhance Azure HDInsight with Cloud Volumes ONTAP
We’ve seen how you can get the best results out of an HDInsight deployment with best practices, but if you need to reduce your costs and enhance your performance even more, NetApp has a solution.
Cloud Volumes ONTAP is a cloud data management solution available natively on AWS, Azure, and Google Cloud that is adept at delivering secure, proven storage management services.
Key benefits to using Cloud Volumes ONTAP for Azure with your HDInsight deployment include:
- Reduced Azure storage costs through cloud data storage efficiencies as high as 70%
- Azure high availability with RPO=0, RTO<60 second business continuity
- Multicloud and hybrid interoperability
- Zero-cost, instantly created writable clones of HDInsight clusters
- Automatic tiering of infrequently used data from Azure Disk to Azure Blob storage and back when needed for processing