
Google Cloud Dataproc: 10 Best Practices for Google’s Big Data and Analytics Service

In this article, we’ll take a look at Google Cloud Dataproc and 10 best practices for using this managed service for big data workloads.

What Is Google Cloud Dataproc?

Google Cloud Dataproc is an easy-to-use, low-cost, managed Spark and Hadoop service within the Google Cloud Platform that lets you use open-source tools for processing massive amounts of data, big data analytics, and machine learning. It is integrated with other GCP services such as BigQuery, Bigtable, and Google Cloud Storage, and is built to handle the very large datasets typical of big data applications.

Google Cloud Dataproc Use Cases

Google Cloud Dataproc is a good fit when you already run applications on Hadoop but, with continued data growth and varying demand for resources, want to take advantage of cloud features such as autoscaling and elastic provisioning of resources. Other common use cases for Google Cloud Dataproc include fraud detection in banking and financial services, real-time AI, IoT data analytics, and log data processing.

What Is Included in Dataproc?

The open source platforms Google Cloud Dataproc is built on include:

  • Apache Hadoop: an open-source, Java-based framework that can efficiently store and process big data
  • Apache Spark: an analysis engine that is adept at processing data of massive sizes
  • Apache Pig: a high-level platform for analyzing large data sets, using a scripting language (Pig Latin) whose programs run on Hadoop
  • Apache Hive: a distributed, fault-tolerant data warehouse system that facilitates reading, writing, and managing large data sets

Cloud Dataproc is also integrated with Google Cloud Platform services such as BigQuery, Bigtable, Google Cloud Storage, Stackdriver Monitoring, and Stackdriver Logging.

When Should I Use Dataproc?

Cloud Dataproc is a fast, simple, low-cost, managed service for big data batch processing, querying, streaming, and machine learning. It integrates with several other Google Cloud Platform services and makes it easy to move Hadoop processing jobs to the cloud. This makes it an ideal service to use if you’re working with massive data sets.

For example, organizations can take advantage of Cloud Dataproc to read and process data from numerous IoT devices and then analyze and predict sales opportunities and growth potential for the company.

Now that we’ve peeked under the hood of Google Cloud Dataproc, let’s take a look at the best practices that you can follow when working with it.

Best Practice #1: Be Specific About Dataproc Cluster Image Versions

Dataproc image versions are an important part of how the service works. Cloud Dataproc uses images to bundle Google Cloud Platform connectors and Apache Spark and Apache Hadoop components into a single package that is deployed on a Dataproc cluster. If you don't specify an image version when creating a new cluster, Cloud Dataproc defaults to the most recent stable version.

Specifying an image version when creating clusters is a key best practice because it associates cluster creation steps with a specific Cloud Dataproc version in the production environment. The following command shows how you can link your cluster creation step with a particular minor Cloud Dataproc version:

gcloud dataproc clusters create demo --image-version 1.4-debian9

Best Practice #2: Use Custom Images at the Right Time

Image versions bundle the operating system, the connectors used by Google Cloud, and the individual big data components into a single package that is deployed as a whole on your cluster. Custom images go one step further: they let you bake your own additions into that package and then provision Dataproc clusters from it.

If you have dependencies that need to be shipped together with the cluster, such as native Python libraries, you should use custom images. Keep in mind that the image must be created from the most recent image in your target minor track.
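Assuming you have already built such a custom image, a cluster can be provisioned from it along these lines; the project ID, image name, and region below are placeholders for your own values:

gcloud dataproc clusters create demo \
    --image=projects/my-project/global/images/my-dataproc-custom-image \
    --region=us-central1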

Best Practice #3: Save Time by Submitting Jobs with the Jobs API

You can take advantage of the Google Cloud Dataproc Jobs API to easily submit jobs to an existing Dataproc cluster. The Jobs API also lets you separate the permissions of who can submit jobs to a cluster from who can access the cluster itself, without having to set up gateway nodes.

To submit a job to a cluster, make a jobs.submit call over HTTP using either the gcloud command-line tool or the Google Cloud Platform Console.
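As a rough sketch, a PySpark job could be submitted to the demo cluster created earlier with a command like the one below; the script location and region are placeholders:

gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/wordcount.py \
    --cluster=demo \
    --region=us-central1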

Best Practice #4: Have Complete Control Over Your Initialization Actions

Another best practice concerns the placement of initialization actions. What are initialization actions, and why are they needed?

Initialization actions let you apply your own customizations to Cloud Dataproc. When you create a Dataproc cluster, you can specify these actions as scripts and/or executables, which are then run on all nodes of the cluster as soon as it has been set up. In a production environment, it is good practice to run these initialization actions from a location that you control.
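For example, assuming your initialization script lives in a Cloud Storage bucket that you manage, you would attach it at cluster creation time roughly as follows (the bucket, script name, and region are placeholders):

gcloud dataproc clusters create demo \
    --region=us-central1 \
    --initialization-actions=gs://my-org-dataproc-init/install-deps.sh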

Best Practice #5: Stay Updated on Dataproc Release Notes

It’s always a best practice to stay informed about the product you’re using. Cloud Dataproc publishes weekly release notes covering each change made to the service. Review them from time to time to stay up to date.

Best Practice #6: Know What to Do When an Error Takes Place

When an error occurs, the first place to check for information is your cluster's staging bucket; the error message includes the Cloud Storage location of the staging bucket for that cluster. The default filesystem used by a cluster is HDFS, but you can change it to a Cloud Storage bucket if needed.
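As a quick sketch, you can look up a cluster's staging bucket and then browse its contents; the cluster name and region are placeholders, and the config.configBucket field shown here is an assumption based on the typical describe output:

gcloud dataproc clusters describe demo --region=us-central1 \
    --format='value(config.configBucket)'
gsutil ls gs://STAGING_BUCKET_FROM_ABOVE/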

Restarting jobs can help mitigate several types of job failures, such as out-of-memory errors and unexpected virtual machine restarts. Note that Dataproc jobs don't restart on failure by default; you need to set them to do so, and when you do, you also specify the maximum number of retries per hour (the maximum value is 10).
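For instance, a job can be made restartable at submission time with the --max-failures-per-hour flag; the script path, region, and retry count below are placeholders:

gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/streaming_job.py \
    --cluster=demo \
    --region=us-central1 \
    --max-failures-per-hour=5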

Best Practice #7: Take Advantage of Google Cloud Storage

Google Cloud Storage (GCS) provides object storage you can use to store any amount of data and retrieve it when needed. It offers virtually unlimited capacity, low latency, and worldwide accessibility and availability, and it is rapidly replacing on-premises Hadoop Distributed File System (HDFS) deployments. Use Google Cloud Storage as the primary data store: HDFS storage is tied to the cluster's compute nodes, so it can't be scaled independently.

If you store data in Cloud Storage instead of HDFS, you can access it directly from multiple clusters, which lets you set up and tear down clusters without moving the data. If you're using Cloud Dataproc, the Cloud Storage connector comes pre-installed.
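As a simple illustration, two different clusters can run jobs against the same Cloud Storage paths without any data movement; the cluster, bucket, and region names are placeholders:

gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/report.py \
    --cluster=analytics-cluster --region=us-central1 -- gs://my-bucket/events/
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/report.py \
    --cluster=ml-cluster --region=us-central1 -- gs://my-bucket/events/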

Another use case for Google Cloud Storage is as a destination for exported Dataproc logs. To export logs, you create a sink in the Cloud Logging API: a sink combines a filter, which selects the log entries you want to export, with a destination for those entries. The destination can be a Cloud Storage bucket, BigQuery, a Pub/Sub topic, and so on.
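A minimal sketch of creating such a sink with a Cloud Storage bucket as the destination follows; the sink name, bucket, and filter are placeholders you would adapt to your own project:

gcloud logging sinks create dataproc-log-sink \
    storage.googleapis.com/my-dataproc-logs-bucket \
    --log-filter='resource.type="cloud_dataproc_cluster"'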

Best Practice #8: Use Preemptible Workers

A cluster can have both primary and secondary workers, but primary workers are required; if you don't specify any when you create the cluster, Cloud Dataproc adds them for you automatically.

Secondary workers don't store data; they are processing nodes only, so you can use them to scale compute without scaling storage. Secondary workers come in two types: preemptible and non-preemptible. Keep the number of preemptible workers below half of the total number of workers in the cluster, where the total is simply the sum of primary and secondary workers.
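As a sketch, secondary preemptible workers can be requested when the cluster is created; the flag names assume a current gcloud version, and the worker counts are only illustrative (kept below half of the total):

gcloud dataproc clusters create demo \
    --region=us-central1 \
    --num-workers=4 \
    --num-secondary-workers=2 \
    --secondary-worker-type=preemptible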

Best Practice #9: Identify a Source Control Mechanism

You should identify a source control system to store your code, one that works for both your developers and your analytics users. There are various source control systems available; choose the one that provides the features your deployment requires.

Best Practice #10: Use Cloud Authentication and Authorization Policies

Cloud Dataproc supports security and access control features such as authentication and authorization, and it secures data in transit with encryption mechanisms such as SSL.

You should take full advantage of the controls available in Google Cloud Platform's Identity and Access Management (IAM).
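For example, here is a minimal sketch of granting a Dataproc-specific IAM role at the project level; the project ID and user are placeholders:

gcloud projects add-iam-policy-binding my-project \
    --member='user:analyst@example.com' \
    --role='roles/dataproc.editor'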

Getting More from Dataproc with Cloud Volumes ONTAP

Cloud Volumes ONTAP is a data management solution for AWS, Azure, and Google Cloud. This industry-leading solution is built on NetApp's ONTAP storage software and enables you to manage your data efficiently while leveraging the flexibility and cost benefits of the cloud.

For Dataproc and other Google Cloud deployments, Cloud Volumes ONTAP offers a way to lower your storage costs and protect your data with instant, space-efficient snapshot copies, data encryption, integrated block data replication, high availability, high-performance caching, and more. You can take advantage of Cloud Volumes ONTAP to optimize your cloud storage costs and increase application performance with a ready-to-go, easy-to-provision solution for storage availability.

Yifat Perry, Technical Content Manager
