AWS Big Data: 6 Options You Should Consider
What is AWS Big Data?
AWS big data refers to the collection, storage, and use of big data in AWS. It is supported by a range of services and capabilities, including analytics, highly scalable storage, and wide support for compliance regulations.
In this article, you will learn:
- 6 Big Data Analytics Options on AWS
- How Do AWS Big Data Solutions Work?
- AWS Big Data with NetApp Cloud Volumes ONTAP
6 Big Data Analytics Options on AWS
AWS’s most impressive support for big data implementations comes in the form of analytics solutions. The provider offers a variety of services that you can use to automate data analysis, manipulate datasets, and derive insights.
Amazon Kinesis
Kinesis is a service that enables you to collect and analyze real-time data streams. Supported streams include Internet of Things (IoT) telemetry data, website clickstreams, and application logs. You can export data from Kinesis to a variety of AWS services, including Redshift, Lambda, Elastic MapReduce (Amazon EMR), and S3 storage.
You can also use Kinesis to build custom applications for streaming data using the Kinesis Client Library (KCL). Applications built with this library can deliver dynamic content, generate alerts, and feed real-time dashboards.
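As a rough illustration of the collection side described above, the sketch below prepares clickstream events for a Kinesis data stream. It only builds the request entries and splits them into batches; it does not call AWS. The event fields, the `session_id` partition key, and the stream name in the comment are hypothetical, and the 500-record batch size reflects the documented per-request limit of the PutRecords API.

```python
import json
from typing import Dict, Iterable, Iterator, List

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts up to 500 records per request


def to_kinesis_entries(events: Iterable[Dict], key_field: str) -> List[Dict]:
    """Convert events into the {"Data", "PartitionKey"} shape PutRecords expects."""
    return [
        {
            "Data": json.dumps(event).encode("utf-8"),
            # Records with the same partition key are routed to the same shard.
            "PartitionKey": str(event[key_field]),
        }
        for event in events
    ]


def batches(entries: List, size: int = MAX_RECORDS_PER_CALL) -> Iterator[List]:
    """Yield consecutive slices of at most `size` entries."""
    for start in range(0, len(entries), size):
        yield entries[start:start + size]


# Hypothetical clickstream events, used only for illustration.
clicks = [{"session_id": f"s{i % 3}", "page": "/home"} for i in range(1200)]
entries = to_kinesis_entries(clicks, key_field="session_id")

for batch in batches(entries):
    # With boto3 installed and credentials configured, each batch could be sent with:
    #   boto3.client("kinesis").put_records(StreamName="clickstream-demo", Records=batch)
    print(f"would send {len(batch)} records")
```

Choosing the partition key is a design decision: keying by session keeps one session's events in order on a single shard, at the cost of uneven load if a few sessions dominate the traffic.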
Amazon EMR
EMR is a framework for distributed computing that you can use to process and store data. It is based on Apache Hadoop and clustered EC2 instances. Hadoop is a well-established framework for big data processing and analysis.
When you implement EMR, it provisions, manages, and maintains your infrastructure for Hadoop, enabling you to focus on analytics. EMR supports the most commonly used Hadoop tools, including Spark, Pig, and Hive.
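To make the Hadoop processing model more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using word counting as the example job. It is an illustration of the programming model only, not EMR or Hadoop code; on a real cluster the framework runs these phases in parallel across many nodes. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence over all input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(pairs))


print(word_count(["big data on AWS", "Big clusters process big data"]))
```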
AWS Glue
Glue is a service that enables you to process data and perform extract, transform, and load (ETL) operations. You can use it to clean, enrich, catalog, and transfer data between your data stores. Glue is serverless, meaning you are only charged for the resources you consume and do not have to provision infrastructure.
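The ETL steps mentioned above can be sketched in plain Python. This is not a Glue job and does not use the Glue API; it only shows the shape of an extract, clean, enrich, and load pipeline on a tiny in-memory dataset. The CSV columns and the `REGION_BY_COUNTRY` lookup table are made up for the example.

```python
import csv
import io
import json

# Hypothetical raw export with inconsistent casing, stray spaces, and a missing value.
RAW_CSV = """order_id,country,amount
1001, us ,19.99
1002,DE,
1003,us,5.00
"""

REGION_BY_COUNTRY = {"US": "Americas", "DE": "EMEA"}  # hypothetical enrichment table


def extract(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Clean each row and enrich it with a region; drop rows without an amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        country = row["country"].strip().upper()
        cleaned.append({
            "order_id": int(row["order_id"]),
            "country": country,
            "region": REGION_BY_COUNTRY.get(country, "Unknown"),
            "amount": float(amount),
        })
    return cleaned


def load(rows):
    """Serialize rows as JSON Lines, a common format for downstream data stores."""
    return "\n".join(json.dumps(row) for row in rows)


print(load(transform(extract(RAW_CSV))))
```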
Amazon Machine Learning (Amazon ML)
Amazon ML is a service that supports developing machine learning models without ML expertise. It includes wizards, visualization tools, and pre-built models to get you started. The service can walk you through evaluating data for training and optimizing your trained model to fit business needs. Once complete, you can access your model’s output through batch exports or an API.
Amazon Redshift
Redshift is a fully managed data warehouse service that you can use for business intelligence analytics. It is optimized for large SQL queries over structured and semi-structured data. Query results can be saved to S3 data lake storage and ingested by a variety of analytics services, including SageMaker, Athena, and EMR.
Redshift also includes a feature called Spectrum that you can use to query data in S3 without performing ETL processes. This feature evaluates your data storage and the requirements of the query, then optimizes the process to minimize the amount of S3 data that must be read. This helps minimize costs and speeds up queries.
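One general technique for reading less data from object storage is partition pruning: when objects are laid out under Hive-style `name=value` prefixes, a query that filters on those names only needs to touch the matching prefixes. The sketch below illustrates that idea only; it is not how Spectrum is implemented internally, and the key layout and function names are assumptions made for the example.

```python
from typing import Dict, List


def parse_partitions(key: str) -> Dict[str, str]:
    """Read Hive-style `name=value` path segments from an object key."""
    partitions = {}
    for segment in key.split("/"):
        if "=" in segment:
            name, value = segment.split("=", 1)
            partitions[name] = value
    return partitions


def prune(keys: List[str], wanted: Dict[str, str]) -> List[str]:
    """Keep only the keys whose partition values match every filter."""
    return [
        key
        for key in keys
        if all(parse_partitions(key).get(name) == value for name, value in wanted.items())
    ]


# Hypothetical object keys in a partitioned data lake.
keys = [
    "logs/year=2020/month=01/part-0000.parquet",
    "logs/year=2020/month=02/part-0000.parquet",
    "logs/year=2021/month=01/part-0000.parquet",
]

# A query filtered to January 2020 would only need the first object.
print(prune(keys, {"year": "2020", "month": "01"}))
```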
Amazon QuickSight
QuickSight is a service for business analytics that you can use to perform ad hoc data analysis and build visualizations. You can use it to ingest data from numerous sources, including on-premises databases, exported Excel or CSV files, and AWS services such as S3, RDS, and Redshift.
QuickSight uses a “Super-fast, Parallel, In-memory Calculation Engine” (SPICE). This engine is based on columnar storage and uses machine code generation to run interactive queries. When you perform queries, the engine persists the data until the user manually deletes it, so that subsequent queries are as fast as possible.
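To show why columnar storage suits analytic queries, the following sketch pivots row-oriented records into one list per column and then aggregates a single column without touching the others. It is a toy model of the columnar idea, not of SPICE itself, and the field names are invented for the example.

```python
def to_columns(rows):
    """Pivot a list of row dictionaries into a dictionary of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def column_total(columns, name):
    """Aggregate one column; the other columns are never read."""
    return sum(columns[name])


# Hypothetical sales records in row form.
rows = [
    {"region": "EMEA", "units": 3, "revenue_cents": 1500},
    {"region": "APAC", "units": 1, "revenue_cents": 700},
    {"region": "EMEA", "units": 2, "revenue_cents": 900},
]

columns = to_columns(rows)
print(column_total(columns, "revenue_cents"))  # prints 3100
```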
How Do AWS Big Data Solutions Work?
AWS offers numerous solutions that cover the entire big data management cycle, from collection to consumption. These tools and technologies make it feasible and cost-effective to collect, store, and analyze your data sets.
Collection
Collection solutions focus on helping you accumulate your raw data, both structured and unstructured. Solutions can integrate natively with AWS services or ingest data gathered from exports.
In AWS, big data collection is supported by services and capabilities that include:
- Kinesis Streams and Kinesis Firehose for real-time data stream ingestion
- Integration with a range of services and data sources through manual import or API
Storage
Storing big data requires highly scalable solutions that can handle data before and after processing. These solutions are accessible to a variety of processing and analytics services and can typically be tiered to help you reduce storage costs.
In AWS, big data storage is supported by the following services:
- S3 and Lake Formation for object storage
- S3 Glacier and Backup for backups and archives
- Glue and Lake Formation for data cataloging
- Data Exchange for third-party data
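Storage tiering, mentioned above as a way to reduce costs, is commonly expressed in S3 as a lifecycle configuration. The sketch below defines one such configuration as a Python dictionary in the shape accepted by the S3 lifecycle API, plus a small helper that reports which storage class an object of a given age would be in. The rule ID, the `raw/` prefix, and the 30 and 90 day thresholds are illustrative assumptions, and the helper is a local simulation rather than an AWS call.

```python
# Shape accepted by boto3's put_bucket_lifecycle_configuration, e.g.:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake", LifecycleConfiguration=LIFECYCLE_CONFIGURATION)
# The bucket name above is hypothetical.
LIFECYCLE_CONFIGURATION = {
    "Rules": [
        {
            "ID": "tier-raw-data",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }
    ]
}


def storage_class_for_age(age_days, rule):
    """Return the storage class an object of the given age would have under `rule`."""
    current = "STANDARD"
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


rule = LIFECYCLE_CONFIGURATION["Rules"][0]
for age in (10, 45, 120):
    print(age, storage_class_for_age(age, rule))
```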
Processing and Analysis
Processing and analysis solutions enable you to transform raw data into data consumable for analytics. This generally involves sorting, aggregating, and joining data but can also involve applying new data schemas or translating data into different formats.
In AWS, processing and analysis are supported by a range of services including:
- Elasticsearch Service for operational analytics
- Athena for interactive analytics
- Redshift for data warehousing
- EMR for big data processing
- Kinesis Analytics for real-time analytics
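The joining, aggregating, and sorting steps described in this section can be sketched on a small in-memory dataset. This is a plain Python illustration of the transformation pattern, not code for any of the services listed; the record fields and function name are hypothetical.

```python
from collections import defaultdict


def revenue_by_segment(orders, customers):
    """Join orders to customers, sum revenue per segment, and sort descending."""
    segment_by_customer = {c["customer_id"]: c["segment"] for c in customers}

    totals = defaultdict(int)
    for order in orders:
        segment = segment_by_customer.get(order["customer_id"])
        if segment is None:
            continue  # inner join: skip orders without a matching customer
        totals[segment] += order["amount_cents"]

    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw records.
customers = [
    {"customer_id": 1, "segment": "enterprise"},
    {"customer_id": 2, "segment": "startup"},
]
orders = [
    {"customer_id": 1, "amount_cents": 5000},
    {"customer_id": 2, "amount_cents": 1200},
    {"customer_id": 1, "amount_cents": 2500},
    {"customer_id": 9, "amount_cents": 9999},  # no matching customer
]

print(revenue_by_segment(orders, customers))
```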
Consumption and Visualization
Consumption and visualization solutions help you derive and share insights from your data. These solutions enable you to explore your datasets and analysis and highlight those that are relevant or provide the most accurate predictions or recommendations.
In AWS, consumption and visualization of big data is supported by:
- QuickSight for visualizations and dashboards
- Deep Learning AMIs and SageMaker for machine learning and predictive analytics
AWS Big Data with NetApp Cloud Volumes ONTAP
NetApp Cloud Volumes ONTAP, the leading enterprise-grade storage management solution, delivers secure, proven storage management services on AWS, Azure, and Google Cloud. Cloud Volumes ONTAP supports a capacity of up to 368 TB and a variety of use cases, such as file services, databases, DevOps, or any other enterprise workload, with a strong set of features including high availability, data protection, storage efficiencies, Kubernetes integration, and more.
In particular, Cloud Volumes ONTAP helps address the challenges of database workloads in the cloud, filling the gap between your cloud-based database capabilities and the public cloud resources they run on.
Cloud Volumes ONTAP supports advanced features for managing SAN storage in the cloud, catering for NoSQL database systems, as well as NFS shares that can be accessed directly from cloud big data analytics clusters.
In addition, the built-in storage efficiency features have a direct impact on costs for NoSQL in cloud deployments. The data protection and flexibility provided by features such as snapshots and data cloning give NoSQL database administrators and big data engineers the power to manage large volumes of data effectively.
Learn More About AWS Big Data
AWS Data Lake: End-to-End Workflow in the Cloud
A data lake is a flexible, cost-effective data store that can hold very large quantities of structured and unstructured data. It allows organizations to store data in its original form and to perform search and analytics, transforming the data as needed on an ad hoc basis. Learn how AWS data lake solutions automate the entire data lake process, from data ingestion to analysis, using Data Lake Formation, Glue, Lambda, EMR, and more.
AWS Data Analytics: Choosing the Best Option for You
Big data solutions help organizations to efficiently store, catalogue, search, and analyze their data. AWS offers a wide range of services, each offering different capabilities. This article introduces common AWS Data Analytics offerings, and provides assessment questions.
AWS ElastiCache for Redis: How to Use the AWS Redis Service
AWS ElastiCache for Redis is the fully managed service for Redis, the open-source database and cache technology that is rapidly growing in importance in enterprise DevOps deployments. Using AWS ElastiCache for Redis, engineers can easily manage all aspects of their Redis clusters deployed on AWS, reducing operational costs for key tasks such as monitoring, maintenance, backing up data, recovering from failures, and updating software. In this blog we take a closer look at Redis, AWS ElastiCache for Redis, and how they can be used as critical parts of AWS database deployment, with a full step-by-step walkthrough to help you get started.
MongoDB on AWS: Managed Service vs. Self-Managed
MongoDB is a NoSQL database that can be a key enabler for AWS big data workloads. But with two different deployment options to choose from—either the managed service from AWS that supports MongoDB (Amazon DocumentDB) or self-managing your MongoDB database built on native AWS EC2 compute instances—users may need guidance on choosing which is the best way to run MongoDB on AWS.
This post compares the Amazon DocumentDB managed service with the self-managed, EC2-based MongoDB deployment option, and shows how Cloud Volumes ONTAP, the data management platform from NetApp, can bridge the gap and enhance MongoDB on AWS deployments.
Read more in MongoDB on AWS: Managed Service vs. Self-Managed
Elasticsearch in Production: 5 Things I Learned While Using the Popular Analytics Engine in AWS
This article gives a firsthand account of using Elasticsearch in production on AWS, sharing five lessons that are important to know if you’re just getting started with Amazon Elasticsearch Service, the fully managed Elasticsearch service on AWS. Find out how expectations compare with reality in terms of operational and management overhead, what extra features the AWS managed service has compared with the open-source version, and how costs and performance stack up.
Cassandra on AWS Deployment Options: Managed Service or Self-Managed?
Apache Cassandra started as a way for Facebook to search inboxes, but it’s grown into an open-source, scalable NoSQL database that is highly performant and highly available. How will it affect your AWS big data workloads?
A big part of answering that question is deciding which deployment option you’ll use: the managed service for Cassandra on AWS, Amazon Keyspaces, or your own Cassandra database deployed on AWS-native EC2 instances. This article will show you the pros and cons of each approach and how Cloud Volumes ONTAP can help.