What is AWS Big Data?
AWS big data refers to the collection, storage, and use of big data in AWS. It is supported by a range of services and capabilities, including analytics, highly scalable storage, and wide support for compliance regulations.
This article is part of our series on AWS database technology.
In this article, you will learn:
- 6 Big Data Analytics Options on AWS
- How Do AWS Big Data Solutions Work?
- AWS Big Data with NetApp Cloud Volumes ONTAP
6 Big Data Analytics Options on AWS
AWS’s most impressive support for big data implementations comes in the form of analytics solutions. The provider offers a variety of services that you can use to automate data analysis, manipulate datasets, and derive insights.
Amazon Kinesis
Kinesis is a service that enables you to collect and analyze real-time data streams. Supported streams include Internet of Things (IoT) telemetry data, website clickstreams, and application logs. You can export data from Kinesis to a variety of AWS services, including Redshift, Lambda, Elastic MapReduce (Amazon EMR), and S3 storage.
You can also use the Kinesis Client Library (KCL) to build custom applications that process streaming data. Typical uses include serving dynamic content, generating alerts, and powering real-time dashboards.
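As a rough sketch of the producer side of such an application, the snippet below builds the arguments for a single `put_record` call with boto3. The stream name `clickstream-demo`, the event fields, and both helper functions are hypothetical and chosen only for illustration. Actually sending the record requires boto3 and AWS credentials, so that import is kept inside `send_event`.

```python
import json


def build_put_record_args(stream_name, event, key_field="user_id"):
    """Build keyword arguments for the Kinesis PutRecord API call.

    Kinesis uses the partition key to group records within the stream,
    so the key is taken from a field that ties related events together.
    """
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event, sort_keys=True).encode("utf-8"),
        "PartitionKey": str(event[key_field]),
    }


def send_event(stream_name, event):
    """Send one event to a stream (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("kinesis")
    return client.put_record(**build_put_record_args(stream_name, event))


# Hypothetical clickstream event; nothing is sent until send_event() is called.
click = {"user_id": 42, "page": "/pricing", "ts": "2024-01-01T00:00:00Z"}
args = build_put_record_args("clickstream-demo", click)
print(args["PartitionKey"], len(args["Data"]))
```

Keeping the request construction separate from the network call makes the payload easy to inspect and test before anything is written to a real stream.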
Amazon EMR
EMR is a managed platform for distributed computing that you can use to process and store data. It runs Apache Hadoop, a well-established framework for big data processing and analysis, on clusters of EC2 instances.
When you implement EMR, it provisions, manages, and maintains your infrastructure for Hadoop, enabling you to focus on analytics. EMR supports the most commonly used Hadoop tools, including Spark, Pig, and Hive.
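To give a feel for the kind of work such a cluster distributes, here is a single-machine sketch of the MapReduce model that Hadoop popularized, applied to a word count. It does not use any EMR or Hadoop API. The function names and sample log lines are hypothetical, and on a real cluster each phase would run in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in sequence on one machine."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical application log lines used as input.
logs = ["error disk full", "warning disk slow", "error network down"]
counts = word_count(logs)
print(counts["error"], counts["disk"])
```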
AWS Glue
Glue is a service that enables you to process data and perform extract, transform, and load (ETL) operations. You can use it to clean, enrich, catalog, and transfer data between your data stores. Glue is serverless, meaning you are charged only for the resources you consume and do not have to provision infrastructure.
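The extract, transform, and load steps can be illustrated with a small plain-Python sketch. This is not Glue code and uses no Glue API. The CSV content, the field names, and the `is_large` enrichment rule are all assumptions made up for the example.

```python
import csv
import io

# Hypothetical raw export with messy values.
RAW_CSV = """order_id,country,amount
1001, us ,19.99
1002,DE,5.00
,FR,7.50
1003,us,not-a-number
"""


def extract(text):
    """Read raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Clean and enrich rows, dropping any that cannot be repaired."""
    cleaned = []
    for row in rows:
        order_id = row["order_id"].strip()
        if not order_id:
            continue  # skip rows without an identifier
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows with an unparseable amount
        cleaned.append({
            "order_id": int(order_id),
            "country": row["country"].strip().upper(),
            "amount": amount,
            "is_large": amount >= 10.0,  # enrichment: derived field
        })
    return cleaned


def load(rows, target):
    """Append transformed rows to the target store (a list here)."""
    target.extend(rows)
    return len(rows)


warehouse = []  # stands in for a real destination data store
loaded = load(transform(extract(RAW_CSV)), warehouse)
print(loaded, warehouse[0]["country"])
```

Two of the four input rows are dropped during the transform step, one for a missing identifier and one for an invalid amount, so only clean rows reach the target.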
Amazon Machine Learning (Amazon ML)
Amazon ML is a service that provides support for developing machine learning models without ML expertise. It includes wizards, visualization tools, and pre-built models to get you started. The service can walk you through evaluating data for training and optimizing your trained model to fit business needs. Once training is complete, you can access your model’s output through batch exports or an API.
Amazon Redshift
Redshift is a fully managed data warehouse service that you can use for business intelligence analytics. It is optimized for large SQL queries over structured and semi-structured data. Query results can be saved to S3 data lake storage and ingested by a variety of analytics services, including SageMaker, Athena, and EMR.
Redshift also includes a feature called Spectrum that you can use to query data in S3 without performing ETL processes. This feature evaluates your data storage and the requirements of the query, then optimizes the process to minimize the amount of S3 data that must be read. This helps reduce costs and shorten query times.
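One general way to read less data is to skip whole partitions that a query cannot match. The sketch below illustrates that idea only; it is not a description of how Spectrum is implemented. The object keys, sizes, and the `dt=YYYY-MM-DD` key layout are assumptions made for the example.

```python
from datetime import date

# Hypothetical S3 objects: key -> size in bytes, partitioned by day.
OBJECTS = {
    "sales/dt=2024-01-01/part-0.parquet": 400,
    "sales/dt=2024-01-02/part-0.parquet": 300,
    "sales/dt=2024-01-03/part-0.parquet": 500,
    "sales/dt=2024-01-04/part-0.parquet": 200,
}


def partition_date(key):
    """Extract the dt=YYYY-MM-DD partition value from an object key."""
    for segment in key.split("/"):
        if segment.startswith("dt="):
            return date.fromisoformat(segment[3:])
    raise ValueError(f"no dt= partition in {key!r}")


def prune(objects, start, end):
    """Keep only objects whose partition falls inside [start, end]."""
    return {
        key: size
        for key, size in objects.items()
        if start <= partition_date(key) <= end
    }


# A query restricted to two days only needs the matching partitions.
selected = prune(OBJECTS, date(2024, 1, 2), date(2024, 1, 3))
scanned = sum(selected.values())
total = sum(OBJECTS.values())
print(f"scanned {scanned} of {total} bytes")
```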
Amazon QuickSight
QuickSight is a service for business analytics that you can use to perform ad-hoc data analysis and build visualizations. You can use it to ingest data from numerous sources, including on-premises databases, exported Excel or CSV files, and AWS services such as S3, RDS, and Redshift.
QuickSight uses a “Super-fast, Parallel, In-memory Calculation Engine” (SPICE). This engine is based on columnar storage and uses machine code generation to run interactive queries. The engine persists the data you query until you manually delete it, so that subsequent queries are as fast as possible.
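The benefit of columnar storage can be shown with a small sketch: once rows are pivoted into columns, an aggregation only touches the columns it needs. This illustrates the general layout, not SPICE internals. The sample records and both function names are hypothetical.

```python
# Hypothetical sales records in row-oriented form.
rows = [
    {"region": "EU", "product": "A", "revenue": 120},
    {"region": "US", "product": "B", "revenue": 200},
    {"region": "EU", "product": "B", "revenue": 80},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {name: [] for name in records[0]}
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return columns


def sum_by(columns, group_col, value_col):
    """Aggregate value_col per group_col, reading only those two columns."""
    totals = {}
    for group, value in zip(columns[group_col], columns[value_col]):
        totals[group] = totals.get(group, 0) + value
    return totals


columns = to_columns(rows)
# The "product" column is never read by this aggregation.
print(sum_by(columns, "region", "revenue"))
```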
How Do AWS Big Data Solutions Work?
AWS offers numerous solutions that cover the entire big data management cycle, from collection to consumption. These tools and technologies make it possible and cost-effective to collect, store, and analyze your data sets.
Collection
Collection solutions focus on helping you accumulate your raw data, both structured and unstructured. Solutions can integrate natively with AWS services or ingest data gathered from exports.
In AWS, big data collection is supported by services and capabilities that include:
- Kinesis Streams and Kinesis Firehose for real-time data stream ingestion
- Integration with a range of services and data sources through manual import or API
Storage
Storing big data requires highly scalable solutions that can handle data before and after processing. These solutions are accessible to a variety of processing and analytics services and can typically be tiered to help you reduce storage costs.
In AWS, big data storage is supported by the following services:
- S3 and Lake Formation for object storage
- S3 Glacier and Backup for backups and archives
- Glue and Lake Formation for data cataloging
- Data Exchange for third-party data
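For S3, the tiering mentioned above can be expressed as a lifecycle rule. The following is a minimal sketch that builds one such rule as the dictionary that boto3's `put_bucket_lifecycle_configuration` call accepts. The prefix, the day thresholds, and both helper functions are hypothetical, and applying the rule requires boto3 and AWS credentials.

```python
def build_lifecycle_rule(prefix, glacier_after_days, expire_after_days):
    """Build one S3 lifecycle rule that archives, then expires, objects."""
    if glacier_after_days >= expire_after_days:
        raise ValueError("objects must be archived before they expire")
    return {
        "ID": f"tier-{prefix.strip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": expire_after_days},
    }


def apply_lifecycle(bucket, rules):
    """Apply rules to a bucket (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without it

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": rules},
    )


# Hypothetical policy: archive raw logs after 30 days, delete after a year.
rule = build_lifecycle_rule("raw-logs/", glacier_after_days=30, expire_after_days=365)
print(rule["ID"], rule["Transitions"][0]["StorageClass"])
```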
Processing and Analysis
Processing and analysis solutions enable you to transform raw data into data consumable for analytics. This generally involves sorting, aggregating, and joining data but can also involve applying new data schemas or translating data into different formats.
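The three operations named above (sorting, aggregating, and joining) can be sketched on a tiny in-memory dataset. The order and customer records are hypothetical, and a real pipeline would run the same steps on a distributed engine rather than on Python lists.

```python
from collections import defaultdict
from operator import itemgetter

# Hypothetical raw order records and a customer lookup table.
orders = [
    {"order_id": 3, "customer_id": "c2", "amount": 15.0},
    {"order_id": 1, "customer_id": "c1", "amount": 40.0},
    {"order_id": 2, "customer_id": "c1", "amount": 10.0},
]
customers = {"c1": "Acme", "c2": "Globex"}

# Sort: order the raw records by a key.
sorted_orders = sorted(orders, key=itemgetter("order_id"))

# Aggregate: total amount per customer.
totals = defaultdict(float)
for order in sorted_orders:
    totals[order["customer_id"]] += order["amount"]

# Join: attach the customer name from the lookup table to each total.
report = [
    {"customer": customers[cid], "total": total}
    for cid, total in sorted(totals.items())
]
print(report)
```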
In AWS, processing and analysis are supported by a range of services including:
- Elasticsearch Service for operational analytics
- Athena for interactive analytics
- Redshift for data warehousing
- EMR for big data processing
- Kinesis Analytics for real-time analytics
Consumption and Visualization
Consumption and visualization solutions help you derive and share insights from your data. These solutions enable you to explore your datasets and analysis and highlight those that are relevant or provide the most accurate predictions or recommendations.
In AWS, consumption and visualization of big data are supported by:
- QuickSight for visualizations and dashboards
- Deep Learning AMIs and SageMaker for machine learning and predictive analytics
AWS Big Data with NetApp Cloud Volumes ONTAP
NetApp Cloud Volumes ONTAP, the leading enterprise-grade storage management solution, delivers secure, proven storage management services on AWS, Azure, and Google Cloud. Cloud Volumes ONTAP supports a capacity of up to 368TB and a variety of use cases, such as file services, databases, DevOps, or any other enterprise workload. Its feature set includes high availability, data protection, storage efficiencies, Kubernetes integration, and more.
In particular, Cloud Volumes ONTAP helps address the challenges of database workloads in the cloud and fills the gap between your cloud-based database capabilities and the public cloud resources they run on.
Cloud Volumes ONTAP supports advanced features for managing SAN storage in the cloud, catering for NoSQL database systems, as well as NFS shares that can be accessed directly from cloud big data analytics clusters.
In addition, the built-in storage efficiency features have a direct impact on costs for NoSQL in cloud deployments. The data protection and flexibility provided by features such as snapshots and data cloning give NoSQL database administrators and big data engineers the power to manage large volumes of data effectively.