Data Streaming for Analytics in a Hybrid Cloud Using Amazon S3

October 17, 2017

Amazon Simple Storage Service (Amazon S3) is a widely used, scalable object storage service that lets organizations store data in the cloud as objects organized into buckets.

Amazon S3 is one of the oldest, best-known, and most widely used cloud storage services available, but it is capable of a lot more than basic storage. Its use has expanded into new areas, including data streaming for analytics in hybrid cloud environments.

Hybrid architectures, which require companies to combine their on-premises data with their cloud data, have become a challenge for many businesses.

Established businesses with large in-house systems face the need to work with cloud data every time they implement a cloud service. Startups that grew up in the cloud often need to move partially back in-house in order to save on traffic costs. Dealing with the resulting mix of environments is no easy task, especially when it comes to data analysis and business intelligence (BI).

The scale and variety of the sources and connections that need to be incorporated keep growing. The challenge is to find a way to build and manage all of these dependencies so that they become a meaningful resource for the business.

In this post we’ll take a look at how Amazon S3 and NetApp’s Cloud Sync can be used to bring disparate data sources together for data analytics.

Compared to other cloud storage options for analytics, using Amazon S3 and Cloud Sync has advantages that BI architects, developers, and other data professionals can appreciate, both for the technical capabilities and for the cost-benefit ratio.

No matter what kind of endpoints have to be connected or where they are based geographically, using Amazon S3 and Cloud Sync together can address all these challenges.

Why Use Amazon S3 for Data Streaming Storage?

Amazon S3 offers easy cross-region replication, which keeps copies of your data in regions close to the teams and systems that use it, in much the same way a content distribution network does. Because replication is built into Amazon S3, is affordable, and doesn’t require additional development, it is an efficient way to distribute data across regions. This is a major advantage for improving performance and reducing lag in global operations.
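To make the "no additional development" point concrete, here is a minimal sketch of what turning on cross-region replication can look like with the boto3 SDK. The bucket names, the role ARN, and the rule ID are hypothetical placeholders, and the sketch assumes versioning is already enabled on both buckets, which replication requires.

```python
def build_replication_config(role_arn: str, destination_bucket_arn: str) -> dict:
    """Build a replication configuration that copies every new object."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "replicate-all-objects",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # an empty prefix matches every key
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


def enable_replication(source_bucket: str, config: dict) -> None:
    """Apply the configuration; needs boto3 and valid AWS credentials."""
    import boto3  # imported here so the sketch loads without the SDK installed

    s3 = boto3.client("s3")
    s3.put_bucket_replication(
        Bucket=source_bucket, ReplicationConfiguration=config
    )


# Hypothetical ARNs, for illustration only.
EXAMPLE_CONFIG = build_replication_config(
    role_arn="arn:aws:iam::123456789012:role/example-replication-role",
    destination_bucket_arn="arn:aws:s3:::example-analytics-eu",
)
```

Calling enable_replication("example-analytics-us", EXAMPLE_CONFIG) would then ask Amazon S3 to copy new objects to the destination bucket in the other region.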

Amazon S3 also gives you an advantage in terms of data privacy and secure sharing with vendors. Sharing data through Amazon S3 buckets is more secure than shipping data exported from a database, and it keeps the flexibility to connect securely with other platforms.

For further analysis of the stored data, it is possible to do aggregations and other transformations in memory. Amazon S3 makes it possible to create a variety of connections and allows third-party tools and services to connect to it through APIs, while keeping the low cost of Amazon S3 permanent storage.

Amazon S3 Pricing: A Fraction of the Standard Cost

The other major criterion for selecting a data streaming solution is cost. A full data warehouse (DWH) solution, such as one built on Amazon Elastic Block Store (Amazon EBS) or Amazon Redshift, costs over $115 and $175 per TB per month respectively. Amazon S3, by comparison, adds up to only about $23 per TB per month.
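The gap is easier to see with a concrete volume. The sketch below only multiplies out the per-TB figures quoted above; the 10 TB volume is an assumed example, and real bills also include request and transfer charges that are not modeled here.

```python
# Per-TB monthly prices in USD, taken from the figures quoted in this post.
PRICE_PER_TB_MONTH = {
    "Amazon S3": 23.0,
    "Amazon EBS": 115.0,
    "Amazon Redshift": 175.0,
}


def monthly_cost(service: str, terabytes: float) -> float:
    """Storage cost per month for the given volume."""
    return PRICE_PER_TB_MONTH[service] * terabytes


def savings_vs_s3(service: str, terabytes: float) -> float:
    """How much more the service costs per month than Amazon S3."""
    return monthly_cost(service, terabytes) - monthly_cost("Amazon S3", terabytes)


if __name__ == "__main__":
    volume_tb = 10  # assumed example volume
    for name in PRICE_PER_TB_MONTH:
        print(f"{name}: ${monthly_cost(name, volume_tb):,.0f} per month")
```

At 10 TB, this works out to $230 per month on Amazon S3 against $1,150 on Amazon EBS and $1,750 on Amazon Redshift.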

When compared to the prices of custom-built ETL solutions, plus the cost of the development and maintenance teams required to run those solutions, using Cloud Sync and Amazon S3 comes at an affordable fraction of the cost of the usual ETL options.

Another point worth mentioning is that Amazon S3 scales easily. Scaling a database solution (especially an in-house one) to handle peak times, on the other hand, brings additional costs, and the maintenance team needed to handle those peaks may require additional funding as well.

The high cost of using database engines with dedicated storage doesn’t just apply to peak seasons: the rest of the time, you are paying for unused capacity that goes to waste. Amazon S3, on the other hand, can easily be scaled up or down without limit according to your needs, which cuts costs.

Additional costs are saved on backup storage and replicas. Using Amazon S3 removes much of the need for separate database backups and replication, since the data is already stored redundantly with very high availability.

Amazon S3 is designed for 99.99% availability, backed by a service level agreement, and for 99.999999999% durability. Availability describes how often the data can be reached, while durability describes how unlikely it is that stored data will be lost. Together, these substantially reduce the additional storage costs that backups and replicas would otherwise require.
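Percentages with this many nines are hard to picture, so the following sketch converts them into more tangible numbers. It treats both figures as annual design targets, which is an assumption about how to read them rather than a guarantee, and the 10 million object count is just an example.

```python
from decimal import Decimal

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent: str) -> Decimal:
    """Minutes per year the service may be unreachable at this availability."""
    unavailable = (Decimal(100) - Decimal(availability_percent)) / 100
    return unavailable * MINUTES_PER_YEAR


def expected_objects_lost_per_year(durability_percent: str, objects: int) -> Decimal:
    """Average number of objects lost per year at this durability."""
    loss_rate = (Decimal(100) - Decimal(durability_percent)) / 100
    return loss_rate * objects


if __name__ == "__main__":
    print(downtime_minutes_per_year("99.99"))
    print(expected_objects_lost_per_year("99.999999999", 10_000_000))
```

Under those assumptions, 99.99% availability allows for roughly 53 minutes of unavailability per year, and eleven nines of durability means that with 10 million stored objects you would expect to lose a single object about once every 10,000 years.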

Finally, historical data can be stored at an even lower cost in an archive storage tier. Streaming generates a lot of data, and keeping historical data can be helpful for long-term evaluations. With Amazon S3, historical data can be moved to Amazon Glacier storage for a fraction of the cost (only $60 per TB per year).

Getting Data to Amazon S3 with Cloud Sync

Cloud Sync makes things even easier. When you face the challenge of syncing data from multiple endpoints, Cloud Sync can be used to build whatever links are necessary.

Cloud Sync has a host of features that make it the ideal service for data migration. The service supports both NFS and CIFS file shares and offers additional options to help comply with regulations that restrict the use of public clouds.

Cloud Sync also includes a web-based graphical UI that allows users to manage synchronization operations and review the statuses and logs of transfers. The user interface covers overall service management and usage costs, and it includes a wizard to help users build sync relationships.

Once a relationship is established, Cloud Sync allows for fully automated synchronization, which is much faster and more cost-efficient than other cloud synchronization services. Cloud Sync’s synchronization process handles the source files in parallel, and when processing updates to these files, it doesn’t send the full data again but synchronizes just the deltas.

Both factors combine to substantially cut down the time it takes to update data.
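To illustrate the general idea of parallel processing plus delta detection, here is a minimal sketch. It is not Cloud Sync’s actual implementation: it works at the level of whole files and uses content hashes to decide what changed, and all function names are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def file_digest(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan(root: Path, workers: int = 8) -> dict[str, str]:
    """Map each file's relative path to its digest, hashing files in parallel."""
    files = [p for p in root.rglob("*") if p.is_file()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = list(pool.map(file_digest, files))
    return {p.relative_to(root).as_posix(): d for p, d in zip(files, digests)}


def compute_delta(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two scans and list only the files that need to be transferred."""
    return {
        # New files, or files whose content hash changed since the last scan.
        "upload": sorted(k for k, d in current.items() if previous.get(k) != d),
        # Files that existed in the last scan but are gone now.
        "delete": sorted(k for k in previous if k not in current),
    }
```

A sync run would call scan() on the source, compare the result with the scan saved from the previous run, and transfer only the paths listed under "upload", which is why an update typically moves far less data than the initial copy.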

Cloud Sync also allows the costs associated with your DWH solution to be cut down. The price for the first five connections using Cloud Sync is only $0.15 per hour.
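For budgeting, the hourly rate can be converted into a monthly figure. The sketch below assumes that $0.15 per hour is a flat rate covering the first five connections together and that the service runs around the clock for a 30-day month; both are assumptions made for illustration, not published pricing terms.

```python
RATE_CENTS_PER_HOUR = 15  # $0.15 per hour, as quoted above


def monthly_cost_usd(days: int = 30, hours_per_day: int = 24) -> float:
    """Cost of running the service continuously for the given period."""
    # Work in whole cents to avoid floating-point rounding surprises.
    total_cents = RATE_CENTS_PER_HOUR * days * hours_per_day
    return total_cents / 100


if __name__ == "__main__":
    print(f"${monthly_cost_usd():.2f} per 30-day month")
```

Under those assumptions, continuous use comes to $108 for a 30-day month.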

Summary

Hybrid architecture is becoming more and more common, and with it come numerous complications. There is a strong case to be made for combining on-premises data storage for main business functions with cloud storage for functions that require flexibility or depend on third parties. Both established companies and startups tend to end up with some combination of the two.

For new businesses that have most of their data in the cloud, it may eventually become cost-effective to move the core of their business back in-house to save on ingress/egress costs.

For conservative companies, it is essential to incorporate some cloud services in order to stay competitive. In either case, these environments make it challenging to unite all the data in one place for analytics.

One of the simpler ways to address the analytics challenge is to incorporate Amazon S3 storage and NetApp’s Cloud Sync, which make it possible to process data from a variety of sources at a very reasonable cost.

Compared to other cloud services, the combination of Cloud Sync and Amazon S3 comes at just a fraction of the usual cost and offers a wide range of advantages, including:

scalability
very high availability
accessibility
resiliency
easy relationship creation and management
and automated synchronization to a variety of endpoints.

Cloud Sync offers a reliable, fast, and easy way to get your data where you need it to be so you can learn the most from it.

If you’re ready to pay less for data analytics, you can find a free, 14-day trial of Cloud Sync on the AWS Marketplace.

Gali Kovacs