The first whole human genome sequence took 20 years and $3 billion to complete. The modern process takes a fraction of that time and cost. But companies, universities, and laboratories still have a big problem: how can they efficiently secure, mine, process, and share billions to trillions of sequence data objects?
The solution is high-performance genomics cloud computing, enabled by three core components:
- Integrated, multicloud, high-performance databases purpose-built for massive genome sequence data.
- Specialized genomics cloud platforms that support containers, workflow management, analytics, and secure sharing and collaboration.
- Fully managed file services that accelerate genomics cloud processing with high throughput, low latency, elasticity, secure sharing, and high-performance genomics analytics.
HPC in AWS: Purpose-Built Database + WuXi NextCODE
Genomics sequencing involves multiple steps and generates multi-terabyte data sets. The sequencing process starts in the field by taking samples from individuals. The samples go to the labs where researchers process them through sequencers to produce raw data files that are a terabyte or more in size. Users stream or batch files to specialized databases in the cloud like WuXi NextCODE.
WuXi NextCODE leads the world in human genomic data research for precision medicine, a type of medicine that takes into account individual genetic and environmental variability, among other factors. Let’s use an example to illustrate the sheer amount of data WuXi NextCODE handles: the average number of differences between just two individuals’ DNA is 5 million. Now, multiply 5 million differences by thousands, millions, or billions of people. WuXi NextCODE’s purpose-built database enables researchers to efficiently discover critical differences or mutations. This data drives research into the causes of rare diseases and cancers. Once identified, these discoveries enable researchers to develop more effective treatments.
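To make the scale concrete, here is a minimal Python sketch of the multiplication described above. It uses only the 5 million figure quoted in the text. The function name and cohort sizes are illustrative assumptions, and the straight multiplication is deliberately naive: it ignores any differences that people share, so treat the result as a rough order of magnitude rather than a real record count.

```python
# Naive scale estimate: differences per person multiplied by cohort size.
# The 5 million figure comes from the text; everything else is illustrative.
DIFFERENCES_PER_PERSON = 5_000_000


def naive_difference_records(people: int,
                             differences_per_person: int = DIFFERENCES_PER_PERSON) -> int:
    """Return people * differences_per_person, with no deduplication."""
    return people * differences_per_person


# Hypothetical cohort sizes: thousands, millions, billions of people.
for cohort in (1_000, 1_000_000, 1_000_000_000):
    records = naive_difference_records(cohort)
    print(f"{cohort:>13,} people -> {records:.1e} difference records")
```

Even the smallest cohort in this sketch yields billions of records, which is the kind of volume the text has in mind.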
The core technology of the WuXi NextCODE platform is the genomic relational database. Of the many genomic software products in the world, this database is the only purpose-built architecture to organize, mine, and share large-sequence genomic databases.
Genomics Cloud Infrastructure: HPC in AWS
Amazon Web Services (AWS) optimizes its cloud offerings for genomics processing and collaboration. Dynamic scalability and a broad ecosystem of genomics tools and partners enable researchers to process and share massive genomics datasets and workloads.
AWS customers can retain their on-premises computing environment and seamlessly bridge to AWS for low-cost big data storage and high-performance dataset processing.
Fully Managed File Services: NetApp Cloud Volumes Service for AWS HPC Apps
The third core component of cloud genomics processing is NetApp Cloud Volumes Service for AWS. Cloud Volumes Service (CVS) is a fully managed file service suited to HPC in the cloud. It provides highly scalable, durable, high-performance SMB shares on AWS for genomics cloud processing.
Advantages of Cloud Volumes Service for AWS:
- Elasticity. Cloud Volumes Service supports HPC on AWS and fast-growing databases (like WuXi NextCODE) with on-demand scaling. CVS lets end users create volumes and scale them up or down within seconds, while sustaining performance of over 460k IOPS.
- Convenience. Cloud Volumes Service fully integrates with standard and AWS-managed Active Directory.
- Increased efficiency. Cloud Volumes Service enables innovation and faster time to market with accelerated development. Machines and users create snapshot copies within a few seconds for effective data protection, testing and development, and copying and cloning.
- Speed. Speed is a critical factor in processing, storing, and analyzing genomic sequencing databases containing millions to trillions of sequences. CVS achieves over 460k IOPS on large genomics databases on AWS.
- Lower costs. Cost-effective cloud genomics processing needs both low-cost data storage and low-cost, high-performance processing. CVS is ideal for HPC in AWS because it provides both in one service, whereas the native AWS storage options trade one against the other.
- Security. The Cloud Volumes Service architecture enables strong user identification and data security. A secure central workspace lets authorized users create, mine, and share databases without physical data movement. Database owners protect IP by specifying compliance settings.
NetApp Cloud Volumes Service: Lowers Storage and High-Performance Computing Costs
AWS has three options for storing and analyzing large data sets:
- Option #1: Store data on AWS EBS or EFS and move datasets to EC2 for high-performance processing. Issue: Both storage tiers are more expensive than S3, making large genomics worksets an expensive proposition.
- Option #2: Store data on AWS S3 for low-cost storage and move working datasets to EC2 for processing. Issue: Ballooning cloud charges from moving massive datasets between tiers.
- Option #3: Store and process data on S3 to avoid high storage and data movement costs. Issue: Users still pay data access charges on S3, which balloons the cost of processing the data on the storage tier.
There is a fourth option: Deploy NetApp Cloud Volumes Service to lower costs and accelerate genomics processing on AWS.
Lower Costs for HPC on AWS
Genomics users frequently cite low storage costs as the reason for moving to AWS.
S3 is popular because it costs only $0.01-$0.02 per GB per month. On storage alone, this is lower than CVS, which charges $0.10 per GB per month for Standard. However, S3 costs rise significantly when processing data because S3 charges for data access. Although data access charges are nominally low, only $0.0004 per 1,000 GET requests, a large sequencing job can easily generate thousands of GET requests per second. Sustained over a long run, this adds hundreds of dollars to a single processing task on S3.
NetApp CVS does not charge for data access, which considerably lowers overall processing costs for HPC on AWS to about 60%-75% of a similar operation on AWS S3.
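The arithmetic behind these figures can be sketched in a few lines of Python. The prices are the ones quoted above. The request rate, job duration, and dataset size are hypothetical workload assumptions chosen only to illustrate the calculation, and the model ignores every other charge (compute, data transfer, other request types).

```python
# Illustrative cost arithmetic using the prices quoted in the text.
# Workload parameters (request rate, duration, dataset size) are assumptions.
S3_STORAGE_PER_GB_MONTH = 0.02   # upper end of the quoted $0.01-$0.02 range
CVS_STORAGE_PER_GB_MONTH = 0.10  # quoted CVS Standard price
S3_GET_PRICE_PER_1000 = 0.0004   # quoted S3 GET request price


def s3_get_cost(requests_per_second: float, hours: float) -> float:
    """Dollar charge for GET requests at a sustained rate over a job."""
    total_requests = requests_per_second * hours * 3600
    return total_requests / 1000 * S3_GET_PRICE_PER_1000


def monthly_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Dollar charge for keeping size_gb stored for one month."""
    return size_gb * price_per_gb_month


if __name__ == "__main__":
    # Hypothetical job: 5,000 GET requests per second sustained for 24 hours.
    print(f"S3 GET charges for one job: ${s3_get_cost(5_000, 24):,.2f}")

    # Hypothetical dataset: 50,000 GB (50 TB in decimal units).
    for name, price in (("S3", S3_STORAGE_PER_GB_MONTH),
                        ("CVS Standard", CVS_STORAGE_PER_GB_MONTH)):
        cost = monthly_storage_cost(50_000, price)
        print(f"{name} storage per month: ${cost:,.2f}")
```

Under these assumptions a single day-long job incurs about $173 in GET charges, so the access cost recurs with every run while the storage price difference is a fixed monthly amount. Whether the total favors one service depends on how often and how intensively the data is processed.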
Higher Performance: Tried and True
CVS raises performance with high IOPS: up to 460k IOPS with low latency on large genomics databases. WuXi NextCODE tested CVS on its real-world cloud genomics database on AWS. Here’s what the team found.
- Cut onboarding time from weeks to less than a weekend. Using NetApp Cloud Sync, included in Cloud Volumes Service, WuXi NextCODE onboarded 50TB of data in less than a weekend.
- CVS performed 3X faster than their previous cloud file systems. The difference in processing speed was remarkable. Workloads that previously could not finish or froze mid-run were processed in around an hour.
- Previously failed genomic workloads succeeded on Cloud Volumes Service. The team reran processes that had failed with previous software, and each one completed successfully. The last test was the most telling: a genome query containing 20 trillion data points had never successfully completed, timing out after 3 to 4 hours. With CVS, the query completed successfully in under 40 minutes.
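The figures in this list imply some rough rates, sketched below in Python. Two assumptions are not in the original: a "weekend" is taken as 48 hours, and terabytes are decimal (1 TB = 1,000,000 MB). Because the onboarding took less than a weekend and the old query never finished, both results are lower bounds rather than measurements.

```python
# Back-of-the-envelope rates implied by the test results above.
# Assumptions: a weekend is 48 hours; 1 TB = 1,000,000 MB (decimal units).

def average_throughput_mb_s(data_tb: float, hours: float) -> float:
    """Average MB/s needed to move data_tb terabytes within the given hours."""
    total_mb = data_tb * 1_000_000
    return total_mb / (hours * 3600)


def minimum_speedup(timeout_hours: float, completed_minutes: float) -> float:
    """Lower bound on speedup when the old run timed out before finishing."""
    return timeout_hours * 60 / completed_minutes


if __name__ == "__main__":
    # 50 TB onboarded in less than a weekend (assumed to be 48 hours).
    print(f"Onboarding: at least {average_throughput_mb_s(50, 48):.0f} MB/s on average")

    # Query that timed out after 3 to 4 hours, then finished in under 40 minutes.
    low, high = minimum_speedup(3, 40), minimum_speedup(4, 40)
    print(f"Query: at least {low:.1f}x to {high:.1f}x faster")
```

With these assumptions the onboarding averages roughly 289 MB/s or more, and the 20-trillion-point query runs at least 4.5 to 6 times faster than the point at which it used to time out.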