NCBI recently hosted a webinar on accessing dbGaP data from the cloud (it has not been archived yet, but NCBI says it will be within a few weeks). The talk was delivered by Ben Busby, Ph.D. (Genomics Outreach Coordinator). It was interesting and timely for me, as I have been thinking about this problem a lot recently. Below are some resources I have found useful so far.
The slides are on the NCBI Courses & Webinars Page and the video will be placed on the NCBI YouTube channel.
The presentation walked through the technical steps for getting up and running with the SRA Toolkit in the cloud, using Amazon Web Services (AWS) as an example.
Before continuing, you might want to review cloud concepts and terminology. To do that, and perhaps try a simple test of cloud computing (running a basic RNA-seq analysis pipeline), check out this introductory tutorial we wrote: Intro to AWS Cloud Computing
The basic concept for dbGaP cloud computing is that you fire up an instance on Amazon (or another cloud provider), install the SRA Toolkit, configure it for access to the dbGaP data center, import the keys for authorized access (one NGC file per dbGaP study), and access SRA data by accession ID using the various utilities in the SRA Toolkit. If you have the key (NGC file) for the study containing a particular SRA accession, you will be able to access and decrypt that data.
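As a concrete illustration, here is a minimal sketch of that workflow on a fresh instance, assuming the SRA Toolkit is installed and on the PATH; the NGC filename and accession below are placeholders for your own study key and an accession you are authorized to access.

```python
import subprocess

def run(cmd):
    """Run a command, echo it, and fail loudly on error."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Import the dbGaP repository key (one NGC file per study); this creates a
# protected workspace that the toolkit uses for authorized, decrypted access.
run(["vdb-config", "--import", "prj_12345.ngc"])

# From within the protected workspace, fetch an encrypted run by accession...
run(["prefetch", "SRR1234567"])

# ...and convert it to FASTQ; decryption is handled by the toolkit because the
# imported key covers this study.
run(["fastq-dump", "--split-files", "SRR1234567"])
```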
Here are the relevant software packages and documentation.
To skip the installation, you can start a pre-configured instance of an Amazon Machine Image (AMI) called 'ngs-swift'. I have not yet verified all of the Amazon regions where this AMI is hosted, but I did find it among the "Community AMIs" available in the North Virginia region. That region is your best option for computing on data physically stored at dbGaP anyway.
This AMI is based on Ubuntu 14.04 and comes with the SRA Toolkit and various other bioinformatics tools installed.
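If you want to script the launch rather than click through the console, a minimal boto3 sketch like the following should work; the AMI ID, key pair name, and instance type are placeholders/assumptions, so look up the current ngs-swift AMI ID under "Community AMIs" in the N. Virginia region first.

```python
import boto3

# Assumes AWS credentials are already configured (e.g. via ~/.aws/credentials).
ec2 = boto3.resource("ec2", region_name="us-east-1")  # N. Virginia

instances = ec2.create_instances(
    ImageId="ami-xxxxxxxx",    # placeholder: the current ngs-swift community AMI ID
    InstanceType="m4.xlarge",  # assumption: size this to your workload
    KeyName="my-keypair",      # an EC2 key pair you already own
    MinCount=1,
    MaxCount=1,
)

instance = instances[0]
instance.wait_until_running()
instance.reload()  # refresh attributes such as the public DNS name
print("Launched", instance.id, "at", instance.public_dns_name)
```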
NCBI claims to have a 100 Gb/s connection to Internet2. I'm guessing that Amazon's North Virginia location is also well connected to Internet2. Performance will need to be verified and will obviously depend on load.
Using the SRA Toolkit, you can download (or supposedly stream) SRA raw data, pileups, FASTQ, or SAM.
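For example, once authorization is configured as above, the same accession can be pulled in different forms. This is a sketch using a placeholder accession and region, and the --aligned-region flag reflects my understanding of the toolkit's region syntax.

```python
import subprocess

acc = "SRR1234567"  # placeholder accession

# Stream alignments out as SAM, writing directly to a local file.
with open(acc + ".sam", "w") as sam_out:
    subprocess.run(["sam-dump", acc], stdout=sam_out, check=True)

# Generate a pileup restricted to a small region (name:from-to).
with open(acc + ".pileup", "w") as pileup_out:
    subprocess.run(
        ["sra-pileup", "--aligned-region", "chr1:1000000-1001000", acc],
        stdout=pileup_out,
        check=True,
    )
```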
All of the code for the SRA Toolkit is open source and available on GitHub:
- The SRA Toolkit and SDK
- The domain-specific API for accessing reads, alignments, and pileups produced from Next Generation Sequencing (see the sketch after this list)
- The back end engine for accessing SRA data
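The second repository also ships language bindings. As a rough sketch of what the read-level API looks like from Python (based on the ngs-python examples, so treat the exact method names as my best recollection rather than gospel), something like this iterates over a few reads of a public accession:

```python
from ngs import NGS
from ngs.Read import Read

# SRR000001 is just a small public accession used here for illustration.
with NGS.openReadCollection("SRR000001") as run:
    print(run.getName(), "contains", run.getReadCount(), "reads")

    # Iterate over the first five reads and print their IDs and bases.
    with run.getReadRange(1, 5, Read.all) as it:
        while it.nextRead():
            print(it.getReadId(), it.getReadBases())
```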
Since data transfers are encrypted and handled by software developed at NCBI, the security concern you need to think about most seriously is the instance you fire up on Amazon yourself. You can configure it as openly or as securely as you are able; it is on you to lock it down, maintain the security of your key files, and so on.
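As one small example of that lock-down, you could restrict SSH on the analysis instance to a single IP with boto3; the VPC ID and IP below are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a security group that only allows SSH from one workstation.
sg = ec2.create_security_group(
    GroupName="dbgap-analysis",
    Description="SSH from my workstation only",
    VpcId="vpc-xxxxxxxx",  # placeholder: your VPC ID
)

ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": "203.0.113.10/32"}],  # placeholder: your public IP
    }],
)
```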
And of course you need to have approved dbGaP access.
Finally, for all the data you download, you must abide by various policies and guidelines, including but not limited to:
- The protected data usage guide
- The code of conduct
- The NIH Genomic Data Sharing policy
- NIH Security Best Practices for Controlled-Access Data
This last one specifically discusses cloud usage and references best practice white papers developed by Amazon, Google, and DNAnexus.
Instead of going down this road entirely on your own, you might benefit from the groundwork of others who are already doing so. There are at least six relevant centers/initiatives that I am aware of: CGHub, NCI Data Commons, FireCloud, ISB, Seven Bridges, and Bionimbus.
On the topic of cost (all prices assume the N. Virginia region, are back-of-the-envelope figures, and are provided only as a reference point):
- Amazon does not charge for data coming into EC2. They want you to get data there so that they can charge you to store it and serve it up to your customers.
- You do have to pay for any data you store, even temporarily, on EBS volumes, S3, or Glacier. As a reference point, EBS provisioned IOPS (SSD) storage is $0.10 per GB-month, so ~$10,000 per month for 100 TB of buffer (see the quick arithmetic sketch after this list). The elasticity of cloud compute means you can grow or shrink this as needed and pay only for what you use; it is of course on you to manage this efficiently.
- Getting results back from EC2 to your local compute environment costs $0.09 per GB (assuming a volume of 10 TB/month), so ~$900 for 10 TB coming out in a month. If you only bring VCFs back for long-term storage and ad hoc analysis, how much volume would that be? Possibly even less than the order-of-magnitude drop I am assuming from raw data to the results you care to keep on your own disk long term.
- Finally, there are the actual compute costs. These really depend on what you do. Again, you only pay for instances while they are running analyses; then you shut them down. A lot will depend on how cleverly you configure your compute needs and how little local disk, CPU, and memory you can get away with. If you need serious boxes that look anything like the blades in a typical high-performance compute center, it will be very important to use them on demand only, as these can cost up to ~$1.50 per hour, per instance.
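To keep the arithmetic honest, here is the back-of-the-envelope sketch behind the numbers above; the monthly compute hours are purely an example, and all rates should be checked against current AWS pricing.

```python
# Rates as quoted in this post (N. Virginia region).
ebs_per_gb_month = 0.10   # EBS provisioned IOPS (SSD), $/GB-month
egress_per_gb    = 0.09   # data out of EC2 at the ~10 TB/month tier, $/GB
instance_per_hr  = 1.50   # upper bound for a large on-demand instance, $/hour

buffer_tb   = 100   # temporary EBS buffer
results_tb  = 10    # results brought back per month
compute_hrs = 500   # example only: instance-hours of analysis per month

storage_cost = buffer_tb * 1000 * ebs_per_gb_month   # ~$10,000 per month
egress_cost  = results_tb * 1000 * egress_per_gb     # ~$900 per month
compute_cost = compute_hrs * instance_per_hr         # ~$750 for this example

print("storage ~$%.0f, egress ~$%.0f, compute ~$%.0f"
      % (storage_cost, egress_cost, compute_cost))
```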