Is this going to be a game changer in terms of mining of genomics data ? Will it be accessible by everyone?
"Google Genomics to help the life science community organize the world's genomic information and make it accessible and useful, using some of the same technologies that power Google services like Search and Maps"
My IT dept still says we're not allowed to send Human genomes off campus. Security. Doesn't matter how hard they promise to obey HIPAA, when it's out of our hands the IRB gets irate.
Local security policies take precedence over everything else. On the other hand, the assumption that cloud computing services are inherently insecure is something that has to change over time. Even NIH is taking (baby) steps in this direction: http://gds.nih.gov/pdf/NIH_Position_Statement_on_Cloud_Computing.pdf
At the end of the day, Google Compute/Genomics is a service that you are paying for under a legally binding agreement which will clearly define the responsibilities of the respective parties (security and otherwise). Considering the cost of local infrastructure and the clout Google/Amazon/Microsoft bring via their volume purchasing power, it is only a matter of time before the cloud gets accepted as a viable option for doing genomics.
There is always a lot of hype around the first press release. Right now, they offer a hosted instance of GATK which may be useful but is hardly earth-shattering. I agree, let's wait and see what users are able to do with it.
The download speed to my AWS server hosted in Frankfurt (I live in Germany) is 50 Mb/s. My upload speed, however, is only 5 Mb/s. Therefore, uploading 1 TB of data to AWS would take roughly 444.4 hours, or about 18 and a half days, assuming I get the full rate with no dropouts.
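If you want to plug in your own link speed, here's a quick back-of-the-envelope sketch of that arithmetic (the 1 TB dataset and 5 Mb/s sustained upload are just the assumptions from my case above):

```
# Back-of-the-envelope upload-time estimate.
# Assumptions: 1 TB of data, 5 Mb/s (megabits/s) sustained upload, no dropouts.

data_bytes = 1e12           # 1 TB in bytes
upload_bits_per_s = 5e6     # 5 Mb/s upload link

seconds = data_bytes * 8 / upload_bits_per_s   # bytes -> bits, then divide by link speed
hours = seconds / 3600
days = hours / 24

print(f"{hours:.1f} hours (~{days:.1f} days)")  # ~444.4 hours, ~18.5 days
```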
Of course you wouldn't have to transfer ALL the data before initialising some kind of analysis, but the fact of the matter is that transfer time to cloud services is huge - and it's only going to get worse. Illumina can halve its prices a lot quicker than even the most efficient governments can dig up all the road from here to Frankfurt and lay twice the fibre.
I think it would be much more efficient to take the sequencing machine to Google. But then you'd need the biological samples and the wetlab scientists there too. And I guess having the Bioinformaticians in proximity would be nice too. That's a pretty broad range of people, so I guess you could call this new facility 'The Broad Institute of Google', or something.
So you are confirming that cloud computing is not for you (at least at this time). No disagreement there.
But that is not the case for everyone. There are many institutions (specifically those on Internet2 in the US) that have direct peering connections with cloud providers (like Google) at 10 Gb/s both ways (and multiple links in some cases), so moving data around is not a huge hurdle. Furthermore, once you get your data into the cloud, most of the providers do not use the commodity internet for their internal networks, so you are no longer subject to commodity network congestion problems.
You must be aware that Illumina hosts BaseSpace in AWS and hundreds of machines use it around the world to stream sequence data directly into the cloud.
Yeah, sure, this is definitely just my personal experience with AWS (and it hasn't stopped me from using AWS daily either) but I think it's an issue that's likely to get a lot worse before it gets better.
The early adopters of cloud services will probably have a good time, but once everyone in the institute is trying to send data back and forth down their internet tubes, I expect it's going to get sluggish. Not to mention it's going to affect services like e-mail and web browsing. The number of DPI routers you'd need to manage that traffic in even a small institute probably costs about the same as several really nice compute servers :)
Here's a little quote I found from the Chief Technology Officer at Cloudera (experts in Hadoop/etc) - Mr Amr Awadallah:
A different approach to data is needed to address this problem. Instead of copying the data to go where the application is, have the application go where the data is. Bring the applications to the data. Big data is about this more than anything else.
which you might think is a silly thing for a Hadoop guy to say since our data is not where his applications are - but he and Google Genomics come from a world very different to ours. Their data comes in from many distributed users on the internet, so what he's really saying is 'give us your data directly, and then use our servers directly to analyse it' which makes sense when you want to take over the world.
But when all your terabytes of data come from a single machine down the corridor, everything that these guys are good at (distributed data collection, message queues, embarrassingly parallel algorithms optimised to death and then some) is no longer relevant. What we really want from the cloud is... a bunch of processors :)
And the cloud has that - but the main selling point of renting processing power on the cloud is that you can accommodate highly variable demand easily. 1 server on Monday, but when you hit the front page of Reddit on Monday night, 100 servers on Tuesday. But is that realistic for a sequencing centre's demand? If not, are you happy paying a premium in variable per-hour fees when you're always going to be using the same number of cores 24/7? :)
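To make that question concrete, here's a rough sketch comparing steady 24/7 on-demand usage against a bursty workload. Every price in it is a made-up placeholder, not a real quote - the point is only the shape of the comparison, not the numbers:

```
# Illustrative only: all prices below are hypothetical placeholders, not provider quotes.
ON_DEMAND_PER_CORE_HOUR = 0.05   # assumed cloud price per core-hour
LOCAL_SERVER_COST = 20000.0      # assumed cost of a 32-core local server
LOCAL_LIFETIME_YEARS = 4         # amortisation period for the local box
CORES = 32

hours_per_year = 24 * 365

# Steady 24/7 usage: the on-demand bill just keeps running.
cloud_steady = ON_DEMAND_PER_CORE_HOUR * CORES * hours_per_year
local_steady = LOCAL_SERVER_COST / LOCAL_LIFETIME_YEARS  # per year, ignoring power/admin

# Bursty usage: you only pay while the instances are actually up.
utilisation = 0.10               # busy 10% of the time
cloud_bursty = cloud_steady * utilisation

print(f"cloud, 24/7:   ${cloud_steady:,.0f} / year")
print(f"local, 24/7:   ${local_steady:,.0f} / year (amortised)")
print(f"cloud, bursty: ${cloud_bursty:,.0f} / year")
```

Under these toy numbers, a constant 24/7 load is cheaper on the local box, and the cloud only wins once your utilisation drops well below 100% - which is exactly the question a sequencing centre has to answer for itself.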
I'm not saying the cloud has no benefits - for some niches it's going to really change things up, particularly real-time analysis of small data that has costly algorithms to analyse it. But I'll put my neck out and say BaseSpace or anything similar is not going to be revolutionary for medium/large institutes that already have a Bioinformatics core. These guys would be better off writing software to bring to our data - but then the skill set of a Hadoop engineer is not what you want for writing novel bioinformatics software, in my opinion.
This is a really good point. I also think that some of the mentality and approaches need to adapt to the specifics of bioinformatics use cases. We really need to stop storing useless data. Say, out of a human genome covered at 50x, all we need are the variants - those can be stored in several orders of magnitude less data. Once one recognizes that, it becomes immediately obvious that shuffling all that raw data into the cloud is pointless. It may soon actually be cheaper to re-sequence a sample than to transfer and store it for, say, five years.
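As a rough illustration of that trade-off (all sizes and prices here are placeholder assumptions, not quotes), this is the kind of comparison one would run to see where the crossover sits:

```
# Illustrative only: placeholder sizes and prices, not real figures.
RAW_BAM_GB = 150.0            # a 50x human genome BAM, roughly
VCF_GB = 0.15                 # variants only
STORAGE_PER_GB_MONTH = 0.02   # assumed cloud storage price, $/GB/month
RESEQUENCE_COST = 1000.0      # assumed cost of simply sequencing the sample again

months = 5 * 12  # keep the data for five years

raw_storage = RAW_BAM_GB * STORAGE_PER_GB_MONTH * months
vcf_storage = VCF_GB * STORAGE_PER_GB_MONTH * months

print(f"store raw BAM for 5 years: ${raw_storage:,.0f}")
print(f"store VCF for 5 years:     ${vcf_storage:,.2f}")
print(f"re-sequence the sample:    ${RESEQUENCE_COST:,.0f}")
```

As sequencing prices keep dropping while storage prices fall much more slowly, the re-sequencing line in that comparison keeps moving down.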
IMHO what Google should focus on is what Google is already best at - categorizing and searching information, not storing data. Plus we need to fundamentally move past the DNA-centric view when we look at the future of medicine. There is a lot more than the genome - gene expression and regulation matter more. What do we do when we have RNA sequencing data from every patient, we take multiple tissues from each patient, each tissue is sequenced after multiple treatments, and we need to track all of that over multiple time points? How do we make sense of that, drill down, compare and summarize the findings? That is the bottleneck, not storing big binary blobs.
Very good points from both of you, but I disagree on "stop storing the useless data". IMHO it is critically important to store the raw data, allowing reproducible science, re-validation of your findings, and of course updating (e.g. a new assembly, new software, etc.).
This is not achievable by re-sequencing a sample, because sequencing is a more or less stochastic process and you will not be able to obtain the exact same results.
Don't you think that it is exactly because sequencing is a stochastic process and biology is so complicated that we need more biological replicates and to redo the entire process many times, rather than reanalyzing the same dataset, which may be a biased measurement to begin with? The goal is to have the cost of sequencing drop so much and be so convenient that re-sequencing a sample is cheaper and more effective than storing data from it for unspecified periods of time.
This is touching on a pet peeve of mine - what is reproducibility? Producing the same numerical result with the exact same process, or producing the same conclusion with a different tool and approach? I strongly believe in the latter - bioinformatics is a field where discoveries are difficult but checking the validity of an observation is (or should be) very easy. Reproducibility should mean that regardless of how one found the result, once we know what it is, the authors also provide us with a much easier way to produce the same thing.
Hehe, I've never really thought about it like that before - you can have highly reproducible bioinformatics, but if other labs cannot come to the same conclusions when they repeat the experiment - is your data still reproducible? I guess not! The only reproducibility that matters is the reproducibility of some biological 'truth' from a biological system.
Having said that, it's nice to be able to double-check someone else's calculations, combine their data with your data to make a better analysis, and use existing data in unforeseen ways. But for that you don't really need raw data, you just need standardisation :)
Actually, I was referring to the numerical stability aspect. Should have added this. Sorry. But I totally agree with your thoughts about reproducibility in terms of converging to the same conclusion.
And Amazon charges pennies per gigabyte to upload or download, which quickly adds up to a significant cost for one-off projects like alignment or transcriptome quantification.
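As a rough sketch of how those transfer fees accumulate (the per-GB rate below is a placeholder assumption; check your provider's current pricing):

```
# Illustrative only: the per-GB rate is a placeholder, not AWS's actual price list.
EGRESS_PER_GB = 0.09     # assumed data-transfer-out price, $/GB
dataset_gb = 1000.0      # e.g. pulling 1 TB of BAMs back out after alignment

cost = dataset_gb * EGRESS_PER_GB
print(f"downloading {dataset_gb:.0f} GB once: ${cost:,.0f}")
```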
If you look at Google X Life Sciences on LinkedIn, you can see that some of the former (lead) developers of the GATK (DePristo, Carneiro, Poplin) now work for Google.
So I guess Google is planning to do more than just run GATK relatively unchanged on their infrastructure, and will also port, extend or further develop it? I'm looking forward to what will result from this endeavour.
Thanks for sharing such valuable information. I like it a lot; it's quite helpful. Keep posting such topics.