Question

Forum: Arvados vs Big Data Genomics

3

10.4 years ago

Jeremy Leipzig 23k

Has anyone tried one or both of these? Which is further along in terms of storage and retrieval of sequencing formats, especially variants?

bdg adam arvados • 5.2k views

updated 2.1 years ago by Ram 45k • written 10.4 years ago by Jeremy Leipzig 23k

0

Any fresh input on the topic? Has anyone adopted these technologies?

8.2 years ago by podro • 0

score 2 · Answer 1 · 2015-03-05

Hi Jeremy,

I am currently looking into both technologies, and they differ in what they provide.
Arvados is basically designed as a platform for data sharing and, more importantly, for tracking the provenance of derived or analyzed data. This is especially appealing if you work in a global organisation where data is distributed across several sites and de-duplication therefore becomes an important topic. This is where Arvados seems to be a very valuable platform. In addition, any pipeline that currently runs in a regular shell environment can be ported to Arvados much faster than to, e.g., a Cloudera platform.
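To make the de-duplication point a bit more concrete, here is a minimal sketch of content-addressed storage, which is the general technique usually behind this kind of de-duplication. It is not Arvados code and does not use any Arvados API; the `ContentStore` class, its methods, and the sample records are all hypothetical and only meant to illustrate the idea.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        self.blocks = {}  # maps hex digest -> stored bytes

    def put(self, data: bytes) -> str:
        # The key is derived from the content itself.
        digest = hashlib.sha256(data).hexdigest()
        # Identical content yields the same key, so it is kept only once.
        self.blocks.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self.blocks[digest]


store = ContentStore()
# Two sites upload the same bytes; a third upload differs.
site_a = store.put(b"chr1\t12345\tvarA\tA\tG\n")
site_b = store.put(b"chr1\t12345\tvarA\tA\tG\n")
other = store.put(b"chr2\t999\tvarB\tC\tT\n")

print(site_a == site_b)   # same content, same address
print(len(store.blocks))  # number of distinct blocks actually stored
```

Because the address is a function of the content, the same digest can also be recorded as the input of a derived result, which is one simple way to think about provenance.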

They make heavy use of Docker, which basically means virtualisation at the application level. They implement their own MapReduce stack, and this is where it gets tricky for me. I would like to use Spark and ADAM on, e.g., a Cloudera platform, combining the ingenious data provenance and de-duplication features that Arvados provides with the big and innovative community around Cloudera.

If someone in this forum can help with this question, it would be great to get some more insight. I think using ADAM would be feasible, but it is unclear to me whether it would be easy to run Spark on top of the data stored in Arvados. I am still learning and reading; if I find an answer, I will post it here.

Before you ask, I do not contribute to Arvados ;)

Ram · Answer 2 · 2014-11-16

1

10.4 years ago

WilliamS ▴ 320

I am trying ADAM / Spark / Big Data Genomics for storage, retrieval, and analysis of 1000 Genomes VCF data.

They are hoping to have a production release by the end of this year.

Arvados looks like it is building everything from scratch:

https://arvados.org/projects/arvados/wiki/Technical_Architecture

while ADAM is building on general-purpose big data infrastructure like Spark / HDFS / Parquet / YARN. My bet would be on ADAM, also because Berkeley AMPLab, Broad and Mount Sinai are involved in the development.
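For readers wondering why a columnar format like Parquet is relevant for variants, here is a rough sketch of the layout idea in plain Python. It does not use Parquet, Spark, or the ADAM schema; the field names, the `to_columns` helper, and the three records are made up purely for illustration.

```python
from collections import defaultdict

# Made-up, VCF-like rows (tab-separated): chrom, pos, id, ref, alt.
ROWS = [
    "1\t10177\tvar1\tA\tAC",
    "1\t10352\tvar2\tT\tTA",
    "2\t10616\tvar3\tC\tG",
]
FIELDS = ["chrom", "pos", "id", "ref", "alt"]  # hypothetical column names


def to_columns(rows):
    """Pivot row-oriented records into one list per field."""
    columns = defaultdict(list)
    for row in rows:
        for name, value in zip(FIELDS, row.split("\t")):
            # Store positions as integers, everything else as text.
            columns[name].append(int(value) if name == "pos" else value)
    return dict(columns)


columns = to_columns(ROWS)

# A query such as "positions on chromosome 1" touches only two columns;
# the id, ref and alt columns are never read.
hits = [
    pos
    for chrom, pos in zip(columns["chrom"], columns["pos"])
    if chrom == "1"
]
print(hits)
```

In a real columnar file each column is stored and compressed separately, so queries that need only a few fields can skip the rest of the data on disk.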

updated 3.2 years ago by Ram 45k • written 10.4 years ago by WilliamS ▴ 320

0

Hello! It depends on what you're looking for with respect to variant storage and retrieval, but I would like to note that there is already a free hosted version of Arvados that people are welcome to evaluate (you can use any Google account to log in).

Disclaimer: I contribute to Arvados. We would definitely appreciate any feedback you have! :)

10.3 years ago by Nancy Ouyang ▴ 170