Forum: Arvados vs Big Data Genomics
10.0 years ago

http://bdgenomics.org/

https://arvados.org/

Has anyone tried one or both of these? Which is further along in terms of storage and retrieval of sequencing formats, especially variants?

bdg adam arvados • 5.0k views

Fresh input on the topic? Has anyone adopted the technologies?

9.7 years ago
tth ▴ 20

Hi Jeremy,

I am currently looking into both technologies, and they differ in what they provide.
Arvados is essentially a platform for data sharing and, more importantly, for tracking the provenance of derived or analyzed data. This is especially appealing if you work in a global organisation where data is distributed across several sites and de-duplication becomes an important topic. This is where Arvados seems to be a very valuable platform. Also, any pipeline that currently runs in an ordinary shell environment can be ported to Arvados much faster than to, e.g., a Cloudera platform.
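To make the de-duplication point concrete, here is a minimal sketch of content-addressed storage, the general technique usually behind this kind of de-duplication. It is only an illustration under my own assumptions: `ContentStore`, `put`, and `get` are hypothetical names, the choice of SHA-256 is arbitrary, and none of this is the Arvados API.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store (hypothetical, not the Arvados API).

    Each block is keyed by a hash of its bytes, so identical data
    uploaded from different sites is stored only once.
    """

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        # The key is derived from the content itself, not from a file name.
        key = hashlib.sha256(data).hexdigest()
        # setdefault keeps the existing block if the key is already present.
        self._blocks.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blocks[key]

    def __len__(self) -> int:
        return len(self._blocks)


store = ContentStore()
record = b"chr1\t12345\t.\tA\tG\n"

# Two sites upload the same bytes; the store ends up with one copy.
key_site_a = store.put(record)
key_site_b = store.put(record)
```

Because the key depends only on the bytes, the same key also serves as a stable identifier for the data, which is one way a provenance record could refer to its exact inputs.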

They make heavy use of Docker, which basically means virtualisation at the application level. They implement their own MapReduce stack, and this is where it gets tricky for me. I would like to use Spark and ADAM on, e.g., a Cloudera platform while having the ingenious data provenance and de-duplication features that Arvados provides, combined with the big and innovative community of Cloudera.

If someone in this forum can help with this question, it would be great to get some more insight. I think using ADAM would be feasible, but it is unclear to me whether it would be easy to run Spark on top of the data stored in Arvados. I am still learning and reading; if I find an answer, I will post it here.

Before you ask, I do not contribute to Arvados ;)


Hello, I'm on the Arvados team. Thank you for your insightful comment.

Regarding ADAM/Arvados integration, I can't share any specific plans right now, but this is something we are very interested in and hope to work on in the future.

We are also heavily involved in the Common Workflow Language effort to standardize how tools and workflows are described so that they are portable across different platforms. This will make the distinctions between underlying cluster technologies like Spark, YARN, and Arvados Crunch less relevant for day-to-day bioinformatics work that doesn't reach deep into the infrastructure layer.

10.0 years ago
WilliamS ▴ 320

I am trying ADAM / Spark / Big Data Genomics for storage, retrieval, and analysis of 1000 Genomes VCF data.

They are hoping to have a production release by the end of this year.

Arvados looks like it is building everything from scratch

https://arvados.org/projects/arvados/wiki/Technical_Architecture

while ADAM is building on general-purpose big data infrastructure like Spark / HDFS / Parquet / YARN. My bet would be on ADAM, also because Berkeley AMPLab, Broad and Mount Sinai are involved in the development.


Hello! It depends on what you're looking for with respect to variant storage and retrieval, but I would like to note that there is already a free hosted version of Arvados that people are welcome to evaluate (you can use any Google account to log in).

Disclaimer: I contribute to Arvados. We would definitely appreciate any feedback you have! :)
