Hi all, I was wondering if anyone might have some code to read through a VCF and load it into a MongoDB database?
UPDATE 2015:
The ExAC browser http://exac.broadinstitute.org/ uses MongoDB.
Yes, I tried.
Bad experience: it's painful to update and, for my needs, it's faster to access the VCFs using tabix from a server, or to use a standard database to store the data: https://github.com/lindenb/jvarkit#vcf2sql
EDIT: I wrote a tool named vcf2xml. Your question was an opportunity to add an example with MongoDB:
java -jar vcf2xml.jar < file.vcf | xsltproc vcf2mongo.xslt - | mongo
(...)
> db.variants.find({"chrom": "chr1", "start": {$gt: 10057, $lte: 10234}})
{ "_id" : ObjectId("5267e19a7bc3eca84c83784d"), "chrom" : "chr1", "start" : 10121, "end" : 10121, "ref" : "A", "alt" : [ "C" ], "qual" : 111.08000000000001, "genotypes" : [ { "sample" : "M128215", "alleles" : [ "A", "C" ] }, { "sample" : "M10475", "alleles" : [ "A", "A" ] }, { "sample" : "M10500", "alleles" : [ "A", "A" ] }, { "sample" : "M10478", "alleles" : [ "A", "A" ] } ] }
{ "_id" : ObjectId("5267e19a7bc3eca84c83784e"), "chrom" : "chr1", "start" : 10177, "end" : 10177, "ref" : "A", "alt" : [ "C" ], "qual" : 163.46, "genotypes" : [ { "sample" : "M128215", "alleles" : [ "A", "C" ] }, { "sample" : "M10475", "alleles" : [ "A", "C" ] }, { "sample" : "M10500", "alleles" : [ "A", "A" ] }, { "sample" : "M10478", "alleles" : [ "A", "C" ] } ] }
{ "_id" : ObjectId("5267e19a7bc3eca84c83784f"), "chrom" : "chr1", "start" : 10180, "end" : 10180, "ref" : "T", "alt" : [ "C" ], "qual" : 79.53, "genotypes" : [ { "sample" : "M128215", "alleles" : [ "T", "T" ] }, { "sample" : "M10475", "alleles" : [ "T", "T" ] }, { "sample" : "M10500", "alleles" : [ "T", "C" ] }, { "sample" : "M10478", "alleles" : [ "T", "C" ] } ] }
{ "_id" : ObjectId("5267e19a7bc3eca84c837850"), "chrom" : "chr1", "start" : 10234, "end" : 10234, "ref" : "C", "alt" : [ "T" ], "qual" : 49.1, "genotypes" : [ { "sample" : "M128215", "alleles" : [ "C", "C" ] }, { "sample" : "M10475", "alleles" : [ "C", "T" ] }, { "sample" : "M10500", "alleles" : [ "C", "C" ] }, { "sample" : "M10478", "alleles" : [ "C", "T" ] } ] }
This should be pretty straightforward; since VCF is tab-delimited you just parse it and use the column headers for keys.
I have a Ruby example for CSV in this blog post, which should be readily adaptable to TSV.
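The blog post is Ruby, but the idea translates directly. Here is a minimal Python sketch of the same approach (file, database, and collection names are placeholders): skip the ## meta lines, take the keys from the #CHROM header line, and insert one document per data row.

from pymongo import MongoClient

coll = MongoClient().test.variants        # placeholder db/collection

with open("file.vcf") as fh:
    header = None
    for line in fh:
        line = line.rstrip("\n")
        if line.startswith("##"):         # meta-information lines
            continue
        if line.startswith("#CHROM"):     # the column header line
            header = line.lstrip("#").split("\t")
            continue
        # one document per variant row, column headers as keys
        coll.insert_one(dict(zip(header, line.split("\t"))))

Note that everything lands as strings this way; you would still want to cast POS and QUAL to numbers if you plan to range-query them as in the example above.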
As Pierre says: first, figure out whether MongoDB is a good fit for your use case.
I am testing MongoDB for a similar application. Here I use pymongo to load a VCF file into MongoDB. The script is very basic, but I would second Pierre in that Tabix is much simpler for querying VCF files: https://github.com/adeshpande/nosql
It depends on scale: if you have a small amount of data, tabix is fine. If, on the other hand, you have terabytes of data that you need to process, MongoDB really shines where tabix will get crushed. I would use pymongo with PyVCF and just load them.
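A hedged sketch of that pymongo + PyVCF approach; the document shape mirrors Pierre's output above and is just one possible schema, and the file and collection names are placeholders:

import vcf                                 # PyVCF
from pymongo import MongoClient

coll = MongoClient().test.variants

for rec in vcf.Reader(filename="file.vcf"):
    coll.insert_one({
        "chrom": rec.CHROM,
        "start": rec.POS,
        "ref": rec.REF,
        "alt": [str(a) for a in rec.ALT],
        "qual": rec.QUAL,
        "genotypes": [
            {"sample": call.sample, "gt": call.gt_bases}   # e.g. "A/C"
            for call in rec.samples
        ],
    })

At terabyte scale you would batch the writes with insert_many rather than one insert per record.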
Would this simply be because it can index other fields?
Because you can aggregate and filter across many servers or an entire datacenter; a single machine is always going to be limited.
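For example, an aggregation like the following (a hypothetical per-chromosome count over the "variants" collection shown above) is dispatched to every shard in parallel on a sharded cluster:

from pymongo import MongoClient

coll = MongoClient().test.variants
pipeline = [
    {"$match": {"qual": {"$gte": 50}}},                 # filter step
    {"$group": {"_id": "$chrom", "n": {"$sum": 1}}},    # count per chromosome
]
for row in coll.aggregate(pipeline):
    print(row["_id"], row["n"])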
Do you have any test case for this?
We have our own dataset, but you can load in 1000 Genomes data and parse it with pymongo and PyVCF. The schema is going to be a function of what data is important to you.