VCF and MongoDB
11.1 years ago
win ▴ 990

Hi all, I was wondering if anyone might have some code to read through a VCF and load it into a MongoDB database?

vcf • 6.1k views
11.1 years ago

UPDATE 2015:

The ExAC browser (http://exac.broadinstitute.org/) uses MongoDB.


Yes, I tried, and it was a bad experience: it's painful to update and, for my needs, it's faster to access the VCFs using tabix from a server, or to use a standard database to store the data: https://github.com/lindenb/jvarkit#vcf2sql

EDIT: I wrote a tool named vcf2xml. Your question was an opportunity to add an example with MongoDB:

java -jar vcf2xml.jar < file.vcf | xsltproc vcf2mongo.xslt - | mongo
(...)
> db.variants.find({"chrom":"chr1","start" :{ $gt: 10057, $lte: 10234, } })
{ "_id" : ObjectId("5267e19a7bc3eca84c83784d"), "chrom" : "chr1", "start" : 10121, "end" : 10121, "ref" : "A", "alt" : [  "C" ], "qual" : 111.08000000000001, "genotypes" : [     {     "sample" : "M128215",     "alleles" : [     "A",     "C" ] },     {     "sample" : "M10475",     "alleles" : [     "A",     "A" ] },     {     "sample" : "M10500",     "alleles" : [     "A",     "A" ] },     {     "sample" : "M10478",     "alleles" : [     "A",     "A" ] } ] }
{ "_id" : ObjectId("5267e19a7bc3eca84c83784e"), "chrom" : "chr1", "start" : 10177, "end" : 10177, "ref" : "A", "alt" : [  "C" ], "qual" : 163.46, "genotypes" : [     {     "sample" : "M128215",     "alleles" : [     "A",     "C" ] },     {     "sample" : "M10475", "alleles" : [     "A",     "C" ] },     {     "sample" : "M10500",     "alleles" : [     "A",     "A" ] },     {     "sample" : "M10478",     "alleles" : [     "A",     "C" ] } ] }
{ "_id" : ObjectId("5267e19a7bc3eca84c83784f"), "chrom" : "chr1", "start" : 10180, "end" : 10180, "ref" : "T", "alt" : [  "C" ], "qual" : 79.53, "genotypes" : [     {     "sample" : "M128215",     "alleles" : [     "T",     "T" ] },     {     "sample" : "M10475", "alleles" : [     "T",     "T" ] },     {     "sample" : "M10500",     "alleles" : [     "T",     "C" ] },     {     "sample" : "M10478",     "alleles" : [     "T",     "C" ] } ] }
{ "_id" : ObjectId("5267e19a7bc3eca84c837850"), "chrom" : "chr1", "start" : 10234, "end" : 10234, "ref" : "C", "alt" : [  "T" ], "qual" : 49.1, "genotypes" : [     {     "sample" : "M128215",     "alleles" : [     "C",     "C" ] },     {     "sample" : "M10475", "alleles" : [     "C",     "T" ] },     {     "sample" : "M10500",     "alleles" : [     "C",     "C" ] },     {     "sample" : "M10478",     "alleles" : [     "C",     "T" ] } ] }
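
For readers working from Python rather than the mongo shell, the same range query could be sketched with pymongo roughly as follows. This is only an illustration, not part of the vcf2xml tool: the collection name `variants` and the field names come from the output above, while the database name `vcf`, the connection URI, and the helper names `range_filter` and `find_variants` are assumptions.

```python
def range_filter(chrom, start_gt, start_lte):
    """Build the same filter as the shell query: start_gt < start <= start_lte."""
    return {"chrom": chrom, "start": {"$gt": start_gt, "$lte": start_lte}}


def find_variants(uri, chrom, start_gt, start_lte):
    """Run the query on a live server (needs pymongo and a running mongod)."""
    from pymongo import ASCENDING, MongoClient  # third-party, imported lazily

    client = MongoClient(uri)
    variants = client["vcf"]["variants"]  # database name "vcf" is an assumption
    # A compound index on (chrom, start) can spare the server a full scan.
    variants.create_index([("chrom", ASCENDING), ("start", ASCENDING)])
    return list(variants.find(range_filter(chrom, start_gt, start_lte)))


if __name__ == "__main__":
    # Only builds and prints the filter; no server is contacted here.
    print(range_filter("chr1", 10057, 10234))
```

Keeping the filter in its own function makes it easy to check without a server; `find_variants("mongodb://localhost:27017", "chr1", 10057, 10234)` would then be the call against a local instance.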
11.1 years ago
Neilfws 49k

This should be pretty straightforward; since VCF is tab-delimited you just parse it and use the column headers for keys.
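
A minimal sketch of that idea in Python, using only the standard library: skip the `##` meta lines, take the keys from the `#CHROM` header line, and turn each data line into a dict that a MongoDB driver could insert. The function name `vcf_to_documents` and the toy input are assumptions made for illustration; the INFO and genotype columns are left as raw strings here.

```python
import io


def vcf_to_documents(handle):
    """Yield one dict per data line, keyed by the names on the #CHROM line."""
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank lines and meta-information lines hold no columns
        if line.startswith("#"):
            header = line.lstrip("#").split("\t")
            continue
        if header is None:
            raise ValueError("data line seen before the #CHROM header line")
        doc = dict(zip(header, line.split("\t")))
        doc["POS"] = int(doc["POS"])        # numeric, so range queries work
        doc["ALT"] = doc["ALT"].split(",")  # multi-allelic sites list several ALTs
        yield doc


# Toy input (hypothetical, one sample column) to show the resulting shape.
EXAMPLE = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tM128215\n"
    "chr1\t10121\t.\tA\tC\t111.08\t.\t.\tGT\t0/1\n"
)
docs = list(vcf_to_documents(io.StringIO(EXAMPLE)))
```

Each resulting dict could be passed to a driver call such as pymongo's `insert_many`. How much further to parse INFO and the per-sample fields depends on what you intend to query.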

I have a Ruby example for CSV in this blog post which should be readily adaptable for TSV.

As Pierre says: first, figure out whether MongoDB is a good fit for your use case.

10.8 years ago
aniketd86 ▴ 150

I am testing MongoDB for a similar application. Here I use pymongo to load a VCF file into MongoDB. The script is very basic, but I would second Pierre that tabix is much simpler for querying VCF files. https://github.com/adeshpande/nosql

10.1 years ago
alex ▴ 250

It depends on scale: if you have a small amount of data, tabix is fine. If, on the other hand, you have terabytes of data to process, MongoDB really shines where tabix will get crushed. I would use pymongo with PyVCF and just load the files.
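
A rough sketch of what "just load them" might look like, under stated assumptions: records expose the PyVCF-style attributes `CHROM`, `POS`, `REF`, `ALT` and `QUAL`, and the target object has an `insert_many` method as pymongo collections do. The helper names `record_to_doc` and `load`, the document layout, and the batch size are illustrative choices, and the demo uses stand-in objects so it runs without either library.

```python
from itertools import islice
from types import SimpleNamespace


def record_to_doc(record):
    """Map a PyVCF-style record to a flat document (the layout is an assumption)."""
    return {
        "chrom": record.CHROM,
        "start": record.POS,
        "ref": record.REF,
        "alt": [str(a) for a in record.ALT],
        "qual": record.QUAL,
    }


def load(collection, records, batch_size=1000):
    """Insert records in batches; return the number of documents sent."""
    total = 0
    it = iter(records)
    while True:
        batch = [record_to_doc(r) for r in islice(it, batch_size)]
        if not batch:
            return total
        # Batching avoids one network round trip per variant.
        collection.insert_many(batch, ordered=False)
        total += len(batch)


class FakeCollection:
    """Stand-in for a pymongo collection so the sketch runs without a server."""

    def __init__(self):
        self.batches = []

    def insert_many(self, docs, ordered=True):
        self.batches.append(list(docs))


fake = FakeCollection()
records = [
    SimpleNamespace(CHROM="chr1", POS=10121 + i, REF="A", ALT=["C"], QUAL=50.0)
    for i in range(5)
]
sent = load(fake, records, batch_size=2)

# With the real libraries (not run here), usage would look roughly like:
#   import vcf                          # PyVCF
#   from pymongo import MongoClient
#   load(MongoClient()["vcf"]["variants"], vcf.Reader(filename="file.vcf"))
```

Streaming from an iterator keeps memory use bounded by the batch size, which matters more as the input grows.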


Would this simply be because it can index other fields?


Because you can aggregate and filter across many servers or an entire datacenter. A single machine is always going to be limited.
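
As a small illustration of "aggregate and filter", here is what an aggregation pipeline might look like when built from Python. The field names `chrom` and `qual` follow the documents shown earlier in the thread; the function name, the quality threshold, and the output field `n_variants` are assumptions. How the work is spread over servers is decided by the deployment, not by this code.

```python
def variants_per_chrom_pipeline(min_qual):
    """Count variants per chromosome, keeping only those with qual >= min_qual."""
    return [
        {"$match": {"qual": {"$gte": min_qual}}},                  # filter first
        {"$group": {"_id": "$chrom", "n_variants": {"$sum": 1}}},  # then count
        {"$sort": {"_id": 1}},                                     # stable order
    ]


pipeline = variants_per_chrom_pipeline(50)

# With pymongo and a running server (not run here):
#   for row in collection.aggregate(pipeline):
#       print(row["_id"], row["n_variants"])
```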


Do you have any test case for this?


We have our own dataset, but you can push in 1000 Genomes data and load it with pymongo and PyVCF. The schema is going to be a function of what data is important to you.
