Hi all, I was wondering if anyone might have some code to read through a VCF and load it into a MongoDB database?
UPDATE 2015:
The ExAC browser http://exac.broadinstitute.org/ uses MongoDB.
Yes, I tried.
Bad experience: it's painful to update and, for my needs, it's faster to access the VCFs using tabix from a server, or to use a standard database to store the data: https://github.com/lindenb/jvarkit#vcf2sql
EDIT: I wrote a tool named vcf2xml. Your question was an opportunity to add an example with MongoDB:
java -jar vcf2xml.jar < file.vcf | xsltproc vcf2mongo.xslt - | mongo
(...)
> db.variants.find({"chrom": "chr1", "start": {$gt: 10057, $lte: 10234}})
{ "_id" : ObjectId("5267e19a7bc3eca84c83784d"), "chrom" : "chr1", "start" : 10121, "end" : 10121, "ref" : "A", "alt" : [ "C" ], "qual" : 111.08000000000001, "genotypes" : [ { "sample" : "M128215", "alleles" : [ "A", "C" ] }, { "sample" : "M10475", "alleles" : [ "A", "A" ] }, { "sample" : "M10500", "alleles" : [ "A", "A" ] }, { "sample" : "M10478", "alleles" : [ "A", "A" ] } ] }
{ "_id" : ObjectId("5267e19a7bc3eca84c83784e"), "chrom" : "chr1", "start" : 10177, "end" : 10177, "ref" : "A", "alt" : [ "C" ], "qual" : 163.46, "genotypes" : [ { "sample" : "M128215", "alleles" : [ "A", "C" ] }, { "sample" : "M10475", "alleles" : [ "A", "C" ] }, { "sample" : "M10500", "alleles" : [ "A", "A" ] }, { "sample" : "M10478", "alleles" : [ "A", "C" ] } ] }
{ "_id" : ObjectId("5267e19a7bc3eca84c83784f"), "chrom" : "chr1", "start" : 10180, "end" : 10180, "ref" : "T", "alt" : [ "C" ], "qual" : 79.53, "genotypes" : [ { "sample" : "M128215", "alleles" : [ "T", "T" ] }, { "sample" : "M10475", "alleles" : [ "T", "T" ] }, { "sample" : "M10500", "alleles" : [ "T", "C" ] }, { "sample" : "M10478", "alleles" : [ "T", "C" ] } ] }
{ "_id" : ObjectId("5267e19a7bc3eca84c837850"), "chrom" : "chr1", "start" : 10234, "end" : 10234, "ref" : "C", "alt" : [ "T" ], "qual" : 49.1, "genotypes" : [ { "sample" : "M128215", "alleles" : [ "C", "C" ] }, { "sample" : "M10475", "alleles" : [ "C", "T" ] }, { "sample" : "M10500", "alleles" : [ "C", "C" ] }, { "sample" : "M10478", "alleles" : [ "C", "T" ] } ] }
This should be pretty straightforward; since VCF is tab-delimited you just parse it and use the column headers for keys.
I have a Ruby example for CSV in this blog post, which should be readily adaptable to TSV.
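The blog post is Ruby, but the idea translates directly. Here is a minimal Python sketch of the same approach (file, database, and collection names are placeholders): skip the ## meta lines, take the keys from the #CHROM header line, and insert one document per data row.

from pymongo import MongoClient

coll = MongoClient().test.variants        # placeholder db/collection

with open("file.vcf") as fh:
    header = None
    for line in fh:
        line = line.rstrip("\n")
        if line.startswith("##"):         # meta-information lines
            continue
        if line.startswith("#CHROM"):     # the column header line
            header = line.lstrip("#").split("\t")
            continue
        # one document per variant row, column headers as keys
        coll.insert_one(dict(zip(header, line.split("\t"))))

Note that everything lands as strings this way; you would still want to cast POS and QUAL to numbers if you plan to range-query them as in the example above.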
As Pierre says: first, figure out whether MongoDB is a good fit for your use case.
I am testing MongoDB for a similar application. Here I use pymongo to load a VCF file into MongoDB. The script is very basic, but I would second Pierre in that Tabix is much simpler for querying VCF files: https://github.com/adeshpande/nosql
It depends on scale: if you have a small amount of data, tabix is fine. If, on the other hand, you have terabytes of data that you need to process, MongoDB really shines where tabix will get crushed. I would use pymongo with PyVCF and just load them.
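A hedged sketch of that pymongo + PyVCF approach; the document shape mirrors Pierre's output above and is just one possible schema, and the file and collection names are placeholders:

import vcf                                 # PyVCF
from pymongo import MongoClient

coll = MongoClient().test.variants

for rec in vcf.Reader(filename="file.vcf"):
    coll.insert_one({
        "chrom": rec.CHROM,
        "start": rec.POS,
        "ref": rec.REF,
        "alt": [str(a) for a in rec.ALT],
        "qual": rec.QUAL,
        "genotypes": [
            {"sample": call.sample, "gt": call.gt_bases}   # e.g. "A/C"
            for call in rec.samples
        ],
    })

At terabyte scale you would batch the writes with insert_many rather than one insert per record.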
Would this simply be because it can index other fields?
Because you can aggregate and filter across many servers or an entire datacenter; a single machine is always going to be limited.
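For example, an aggregation like the following (a hypothetical per-chromosome count over the "variants" collection shown above) is dispatched to every shard in parallel on a sharded cluster:

from pymongo import MongoClient

coll = MongoClient().test.variants
pipeline = [
    {"$match": {"qual": {"$gte": 50}}},                 # filter step
    {"$group": {"_id": "$chrom", "n": {"$sum": 1}}},    # count per chromosome
]
for row in coll.aggregate(pipeline):
    print(row["_id"], row["n"])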
Do you have any test case for this?
We have our own dataset, but you can load in 1000 Genomes data and parse it with pymongo and PyVCF. The schema is going to be a function of what data is important to you.