Question

Why Doesn't ANNOVAR With dbSNP 129 Filter Anything?

12.2 years ago

mark.dunning ▴ 230

Hi all,

I was using ANNOVAR with dbSNP132 to remove common variants from my list of potentially interesting variants. However, I found that 99% of my variants were present in dbSNP132, which is too many to remove from the analysis.

After reading some comments about dbSNP129 being more widely used, I switched to annotating with this database. I downloaded the hg19 version of dbSNP129 from the ANNOVAR website. However, according to ANNOVAR, none of my variants are found in dbSNP129. Can this really be true?

Here is the annovar run with dbsnp132

[dunnin01@uk-cri-lcst01 annovar]$ ./annotate_variation.pl --buildver hg19 -filter -dbtype  snp132 myvars.txt humandb/
NOTICE: Variants matching filtering criteria are written to myvars.txt.hg19_snp132_dropped, other variants are written to myvars.txt.hg19_snp132_filtered
NOTICE: Processing next batch with 36064 unique variants in 36079 input lines
NOTICE: Scanning filter database humandb/hg19_snp132.txt...Done
NOTICE: Variants with invalid input format were written to myvars.txt.invalid_input
[dunnin01@uk-cri-lcst01 annovar]$ wc -l myvars.txt.hg19_snp132_dropped
35445 myvars.txt.hg19_snp132_dropped

And with dbsnp129

[dunnin01@uk-cri-lcst01 annovar]$ ./annotate_variation.pl --buildver hg19 -filter -dbtype  snp129 myvars.txt humandb/
NOTICE: Variants matching filtering criteria are written to myvars.txt.hg19_snp129_dropped, other variants are written to myvars.txt.hg19_snp129_filtered
NOTICE: Processing next batch with 36064 unique variants in 36079 input lines
NOTICE: Database index loaded. Total number of bins is 2699057 and  number of bins to be scanned is 0
NOTICE: Scanning filter database humandb/hg19_snp129.txt...Done
NOTICE: Variants with invalid input format were written to myvars.txt.invalid_input
[dunnin01@uk-cri-lcst01 annovar]$ wc -l myvars.txt.hg19_snp129_dropped
0 myvars.txt.hg19_snp129_dropped

annovar dbsnp • 4.1k views

score 0 · Answer 1 · 2013-03-05

Have you checked the sizes of your ANNOVAR snp129 and snp132 database files? Also, what happens if you run a variant known to be in dbSNP129?
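A minimal sketch of that sanity check, assuming the database files are plain tab-delimited text and that the index sits next to the database file with an `.idx` suffix (both are assumptions here, as is the hypothetical helper name `summarize_db`). The `chrom_col` parameter exists because the column holding the chromosome can differ between files; adjust it to match yours.

```python
import os
from collections import Counter


def summarize_db(path, chrom_col=0, max_lines=None):
    """Return basic sanity statistics for a tab-delimited database file.

    chrom_col is the 0-based column assumed to hold the chromosome name.
    max_lines optionally limits how many lines are read from the top.
    """
    chrom_counts = Counter()
    n_lines = 0
    with open(path) as handle:
        for line in handle:
            if max_lines is not None and n_lines >= max_lines:
                break
            fields = line.rstrip("\n").split("\t")
            if len(fields) > chrom_col:
                chrom_counts[fields[chrom_col]] += 1
            n_lines += 1
    return {
        "size_bytes": os.path.getsize(path),
        "lines_read": n_lines,
        # Assumption: the index file is named <database file>.idx
        "has_index": os.path.exists(path + ".idx"),
        "chromosomes": dict(chrom_counts),
    }


if __name__ == "__main__":
    # Paths taken from the question; they are skipped if absent.
    for name in ("humandb/hg19_snp129.txt", "humandb/hg19_snp132.txt"):
        if os.path.exists(name):
            print(name, summarize_db(name, max_lines=100000))
```

If one file is far smaller than expected, or its chromosome labels look different from those in `myvars.txt` (for example `chr1` versus `1`), that would be a good place to start looking.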

You could also try a subset of the latest dbSNP137. I currently use snp137NonFlagged, which contains SNPs with > 1% minor allele frequency that map only once to the reference assembly and are not flagged as "clinically associated". You can download it from ANNOVAR. My hg19_snp137NonFlagged.txt file has ~55M lines.

If you are mainly interested in exon variants, you could additionally check against the SNVs of the recent 6,500 exomes study (easy to get from ANNOVAR, database files esp6500*).

score 0 · Answer 2 · 2013-03-05

This definitely looks like an error with the snp129 file. As I remember, snp132 is around 3 times bigger than snp129, so the overlap should be substantial, and it would make no sense to see a 99% match with snp132 while the same variants do not match snp129 at all. Possible errors: using the hg18 version of snp129, or a wrongly downloaded or corrupted file (and/or index file). All of them would imply checking that snp129 file again, so make sure you download it again through

annotate_variation.pl -downdb -buildver hg19 -webfrom annovar snp129 humandb

and that both the file and the index downloaded fine (in fact the "Database index loaded" message seems to indicate that you are not using an index file, which should be there if you download the database file through the command above).

As a side comment, I would stress what Christof said and ask why exactly you would want to filter your data using dbsnp129. When NGS started, dbsnp129 was used as the gold standard for variation: all its variants were considered strictly polymorphic, so everything not contained in it was considered novel. dbsnp130 introduced the first batch of SNPs from 1000genomes, and the rules started to change, because rare variants were starting to populate dbsnp. Over the years this "bias" has increased, so one should no longer filter variants with dbsnp129 in search of novelty, but rather use allele frequencies (1000genomes, 6500exomes, ...) in search of rareness.

The snp137NonFlagged table mentioned above is the best approximation you can get from dbsnp, since what it offers is a source of pure variation: variants that have been reported with at least 1% population frequency and that are not associated with any pathology. This does not mean that all the rest are rare or pathogenic; it just means that what has been recorded is what we know right now, and that the purity of that table should increase with time, so in my honest opinion this is the way to go. In fact, we annotate all our variants with snp137NonFlagged, 1000genomes, 6500exomes, and some others too.
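To make the "filter on rareness, not novelty" idea concrete, here is a small sketch. It is not how ANNOVAR implements filtering; the function name `split_by_frequency`, the variant keys, and the 1% default threshold are illustrative assumptions. Variants with no recorded frequency are kept as potentially rare.

```python
def split_by_frequency(variants, frequencies, max_freq=0.01):
    """Split variants into (rare, common) using population allele frequency.

    variants    : iterable of hashable variant keys, e.g. (chrom, pos, ref, alt)
    frequencies : dict mapping variant key -> allele frequency (0..1)
    max_freq    : variants at or above this frequency are treated as common
    """
    rare, common = [], []
    for variant in variants:
        freq = frequencies.get(variant)
        if freq is not None and freq >= max_freq:
            common.append(variant)
        else:
            # Unknown frequency or below the threshold: keep for follow-up.
            rare.append(variant)
    return rare, common


if __name__ == "__main__":
    # Hypothetical variants and frequencies, for illustration only.
    calls = [("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A")]
    freqs = {("1", 100, "A", "G"): 0.25, ("1", 200, "C", "T"): 0.001}
    rare, common = split_by_frequency(calls, freqs)
    print("rare:", rare)
    print("common:", common)
```

The contrast with a plain presence/absence filter is that a variant listed in the database with a very low frequency stays in the candidate list instead of being dropped.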

score 0 · Answer 3 · 2013-03-05

12.1 years ago

Sebastian ▴ 10

dbSNP129 was initially annotated to hg18, maybe there is still a problem of coordinates after lift-over? Did you check the variants in v129 and their positions in v132 (as they usually should be annotated in both)? Are there differences due to 0- or 1-based coordinates?
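One way to probe these questions is to count how many input positions hit the database under small coordinate shifts, after normalising chromosome names. The sketch below assumes you have already extracted `(chromosome, position)` pairs from both files; the helper names are hypothetical.

```python
def normalize_chrom(name):
    """Strip a leading 'chr' so that '1' and 'chr1' compare equal."""
    return name[3:] if name.lower().startswith("chr") else name


def count_matches_by_offset(query, database, offsets=(-1, 0, 1)):
    """Count query (chrom, pos) pairs found in database after shifting pos.

    Returns a dict mapping each offset to the number of query positions
    that match a database position once the offset is added.
    """
    db = {(normalize_chrom(chrom), pos) for chrom, pos in database}
    result = {}
    for offset in offsets:
        result[offset] = sum(
            1 for chrom, pos in query
            if (normalize_chrom(chrom), pos + offset) in db
        )
    return result


if __name__ == "__main__":
    # Hypothetical positions, for illustration only.
    my_variants = [("chr1", 100), ("chr1", 200)]
    db_entries = [("1", 101), ("1", 201), ("2", 100)]
    print(count_matches_by_offset(my_variants, db_entries))
```

If almost all matches appear at an offset of +1 or -1 rather than 0, a 0-based versus 1-based mismatch is the likely cause. If nothing matches at any offset, a wrong genome build or a lift-over problem becomes more plausible.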

score 0 · Answer 4 · 2013-03-06

12.1 years ago

mark.dunning ▴ 230

Thanks all for your help and advice. It sounds like there may be an issue with the hg19 download of the database, which I will follow up with the ANNOVAR people. I'll definitely check out the 6500 exomes dataset too.

Cheers,

Mark
