I'm trying to annotate a set of SNPs with snpEff, and for some SNPs I'm getting annotations
like this one:
chr11 117163824 rs638405 C G EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Gat/Cat|D139H|BACE1|mRNA|CODING|NM_001207049|NM_001207049.ex.5),
NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Gat/Cat|D164H|BACE1|mRNA|CODING|NM_001207048|NM_001207048.ex.5),
STOP_LOST(HIGH|MISSENSE|tGa/tCa|*195S|BACE1|mRNA|CODING|NM_138973|NM_138973.ex.5),
STOP_LOST(HIGH|MISSENSE|tGa/tCa|*220S|BACE1|mRNA|CODING|NM_138971|NM_138971.ex.5),
STOP_LOST(HIGH|MISSENSE|tGa/tCa|*239S|BACE1|mRNA|CODING|NM_138972|NM_138972.ex.5),
STOP_LOST(HIGH|MISSENSE|tGa/tCa|*264S|BACE1|mRNA|CODING|NM_012104|NM_012104.ex.5)
The first two effects (non-synonymous coding) and the last four (stop lost) seem to refer to two different reading frames, shifted by one base relative to each other. Is this possible, and is it correct? Can two different transcripts really differ by a one-base reading-frame shift?
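One way to make the frame discrepancy explicit is to parse the EFF value and record which codon position carries the changed (uppercase) base for each transcript. The following is a minimal sketch, assuming the nine pipe-separated sub-fields seen in the output above; the field labels in `EFF_FIELDS` and both function names are my own and are not taken from snpEff.

```python
import re

# Labels are my own names for the nine sub-fields visible in the EFF output
# above; they are an assumption, not snpEff's official field names.
EFF_FIELDS = (
    "impact", "functional_class", "codon_change", "aa_change",
    "gene", "biotype", "coding", "transcript", "exon",
)

# Matches EFFECT_NAME(sub|fields|...) entries inside an EFF value.
EFF_PATTERN = re.compile(r"(\w+)\(([^)]*)\)")


def parse_eff(eff_value):
    """Split an EFF value into one dict per annotated transcript."""
    effects = []
    for effect, body in EFF_PATTERN.findall(eff_value):
        record = dict(zip(EFF_FIELDS, body.split("|")))
        record["effect"] = effect
        effects.append(record)
    return effects


def changed_codon_position(codon_change):
    """Return the 1-based codon position of the uppercase base, or None."""
    ref_codon = codon_change.split("/")[0]
    for position, base in enumerate(ref_codon, start=1):
        if base.isupper():
            return position
    return None


# Two of the six effects from the annotation above.
EXAMPLE = (
    "NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Gat/Cat|D139H|BACE1|mRNA|CODING|"
    "NM_001207049|NM_001207049.ex.5),"
    "STOP_LOST(HIGH|MISSENSE|tGa/tCa|*195S|BACE1|mRNA|CODING|"
    "NM_138973|NM_138973.ex.5)"
)

positions = {
    e["transcript"]: changed_codon_position(e["codon_change"])
    for e in parse_eff(EXAMPLE)
}
print(positions)
```

If the transcripts disagree on the codon position of the same genomic base, as they do here (position 1 versus position 2), the annotation places them in different reading frames.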
There was a problem with one of the releases of the hg19 databases, but I fixed it shortly after the release. If you downloaded the hg19 database during that period of time, chances are you have a corrupt database.
Just download the latest database version:
$ java -jar snpEff.jar download hg19
00:00:00.000 Downloading database for 'hg19'
00:00:00.003 Connecting to http://downloads.sourceforge.net/project/snpeff/databases//v2_0_5/snpEff_v2_0_5_hg19.zip
00:00:11.168 Copying file (type: application/octet-stream, modified on: Thu Jan 19 21:13:03 EST 2012)
00:00:11.169 Local file name: 'snpEff_v2_0_5_hg19.zip'
...
00:00:40.209 Unzip: OK
00:00:40.209 Done
Here is what I get:
$ java -Xmx10G -jar snpEff.jar eff -v -i txt hg19 ~/snpEff/test.txt -o txt | tee test.out.txt
...
11 117163824 C G SNP Hom BACE1.11 BACE1 mRNA NM_001207049 NM_001207049.ex.5 4 SYNONYMOUS_CODING V/V gtG/gtC 137 3 1131
11 117163824 C G SNP Hom BACE1.11 BACE1 mRNA NM_001207048 NM_001207048.ex.5 4 SYNONYMOUS_CODING V/V gtG/gtC 162 3 1206
11 117163824 C G SNP Hom BACE1.11 BACE1 mRNA NM_138971 NM_138971.ex.5 5 SYNONYMOUS_CODING V/V gtG/gtC 218 3 1374
11 117163824 C G SNP Hom BACE1.11 BACE1 mRNA NM_138973 NM_138973.ex.5 5 SYNONYMOUS_CODING V/V gtG/gtC 193 3 1299
11 117163824 C G SNP Hom BACE1.11 BACE1 mRNA NM_138972 NM_138972.ex.5 5 SYNONYMOUS_CODING V/V gtG/gtC 237 3 1431
11 117163824 C G SNP Hom BACE1.11 BACE1 mRNA NM_012104 NM_012104.ex.5 5 SYNONYMOUS_CODING V/V gtG/gtC 262 3 1506
What is particularly perplexing about this case is that even the (presumably ok) non-synonymous change isn't in the correct frame.
This SNP (rs638405) affects the third codon position of a (reverse-strand) GTG Valine residue of BACE1, and a C->G variant should be a synonymous change.
I'm guessing there is some issue with base-zero or base-one incompatibility of annotations at some point in the pipeline you're using.
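The three annotations seen in this thread (Gat/Cat, tGa/tCa and gtG/gtC) are what you get by placing the same base at codon position 1, 2 or 3. Here is a small sketch of that arithmetic using the standard genetic code. The five-base coding-strand context `GTGAT` is inferred from the codons reported above, not read from the reference genome, and `effects_by_frame` is a hypothetical helper name.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def effects_by_frame(context, index, alt):
    """Translate the variant with the changed base at codon position 1, 2 and 3.

    context: coding-strand sequence around the variant
    index:   0-based index of the variant base within context
    alt:     alternative base on the coding strand
    """
    results = {}
    for codon_pos in (1, 2, 3):
        start = index - (codon_pos - 1)
        ref_codon = context[start:start + 3]
        if start < 0 or len(ref_codon) < 3:
            continue  # not enough flanking sequence for this frame
        offset = codon_pos - 1
        alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
        results[codon_pos] = (
            ref_codon, alt_codon, CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
        )
    return results


# Assumed local context: BACE1 is on the reverse strand, so the genomic C>G
# change reads as G>C on the coding strand.
frames = effects_by_frame("GTGAT", 2, "C")
for codon_pos, (ref, alt, ref_aa, alt_aa) in sorted(frames.items()):
    print(codon_pos, ref, alt, ref_aa, alt_aa)
```

Only the frame with the variant at codon position 3 gives the synonymous V/V result; the other two frames reproduce the D-to-H and stop-to-S calls from the original annotation.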
More generally, such discordant annotations most often arise from annotated transcripts that are either of lower quality or represent rare transcript forms. For example, if an mRNA was mis-sequenced in a way that introduced a frameshift, the resulting misannotation may present an exon in the wrong frame. Alternatively, rare transcripts may retain (part of) an intron or use an alternative splice site, leading to a frameshift through part of the gene.
Remember that when doing this kind of analysis, the results are lists of hypotheses. :)
Thanks for your answer!
After I updated the hg19 database in my snpEff installation, the EFF field in my output VCF looks something like this:
EFF=SYNONYMOUS_CODING(LOW|SILENT|gtG/gtC|V137|BACE1|mRNA|CODING|NM_001207049|NM_001207049.ex.5)
So, I guess, it was a database issue, as Pablo suggested.
Best regards, AB