There are several databases for annotating variants, such as dbNSFP, dbSNP and COSMIC. I was looking for a gene-based annotation database that tells me the effect of a variant, for example: intron, exon, splice_site_donor, missense... But I didn't find any database like that. Those fields depend on the gene/transcript database used, such as refGene, UCSC genes or ENCODE, and storing every possibility would produce a huge database.
So I assume annotators like UCSC, VEP or SnpEff compute those fields during the annotation process. Something like:
    def consequence(variant):
        for gene in refgene:
            if variant in gene:
                if variant in gene.exons:
                    return "exons"
                if variant in gene.introns:
                    return "introns"
So, what's the strategy for producing gene annotations with those fields: a database or live computation?
I rather doubt there's a "for gene in refgene" sort of loop. More likely, the variant region is padded with some reasonable amount of flanking sequence and that region is then queried against an interval tree or similar structure. The results can then be iterated over. Otherwise things would get really slow.
Thanks for your reply. That was just an example. My question is whether it uses a database or computes the result on the fly.
At least for snpEff, the methods section mentions the following:
That indicates to me that it's doing the actual annotation live.
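To make the "live" idea concrete, here is a minimal Python sketch of the two steps discussed in this thread: look up the transcripts overlapping a position, then classify the position against each transcript's exons. It is not how snpEff, VEP or any other tool is actually implemented. The names Transcript, TranscriptIndex and consequences, the coordinates, and the 2-base splice window are all assumptions made up for illustration, and a sorted list with bisect stands in for a real interval tree. Strand is ignored, so donor and acceptor sites are not distinguished.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """Hypothetical transcript model; coordinates are 0-based, half-open."""
    name: str
    start: int
    end: int
    exons: tuple  # sorted, non-overlapping (start, end) pairs


class TranscriptIndex:
    """Sorted-start index, a simple stand-in for an interval tree."""

    def __init__(self, transcripts):
        self.transcripts = sorted(transcripts, key=lambda t: t.start)
        self.starts = [t.start for t in self.transcripts]
        # The longest transcript bounds how far left an overlapping start can be.
        self.max_len = max((t.end - t.start for t in self.transcripts), default=0)

    def overlapping(self, pos):
        """Return the transcripts whose [start, end) span contains pos."""
        lo = bisect_right(self.starts, pos - self.max_len)
        hi = bisect_right(self.starts, pos)
        return [t for t in self.transcripts[lo:hi] if t.end > pos]


def consequences(pos, index, splice_window=2):
    """Map transcript name -> label for one position; empty dict = intergenic."""
    results = {}
    for t in index.overlapping(pos):
        label = "intron"
        for ex_start, ex_end in t.exons:
            if ex_start <= pos < ex_end:
                label = "exon"
                break
            # Intronic bases directly adjacent to an exon boundary.
            if ex_end <= pos < ex_end + splice_window or ex_start - splice_window <= pos < ex_start:
                label = "splice_site"
        results[t.name] = label
    return results


# Made-up transcripts, for illustration only.
index = TranscriptIndex([
    Transcript("TX1", 100, 400, ((100, 150), (200, 250), (300, 400))),
    Transcript("TX2", 220, 500, ((220, 260), (450, 500))),
])
print(consequences(230, index))
print(consequences(150, index))
```

Note that nothing per-variant is stored here: only the transcript models are kept, and the label is derived at query time, once per overlapping transcript. That is why the same position can get a different consequence in each transcript, and why precomputing every combination would be impractical.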