I am new to variant annotation. I am reading about the different methods and using VEP to carry out structural annotation of called variants, i.e. the exons of transcripts affected by CNVs. I am using the LOFTEE and LoF plugins to detect high-confidence LoF variants, plus common annotations such as HGVS.
I was wondering how annotations like these can be validated.
I am creating a structural variation VCF file and a standard VCF file with fake CNVs and SNPs at specific locations to check that my pipeline does what I expect.
After a quick look, I have not found tools or methods in the literature for this kind of validation, maybe because I am new to this field.
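A minimal sketch of how such a test file could be built, in Python. Everything here is an assumption for illustration: the `FakeVariant` class, the variant IDs, the deletion coordinates, and the `N` placeholder bases (which would need to be replaced with the real reference bases for your assembly). The only coordinate taken from this thread is chrX:17,393,543, expected to fall in NHS.

```python
from dataclasses import dataclass


@dataclass
class FakeVariant:
    """One synthetic variant plus the gene we expect the annotator to report."""
    var_id: str         # hypothetical ID, used later to look up the expectation
    chrom: str
    pos: int            # 1-based position
    ref: str            # placeholder base; replace with the true reference base
    alt: str            # a base, or a symbolic allele such as <DEL>
    info: str = "."
    expected_gene: str = ""


# Hypothetical truth set: a small deletion spanning the position discussed
# in this thread, and a SNV at that position. Both should be reported as NHS.
TRUTH = [
    FakeVariant("fake_del_1", "chrX", 17393000, "N", "<DEL>",
                "SVTYPE=DEL;END=17394000", "NHS"),
    FakeVariant("fake_snv_1", "chrX", 17393543, "N", "A", ".", "NHS"),
]


def build_vcf(variants):
    """Return the lines of a minimal VCF containing the given variants."""
    lines = [
        "##fileformat=VCFv4.2",
        '##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">',
        '##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant">',
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    # Sort by chromosome and position, since most tools expect sorted input.
    for v in sorted(variants, key=lambda v: (v.chrom, v.pos)):
        lines.append("\t".join(
            [v.chrom, str(v.pos), v.var_id, v.ref, v.alt, ".", "PASS", v.info]
        ))
    return lines


vcf_text = "\n".join(build_vcf(TRUTH)) + "\n"
expected_genes = {v.var_id: v.expected_gene for v in TRUTH}
```

`vcf_text` can be written to a file and passed to the annotation pipeline, while `expected_genes` holds what each variant should be annotated as. Whether the contig should be named `chrX` or `X` depends on the reference used.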
What do you mean by "validation"?
A method to establish the validity or accuracy of the annotation.
Are you using multiple accounts? Please email the admin to have them merged unless you have good reason not to merge them.
Sorry, I didn't know this. It seems that one account was created with my work email and the second with my personal email. I will email the admin.
Please be more detailed - can you show us examples of accurate vs inaccurate annotations?
Taken from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2781113/
Two important papers examining genome annotation error in one and three small genomes respectively [4],[5] predicted that at least 8% of molecular function annotations were incorrect. Depending on the definition of function used, Devos and Valencia further suggested that misannotation levels could be as high as 37%. Other large scale [6] and anecdotal studies describe numerous examples of annotation error (see [7]–[11] for some examples). In a recent paper that modeled annotation error in the Gene Ontology database, it was estimated that up to 49% of computationally annotated sequences could be misannotated [12]. Considering the problem from a different perspective, models of error propagation have shown that with sufficient initial error in a database, error propagation can significantly degrade the quality of the annotations it contains [13],[14] and specific examples of error propagation have been noted [15],[16]. Although functional misannotation remains a significant concern [17],[18], an in depth analysis of the prevalence of annotation error in large public databases has yet to be performed.
A personal example:
Suppose I ask which gene is at location chrX:17,393,543 (VEP 99), and it is annotated as the DIABLO gene when it is actually the NHS gene. This is an error.
I still don't understand this 100% but annotation tools are largely dependent on accurate databases, and no database is 100% accurate, so I'm not sure how you'd validate unless you know exactly what error you're looking for.
Some errors are widespread (in fact, fixing one addresses a large family of similar errors), but some are not even traceable, and your DIABLO example is one of those. The current web-based VEP annotates it correctly as NHS. Your best bet might be to annotate with the latest annotation database.
Thanks Ram for your answer.
The reason I am asking is that I work in a clinical genomics lab and we need to validate the tools we use somehow. I am new in this lab and the only bioinformatician, and I don't know how to test or validate the annotation we use in order to calculate its accuracy or error rate.
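One way to put a number on this, sketched in Python: compare the gene symbol reported for each test variant against the gene you expected, and report the fraction that disagree. This assumes VEP's default tab-delimited output produced with `--symbol`, where the first column is the variant ID and the last ("Extra") column holds `key=value` pairs separated by `;`, including `SYMBOL=...`. The function names, variant IDs, and example rows are hypothetical, and the result only measures agreement with your own truth set, not the accuracy of the underlying database.

```python
def parse_vep_symbols(lines):
    """Collect the gene symbols reported for each variant ID.

    Assumes tab-delimited rows whose first field is the variant ID and whose
    last field is a ';'-separated list of key=value pairs containing SYMBOL.
    """
    observed = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        variant_id, extra = fields[0], fields[-1]
        for item in extra.split(";"):
            key, _, value = item.partition("=")
            if key == "SYMBOL":
                observed.setdefault(variant_id, set()).add(value)
    return observed


def score(expected, observed):
    """Compare expected gene per variant with the observed symbols."""
    mismatches = {
        vid: sorted(observed.get(vid, set()))
        for vid, gene in expected.items()
        if gene not in observed.get(vid, set())
    }
    n = len(expected)
    error_rate = len(mismatches) / n if n else 0.0
    return {
        "n": n,
        "accuracy": 1.0 - error_rate,
        "error_rate": error_rate,
        "mismatches": mismatches,  # variant ID -> what was reported instead
    }


# Hypothetical example: two test variants, both expected to be in NHS,
# one of which comes back annotated as DIABLO.
expected = {"fake_del_1": "NHS", "fake_snv_1": "NHS"}
example_rows = [
    "#Uploaded_variation\tLocation\tAllele\tGene\tFeature\tExtra",
    "fake_del_1\tchrX:17393000-17394000\tdeletion\t-\t-\tIMPACT=HIGH;SYMBOL=NHS",
    "fake_snv_1\tchrX:17393543\tA\t-\t-\tIMPACT=MODIFIER;SYMBOL=DIABLO",
]
report = score(expected, parse_vep_symbols(example_rows))
print(report)
```

A variant counts as correct if the expected gene appears among any of the symbols reported for it, since one variant can be annotated against several transcripts. The same comparison could be extended to other fields, such as consequence terms or HGVS strings, and rerun whenever the VEP version or cache changes.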