Dear colleagues,
I downloaded the reference genome GRCh38, version GCA_000001405.15_GRCh38 / seqs_for_alignment_pipelines.ucsc_ids, from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/ and used it for alignment and variant calling. I now want to annotate the variants with snpEff v5, but I did not find this version of the genome in the snpEff config file. The versions listed there are those from UCSC (http://hgdownload.cse.ucsc.edu) and NCBI (GRCh38.p13.RefSeq).
If anyone is familiar with snpEff, I would like to know how I can build a database for the version I used for variant calling (GCA_000001405.15_GRCh38 / seqs_for_alignment_pipelines.ucsc_ids). Unless I'm mistaken, I couldn't find annotation files for this version, which is the one recommended for alignment.
FYI: I used the UCSC version for the annotation, but I got messages like "WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS". As mentioned in the snpEff manual, this suggests that the original coordinates of the VCF file are not exactly the same as the coordinates used to calculate the variant annotation.
Thank you,
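For anyone landing here with the same question: a custom snpEff database is usually built by placing the reference FASTA and a matching GTF under data/<genome>/ in the snpEff directory, adding a "<genome>.genome" line to snpEff.config, and running the build command. Below is a minimal Python sketch of that layout, not a tested recipe for this particular assembly. The genome ID, the function name, and the paths are hypothetical; the sketch assumes uncompressed inputs and a GTF whose sequence names match the chr-prefixed names in the FASTA. It only returns the build command so it can be reviewed before running.

```python
from pathlib import Path
import shutil


def prepare_snpeff_genome(snpeff_dir, genome_id, fasta, gtf, description=None):
    """Lay out files for a custom snpEff genome and return the build command.

    All argument values are supplied by the caller; nothing here is specific
    to a particular assembly.
    """
    snpeff_dir = Path(snpeff_dir)
    genome_dir = snpeff_dir / "data" / genome_id
    genome_dir.mkdir(parents=True, exist_ok=True)

    # snpEff looks for these fixed file names inside data/<genome_id>/.
    shutil.copyfile(fasta, genome_dir / "sequences.fa")
    shutil.copyfile(gtf, genome_dir / "genes.gtf")

    # Register the genome in snpEff.config unless an entry already exists.
    config = snpeff_dir / "snpEff.config"
    entry = f"{genome_id}.genome : {description or genome_id}"
    existing = config.read_text() if config.exists() else ""
    if f"{genome_id}.genome" not in existing:
        with config.open("a") as handle:
            handle.write(f"\n{entry}\n")

    # Returned rather than executed, so the command can be inspected first.
    return [
        "java", "-jar", str(snpeff_dir / "snpEff.jar"),
        "build", "-gtf22", "-v", genome_id,
    ]
```

Depending on the snpEff release, the build step may also expect CDS or protein FASTA files for its consistency checks, so it is worth checking the build documentation for the version you run.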
Hi @vkkodali, I used GRCh38.p13.RefSeq to annotate a VCF file containing 6422 SNPs and 1300 indels. I got:
I just built the database from the *.gtf file available in the seqs_for_alignment_pipelines directory on the NCBI FTP site. I got the same number of warnings:
With the UCSC version (http://hgdownload.cse.ucsc.edu), however, I got far fewer warnings, only 92. I find that a little strange. If I use the UCSC version for the functional annotation, will it bias the prediction?
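To compare the databases on more than a single total, the warnings can be tallied by type from each annotated VCF. The following is a rough Python sketch, assuming the standard snpEff ANN field in the INFO column; the function name is made up for illustration. It counts each WARNING_ or ERROR_ tag once per variant line, so its totals may differ from the per-annotation counts in the snpEff summary.

```python
import re
from collections import Counter

# Matches tags such as WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS inside ANN.
TAG_PATTERN = re.compile(r"\b(?:WARNING|ERROR)_[A-Z0-9_]+")


def tally_snpeff_flags(vcf_lines):
    """Count variants carrying each WARNING_/ERROR_ tag in their ANN field."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        for item in fields[7].split(";"):
            if item.startswith("ANN="):
                # set() so a tag repeated across transcripts counts once
                counts.update(set(TAG_PATTERN.findall(item)))
    return counts
```

Running it on the output of each database (for example with `open("annotated.vcf")` as the argument, where the file name is a placeholder) shows which warning types account for the difference.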
It's kind of hard to say what's going on without digging into the details and looking at specific examples. I don't have enough experience with snpEff to hazard a guess.
That said, I expect that you will see a different set of results depending on which version of the annotation you use, since the number of transcripts in each of these GTF files may be quite different. In the case of UCSC, you may want to confirm that they include both the known (accessions with NM_/NR_ prefix) and the model (accessions with XM_/XR_ prefix) RefSeq transcripts.
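One way to check this is to count the distinct transcript accessions by prefix in each GTF. Here is a minimal Python sketch, assuming a standard nine-column GTF with a transcript_id attribute; the function name is only illustrative.

```python
import re
from collections import Counter

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')
REFSEQ_PREFIXES = {"NM_", "NR_", "XM_", "XR_"}


def count_refseq_prefixes(gtf_lines):
    """Count distinct transcript accessions per RefSeq prefix in a GTF."""
    seen = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header or comment line
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a complete GTF record
        match = TRANSCRIPT_ID.search(columns[8])
        if match:  # empty transcript_id values do not match
            seen.add(match.group(1))

    counts = Counter()
    for accession in seen:
        prefix = accession[:3]
        counts[prefix if prefix in REFSEQ_PREFIXES else "other"] += 1
    return counts
```

If the XM_ and XR_ counts are zero for one file and large for another, the difference in transcript sets alone could explain part of the gap in warning counts.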