Hi!
I'm hoping to use SnpEff to annotate SNPs output from the GATK variant calling workflow performed on RNAseq data from sticklebacks. The custom database I built reports a high number of START and STOP codon warnings/errors. SnpEff runs to completion, but most of the SNPs are tagged with warnings.
I have built my own database using the 2020 stickleback genome and annotations from https://stickleback.genetics.uga.edu/ and I have converted the gff3 annotations to gtf using gffread.
I prepared the protein and CDS files from the gtf file using gffread and built my custom database according to the instructions on the snpEff website.
snpEff.config entry:
#Stickleback2020 genome, version stick_v5c
stick_v5c.genome : stick_v5c
stick_v5c.M.codonTable: Vertebrate_Mitochondrial
stick_v5c.codonTable: Standard
Build code:
java -Xmx20g -jar snpEff.jar build -gtf22 -v stick_v5c 2>&1 | tee stick_v5c.build
When I run snpEff dump I get the following output:
Genome stats :
#-----------------------------------------------
# Genome name : 'stick_v5c'
# Genome version : 'stick_v5c'
# Genome ID : 'stick_v5c[0]'
# Has protein coding info : true
# Has Tr. Support Level info : true
# Genes : 27582
# Protein coding genes : 27582
#-----------------------------------------------
# Transcripts : 27583
# Avg. transcripts per gene : 1.00
# TSL transcripts : 0
#-----------------------------------------------
# Checked transcripts :
# AA sequences : 27091 ( 98.22% )
# DNA sequences : 27574 ( 99.97% )
#-----------------------------------------------
# Protein coding transcripts : 27583
# Length errors : 496 ( 1.80% )
# STOP codons in CDS errors : 412 ( 1.49% )
# START codon errors : 10711 ( 38.83% )
# STOP codon warnings : 27085 ( 98.19% )
# UTR sequences : 20167 ( 73.11% )
# Total Errors : 10972 ( 39.78% )
#-----------------------------------------------
# Cds : 280342
# Exons : 287252
# Exons with sequence : 287252
# Exons without sequence : 0
# Avg. exons per transcript : 10.41
#-----------------------------------------------
# Number of chromosomes : 24
# Chromosomes : Format 'chromo_name size codon_table'
# 'IV' 34181212 Standard
# 'VII' 30776923 Standard
# 'I' 29619991 Standard
# 'II' 23686546 Standard
# 'IX' 20843631 Standard
# 'XIII' 20748428 Standard
# 'XII' 20694444 Standard
# 'XIX' 20580295 Standard
# 'VIII' 20553084 Standard
# 'XX' 20445003 Standard
# 'XVII' 20195758 Standard
# 'Un' 19879834 Standard
# 'XVI' 19507025 Standard
# 'VI' 18825451 Standard
# 'X' 17985176 Standard
# 'III' 17759012 Standard
# 'XI' 17651971 Standard
# 'XXI' 17421465 Standard
# 'XV' 17318724 Standard
# 'XIV' 16147532 Standard
# 'XVIII' 15939407 Standard
# 'Y' 15866398 Standard
# 'V' 15550311 Standard
# 'M' 15742 Vertebrate_Mitochondrial
#-----------------------------------------------
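For anyone wanting to cross-check the error classes in the dump above outside of snpEff, here is a rough Python sketch that scans a CDS FASTA (for example the gffread-derived CDS file) for a missing ATG start, a missing terminal stop, internal stops, and lengths that are not a multiple of 3. It assumes the standard genetic code only, so it is not appropriate for the 'M' contig, and it is not snpEff's own logic, whose checks may differ in detail. The function names are made up for illustration.

```python
from collections import Counter

# Assumption: standard genetic code only (not valid for mitochondrial genes).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks).upper()
            # Keep only the first whitespace-delimited token of the header.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks).upper()


def check_cds(seq):
    """Return a list of problem labels for one CDS sequence."""
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length_not_multiple_of_3")
    if not seq.startswith("ATG"):
        problems.append("no_start_codon")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no_terminal_stop")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal_stop")
    return problems


def summarize(lines):
    """Count how many sequences show each problem label."""
    counts = Counter()
    for _name, seq in read_fasta(lines):
        counts["total"] += 1
        counts.update(check_cds(seq))
    return counts
```

Calling `summarize(open("cds.fa"))` (the file name is an assumption) gives per-category counts that can be compared with the percentages in the dump. If nearly every sequence lacks a terminal stop, that points at how the CDS coordinates were written rather than at the genome sequence itself.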
And when I run my vcf files through snpEff, I end up with a large number of warnings:
WARNINGS: Some warning were detected
Warning type Number of warnings
WARNING_TRANSCRIPT_INCOMPLETE 13734
WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS 19061
WARNING_TRANSCRIPT_NO_START_CODON 198371
WARNING_TRANSCRIPT_NO_STOP_CODON 348881
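To see which transcripts these warnings attach to, rather than only the totals, the annotated VCF can be tallied directly. The sketch below is a minimal Python example that assumes the ANN INFO field layout in which the last `|`-separated subfield holds error/warning/info codes joined by `&`; the function names are hypothetical, and the counts need not match the snpEff summary exactly because this counts once per annotation entry.

```python
from collections import Counter


def ann_warnings(info):
    """Yield WARNING_/ERROR_ codes from the ANN entries of one INFO string."""
    for field in info.split(";"):
        if not field.startswith("ANN="):
            continue
        # One comma-separated entry per (allele, transcript) annotation.
        for annotation in field[len("ANN="):].split(","):
            # Assumption: the last '|' subfield carries the codes, '&'-joined.
            last = annotation.split("|")[-1].strip()
            for code in last.split("&"):
                if code.startswith(("WARNING_", "ERROR_")):
                    yield code


def tally_vcf(lines):
    """Count warning/error codes over all data lines of an annotated VCF."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 8:  # INFO is the 8th VCF column
            counts.update(ann_warnings(cols[7]))
    return counts
```

For example, `tally_vcf(open("annotated.vcf"))` (file name assumed) returns a `Counter` keyed by warning code.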
The snpEff GitHub documentation does not go into much detail about these warnings, but suggests they may be due to issues with the annotation file format. I have noticed that the example annotation input they provide includes separate entries for the CDS and for the START/STOP codons, while my annotation file has CDS, exon and transcript entries but no START/STOP codon entries. From what I have read, the format of my annotation file seems fairly standard, since the CDS is considered to implicitly contain the START/STOP codons, so I'm not sure whether this is related.
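A quick way to confirm which feature types the converted file actually contains is to count the third column of the GTF. This is only a minimal sketch: it assumes a tab-delimited, 9-column GTF, and the function name is made up.

```python
from collections import Counter


def count_feature_types(lines):
    """Count the feature type (3rd column) of each record in a GTF file."""
    counts = Counter()
    for line in lines:
        # Skip comment/header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 9:
            counts[cols[2]] += 1
    return counts
```

Running `count_feature_types(open("genes.gtf"))` (file name assumed) and looking at the `start_codon` and `stop_codon` keys shows whether those features are present at all, which helps when comparing the file against the example input.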
I would be very grateful for any advice on what may be causing these errors or possible leads on how to get around it! Please let me know if there is any additional information I can provide!
Thanks!
Hi, have you found a solution for this issue?