Question

snpEff custom database - high percentage of start/stop codon errors

2.0 years ago

Bismuth310 ▴ 10

Hi!

I'm hoping to use SnpEff to annotate SNPs output by the GATK variant calling workflow, run on RNA-seq data from sticklebacks. My problem is that the custom database I built produces a high number of START and STOP codon warnings/errors. SnpEff runs to completion, but most of the SNPs are tagged with warnings.

I built my own database from the 2020 stickleback genome and annotations at https://stickleback.genetics.uga.edu/, after converting the GFF3 annotations to GTF with gffread.

I prepared the protein and CDS files from the GTF file using gffread and built my custom database according to the instructions on the snpEff website:

snpEff.config entry:

# Stickleback 2020 genome, version stick_v5c
stick_v5c.genome : stick_v5c
stick_v5c.M.codonTable: Vertebrate_Mitochondrial
stick_v5c.codonTable: Standard

Build code:

java -Xmx20g -jar snpEff.jar build -gtf22 -v stick_v5c 2>&1 | tee stick_v5c.build

When I run snpEff dump, I get the following output:

    Genome stats :
#-----------------------------------------------
# Genome name                : 'stick_v5c'
# Genome version             : 'stick_v5c'
# Genome ID                  : 'stick_v5c[0]'
# Has protein coding info    : true
# Has Tr. Support Level info : true
# Genes                      : 27582
# Protein coding genes       : 27582
#-----------------------------------------------
# Transcripts                : 27583
# Avg. transcripts per gene  : 1.00
# TSL transcripts            : 0
#-----------------------------------------------
# Checked transcripts        : 
#               AA sequences :  27091 ( 98.22% )
#              DNA sequences :  27574 ( 99.97% )
#-----------------------------------------------
# Protein coding transcripts : 27583
#              Length errors :    496 ( 1.80% )
#  STOP codons in CDS errors :    412 ( 1.49% )
#         START codon errors :  10711 ( 38.83% )
#        STOP codon warnings :  27085 ( 98.19% )
#              UTR sequences :  20167 ( 73.11% )
#               Total Errors :  10972 ( 39.78% )
#-----------------------------------------------
# Cds                        : 280342
# Exons                      : 287252
# Exons with sequence        : 287252
# Exons without sequence     : 0
# Avg. exons per transcript  : 10.41
#-----------------------------------------------
# Number of chromosomes      : 24
# Chromosomes                : Format 'chromo_name size codon_table'
#       'IV'    34181212    Standard
#       'VII'   30776923    Standard
#       'I'     29619991    Standard
#       'II'    23686546    Standard
#       'IX'    20843631    Standard
#       'XIII'  20748428    Standard
#       'XII'   20694444    Standard
#       'XIX'   20580295    Standard
#       'VIII'  20553084    Standard
#       'XX'    20445003    Standard
#       'XVII'  20195758    Standard
#       'Un'    19879834    Standard
#       'XVI'   19507025    Standard
#       'VI'    18825451    Standard
#       'X'     17985176    Standard
#       'III'   17759012    Standard
#       'XI'    17651971    Standard
#       'XXI'   17421465    Standard
#       'XV'    17318724    Standard
#       'XIV'   16147532    Standard
#       'XVIII' 15939407    Standard
#       'Y' 15866398    Standard
#       'V' 15550311    Standard
#       'M'     15742       Vertebrate_Mitochondrial
#-----------------------------------------------
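One way to narrow down the START/STOP codon counts above is to look at the CDS sequences directly. The following is a minimal sketch, not snpEff's own check: it assumes the Standard codon table (so it would not be appropriate for chromosome 'M'), assumes ATG as the only start codon, and uses a small made-up FASTA in place of the real CDS file. The function names `read_fasta` and `check_cds` are hypothetical.

```python
import io

# Stop codons of the Standard genetic code (assumption: not valid for 'M').
STOP_CODONS = {"TAA", "TAG", "TGA"}

def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def check_cds(seq):
    """Return a list of problems found in one CDS sequence."""
    seq = seq.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length_not_multiple_of_3")
    if not seq.startswith("ATG"):
        problems.append("no_start_codon")
    if seq[-3:] not in STOP_CODONS:
        problems.append("no_stop_codon")
    # Every codon except the last one should be free of stop codons.
    internal = [seq[i:i + 3] for i in range(0, len(seq) - 3, 3)]
    if any(codon in STOP_CODONS for codon in internal):
        problems.append("internal_stop_codon")
    return problems

# Made-up example records; in practice, open the CDS FASTA written by gffread.
example = io.StringIO(
    ">tx1\nATGGCCTAA\n"
    ">tx2\nGCCGCCTAA\n"
    ">tx3\nATGGCCGCC\n"
    ">tx4\nATGTAAGCCTGA\n"
)
for name, seq in read_fasta(example):
    print(name, check_cds(seq) or "ok")
```

If nearly every transcript reports a missing stop codon while the starts look fine, that would point at how the CDS coordinates are defined in the GTF rather than at the genome sequence itself.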

And when I run my VCF files through snpEff, I end up with a large number of warnings:

WARNINGS: Some warning were detected
Warning type    Number of warnings
WARNING_TRANSCRIPT_INCOMPLETE           13734
WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS 19061
WARNING_TRANSCRIPT_NO_START_CODON   198371
WARNING_TRANSCRIPT_NO_STOP_CODON    348881
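To see which transcripts are behind these totals, the warnings can also be tallied from the annotated VCF. The sketch below assumes the standard `ANN` INFO field layout, with 16 pipe-separated subfields per annotation, the last holding errors/warnings joined by `&`. The function name `count_ann_warnings` and the example records are made up, and the tallies are per annotation, so they may not match snpEff's summary exactly.

```python
from collections import Counter

def count_ann_warnings(vcf_lines):
    """Count WARNING_*/ERROR_* messages in the ANN field of VCF records."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        for entry in fields[7].split(";"):
            if not entry.startswith("ANN="):
                continue
            for ann in entry[4:].split(","):
                parts = ann.split("|")
                if len(parts) < 16:
                    continue  # unexpected layout, skip
                for msg in parts[15].split("&"):
                    if msg.startswith(("WARNING_", "ERROR_")):
                        counts[msg] += 1
    return counts

def make_ann(messages):
    """Build a made-up 16-field ANN entry ending in the given messages."""
    return "|".join([
        "A", "missense_variant", "MODERATE", "g1", "g1", "transcript", "t1",
        "protein_coding", "1/2", "c.4G>A", "p.Ala2Thr", "4/9", "4/9", "2/3",
        "", messages,
    ])

# Made-up records for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "\t".join(["I", "100", ".", "G", "A", ".", ".",
               "ANN=" + make_ann("WARNING_TRANSCRIPT_NO_START_CODON"
                                 "&WARNING_TRANSCRIPT_NO_STOP_CODON")
               + "," + make_ann("")]) + "\n",
    "\t".join(["I", "200", ".", "G", "A", ".", ".",
               "DP=10;ANN=" + make_ann("WARNING_TRANSCRIPT_NO_STOP_CODON")]) + "\n",
]
for message, n in sorted(count_ann_warnings(example_vcf).items()):
    print(message, n)
```

Extending the counter key to include the transcript ID (the seventh subfield in this layout) would show whether the warnings are spread across all transcripts or concentrated in a subset.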

The snpEff GitHub page doesn't go into much detail about these warnings, but it suggests they may be caused by problems with the format of the annotation file. I have noticed that the example annotation input they provide includes separate entries for the CDS and for the START/STOP codons, while my annotation file has CDS, exon and transcript entries but no START/STOP codon entries. From what I have read, the format of my annotation file is fairly standard, since the CDS is considered to contain the START/STOP codons implicitly, so I'm not sure whether this is related.
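A quick way to confirm which feature types an annotation file actually contains is to count the values in its third column. This is a minimal sketch with a hypothetical function name (`count_feature_types`) and made-up GTF records; for a real file, pass an open file handle instead of the example list.

```python
from collections import Counter

def count_feature_types(gtf_lines):
    """Count the feature type (column 3) of each record in a GTF/GFF file."""
    counts = Counter()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 9:
            counts[cols[2]] += 1
    return counts

# Made-up records for illustration only.
attrs = 'gene_id "g1"; transcript_id "t1";'
example_gtf = [
    "# example annotation\n",
    "\t".join(["I", "src", "transcript", "1", "900", ".", "+", ".", attrs]) + "\n",
    "\t".join(["I", "src", "exon", "1", "400", ".", "+", ".", attrs]) + "\n",
    "\t".join(["I", "src", "exon", "600", "900", ".", "+", ".", attrs]) + "\n",
    "\t".join(["I", "src", "CDS", "100", "400", ".", "+", "0", attrs]) + "\n",
    "\t".join(["I", "src", "CDS", "600", "800", ".", "+", "2", attrs]) + "\n",
]
counts = count_feature_types(example_gtf)
for feature, n in sorted(counts.items()):
    print(feature, n)
print("start_codon present:", "start_codon" in counts)
print("stop_codon present:", "stop_codon" in counts)
```

Running the same count on the example annotation from the snpEff documentation and on the converted stickleback GTF would make the difference in feature types explicit.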

I would be very grateful for any advice on what may be causing these errors or possible leads on how to get around it! Please let me know if there is any additional information I can provide!

Thanks!

snpEff • 727 views

ADD COMMENT • link updated 17 months ago by boczniak767 ▴ 870 • written 2.0 years ago by Bismuth310 ▴ 10

Hi, have you found a solution to this issue?

ADD REPLY • link 17 months ago by boczniak767 ▴ 870