4 months ago
rwherman13
I am attempting to build a snpEff database but it keeps failing in stage 3.
FATAL ERROR: No CDS checked. This is might be caused by differences in FASTA file transcript IDs respect to database's transcript's IDs.
To troubleshoot, I made a minimal genes.gtf and cds.fa containing a single CDS, and I made sure the transcript IDs are consistent between the two files. However, I still get the error below.
00:00:02 done (1 CDSs).
00:00:02 Comparing CDS...
Labels:
'+' : OK
'.' : Missing
'*' : Error
*
CDS check: GCA_010090195.1_BGI_Ppap.V1_genomic OK: 0 Warnings: 0 Not found: 0
Errors: 1 Error percentage: 100.0%
FATAL ERROR: No CDS checked. This is might be caused by differences in FASTA file transcript IDs respect to database's transcript's IDs.
Transcript IDs from database (sample):
'gnl_WGS_VUKY_FQV07_0014720'
Transcript IDs from database (fasta file):
'gnl_WGS_VUKY_FQV07_0014720'
snpEff seems to think there is a mismatch when there is not. Here are the exact input files that produce this error:
cds.fa
>gnl_WGS_VUKY_FQV07_0014720
ATGAGGCTCCCGCTGGCTTTCGCCGTGCTCCTCCTGGCCTCGGCGCAGGCGCTGGCCGAGGAGATGGGGGCCACCGACGA
CCTCAGCTACTGGTCGGACTGGTCCGACGGCGACCAGGTGAAGGAGGAGCTGCCGCTGCCTCTGGAGCACTTCCTGCAGA
genes.gtf
VUKY01000002.1 Genbank gene 258512 259237 . + . gene_id "FQV07_0014720"; gene_name "TAC1"; gene_biotype "protein_coding";
VUKY01000002.1 Genbank transcript 258512 259237 . + . gene_id "FQV07_0014720"; transcript_id "gnl_WGS_VUKY_FQV07_0014720";
VUKY01000002.1 Genbank exon 258512 258634 . + . gene_id "FQV07_0014720"; transcript_id "gnl_WGS_VUKY_FQV07_0014720"; exon_number "1";
VUKY01000002.1 Genbank exon 259139 259234 . + . gene_id "FQV07_0014720"; transcript_id "gnl_WGS_VUKY_FQV07_0014720"; exon_number "2";
VUKY01000002.1 Genbank CDS 258512 258634 . + 0 gene_id "FQV07_0014720"; transcript_id "gnl_WGS_VUKY_FQV07_0014720"; exon_number "1"; protein_id "KAF1677113.1";
VUKY01000002.1 Genbank CDS 259139 259237 . + 0 gene_id "FQV07_0014720"; transcript_id "gnl_WGS_VUKY_FQV07_0014720"; exon_number "2"; protein_id "KAF1677113.1";
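One check that may help here: in the log above, the transcript is counted under "Errors" rather than "Not found", which suggests the ID was matched and the sequence comparison is what failed. The sketch below sums the CDS feature lengths per transcript_id in a GTF and compares them with the sequence lengths in a FASTA file. It is an independent sanity check, not a reproduction of snpEff's own comparison; the function names are hypothetical, and it assumes the FASTA ID is the first whitespace-delimited token of the header.

```python
import re
from collections import defaultdict


def cds_lengths_from_gtf(gtf_lines):
    """Sum the lengths of all CDS features per transcript_id."""
    lengths = defaultdict(int)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        # Split on any whitespace so tab- and space-separated lines both work;
        # the 9th field keeps the whole attribute string.
        fields = line.rstrip("\n").split(None, 8)
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            start, end = int(fields[3]), int(fields[4])
            lengths[match.group(1)] += end - start + 1  # GTF coordinates are inclusive
    return dict(lengths)


def fasta_lengths(fasta_lines):
    """Return sequence length per record, keyed by the first header token."""
    lengths = {}
    current = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def length_mismatches(gtf_lines, fasta_lines):
    """Map transcript_id -> (GTF CDS length, FASTA length) where they differ."""
    gtf = cds_lengths_from_gtf(gtf_lines)
    fasta = fasta_lengths(fasta_lines)
    shared = gtf.keys() & fasta.keys()
    return {tid: (gtf[tid], fasta[tid]) for tid in shared if gtf[tid] != fasta[tid]}


if __name__ == "__main__":
    with open("genes.gtf") as gtf_fh, open("cds.fa") as fa_fh:
        for tid, (gtf_len, fa_len) in length_mismatches(gtf_fh, fa_fh).items():
            print(f"{tid}\tGTF CDS length={gtf_len}\tFASTA length={fa_len}")
```

For the files as posted, the two CDS features span 123 + 99 = 222 nt, while the cds.fa record shown contains 160 nt (assuming the post shows the whole record). The second CDS also ends at 259237, three bases past the end of exon 2 at 259234. Either difference could plausibly make the sequence comparison fail, though that is not confirmed here.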
I am starting to wonder whether there is a bug in the most recent release of snpEff.
cds.fa? You should provide a whole-genome FASTA, not a "CDS FASTA". What was the exact command line, please?
Sorry, I also included the appropriate whole-genome FASTA file (renamed sequences.fa as required by snpEff; the original can be found here). From my understanding, cds.fa is the additional coding-sequence file required for building the database and running the CDS checks. I was following this new database build file structure:
This is the command line I used.
Originally, when I attempted to build the new database for this species, the CDS check failed with a 100% error rate across all CDSs (over 18,000). The transcript IDs in the cds.fa and GTF files from the NCBI archive did not match at all.
So, in my attempt to troubleshoot, I made minimal files and edited them to have matching transcript IDs, but these produce the same error.
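For the full-size files, a quick way to see how far apart the two ID sets are is to list the IDs that appear in only one of the files. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the GTF ID is read from the transcript_id attribute, and the FASTA ID is taken to be the first whitespace-delimited token of each header line. It does not reproduce any ID matching that snpEff itself performs.

```python
import re


def gtf_transcript_ids(gtf_lines):
    """Collect every transcript_id value found in the GTF attribute column."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        match = re.search(r'transcript_id "([^"]+)"', line)
        if match:
            ids.add(match.group(1))
    return ids


def fasta_ids(fasta_lines):
    """Collect the first token of every FASTA header line."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])
    return ids


def compare_ids(gtf_lines, fasta_lines):
    """Return sorted lists of shared IDs and of IDs unique to each file."""
    gtf = gtf_transcript_ids(gtf_lines)
    fasta = fasta_ids(fasta_lines)
    return {
        "shared": sorted(gtf & fasta),
        "gtf_only": sorted(gtf - fasta),
        "fasta_only": sorted(fasta - gtf),
    }


if __name__ == "__main__":
    with open("genes.gtf") as gtf_fh, open("cds.fa") as fa_fh:
        report = compare_ids(gtf_fh, fa_fh)
    for key, values in report.items():
        # Print counts plus a few examples so naming patterns are easy to spot.
        print(f"{key}: {len(values)} (e.g. {values[:3]})")
```

Printing a few examples from each group usually makes a systematic difference (a prefix, a version suffix, a different separator) visible, which indicates how the IDs would need to be rewritten before rebuilding.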