Question

snpEff database build: FATAL ERROR: No CDS checked


12 months ago

rwherman13 • 0

I am attempting to build a snpEff database but it keeps failing in stage 3.

FATAL ERROR: No CDS checked. This is might be caused by differences in FASTA file transcript IDs respect to database's transcript's IDs.

To troubleshoot the issue, I made minimal genes.gtf and cds.fa files containing a single CDS, and I made sure the transcript IDs are consistent between the two files. However, I am still getting the error below.

00:00:02 done (1 CDSs).
00:00:02 Comparing CDS...
Labels:
    '+' : OK
    '.' : Missing
    '*' : Error
*
CDS check:  GCA_010090195.1_BGI_Ppap.V1_genomic OK: 0   Warnings: 0 Not found: 0
Errors: 1   Error percentage: 100.0%
FATAL ERROR: No CDS checked. This is might be caused by differences in FASTA file transcript IDs respect to database's transcript's IDs.
Transcript IDs from database (sample):
'gnl_WGS_VUKY_FQV07_0014720'
Transcript IDs from database (fasta file):
'gnl_WGS_VUKY_FQV07_0014720'
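
A quick way to double-check that the IDs really do match is to extract them from both files and compare the two sets. The following is a minimal Python sketch, not part of snpEff; the function names are hypothetical, and it assumes the FASTA ID is the header text up to the first whitespace, which may not be exactly how snpEff parses headers.

```python
import re


def gtf_transcript_ids(gtf_text):
    """Collect transcript_id values from the attribute column of GTF text."""
    ids = set()
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        match = re.search(r'transcript_id "([^"]+)"', line)
        if match:
            ids.add(match.group(1))
    return ids


def fasta_ids(fasta_text):
    """Collect FASTA record IDs (assumption: header text up to the first whitespace)."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def compare_ids(gtf_text, fasta_text):
    """Return the IDs shared by both files and the IDs unique to each."""
    gtf = gtf_transcript_ids(gtf_text)
    fasta = fasta_ids(fasta_text)
    return {"shared": gtf & fasta, "gtf_only": gtf - fasta, "fasta_only": fasta - gtf}
```

Calling `compare_ids(open("genes.gtf").read(), open("cds.fa").read())` should leave `gtf_only` and `fasta_only` empty when the IDs agree under these assumptions.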

snpEff seems to think there is a mismatch when there is not. Here are the exact input files that produce this error:

cds.fa

>gnl_WGS_VUKY_FQV07_0014720
ATGAGGCTCCCGCTGGCTTTCGCCGTGCTCCTCCTGGCCTCGGCGCAGGCGCTGGCCGAGGAGATGGGGGCCACCGACGA
CCTCAGCTACTGGTCGGACTGGTCCGACGGCGACCAGGTGAAGGAGGAGCTGCCGCTGCCTCTGGAGCACTTCCTGCAGA

genes.gtf

VUKY01000002.1  Genbank gene    258512  259237  .   +   .   gene_id "FQV07_0014720"; gene_name "TAC1"; gene_biotype "protein_coding";
VUKY01000002.1  Genbank transcript  258512  259237  .   +   .   gene_id "FQV07_0014720"; transcript_id "gnl_WGS_VUKY_FQV07_0014720";
VUKY01000002.1  Genbank exon    258512  258634  .   +   .   gene_id "FQV07_0014720"; transcript_id "gnl_WGS_VUKY_FQV07_0014720"; exon_number "1";
VUKY01000002.1  Genbank exon    259139  259234  .   +   .   gene_id "FQV07_0014720"; transcript_id "gnl_WGS_VUKY_FQV07_0014720"; exon_number "2";
VUKY01000002.1  Genbank CDS 258512  258634  .   +   0   gene_id "FQV07_0014720"; transcript_id "gnl_WGS_VUKY_FQV07_0014720"; exon_number "1"; protein_id "KAF1677113.1";
VUKY01000002.1  Genbank CDS 259139  259237  .   +   0   gene_id "FQV07_0014720"; transcript_id "gnl_WGS_VUKY_FQV07_0014720"; exon_number "2"; protein_id "KAF1677113.1";
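
One thing worth ruling out: the log reports "Not found: 0" and "Errors: 1", which suggests the ID was located but the comparison itself failed. In the files as posted above, the two CDS intervals span 123 + 99 = 222 nt, the two FASTA lines shown contain 160 nt, and the second exon ends at 259234 while the second CDS ends at 259237. Whether any of this is what trips the check is an assumption, not something the log confirms. The sketch below (plain Python with hypothetical function names, not a snpEff feature) compares the summed CDS length per transcript in the GTF with the sequence length in the FASTA.

```python
import re
from collections import defaultdict


def cds_lengths_from_gtf(gtf_text):
    """Sum CDS feature lengths per transcript_id (GTF coordinates are 1-based, inclusive)."""
    totals = defaultdict(int)
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # Split on any whitespace so tab- and space-separated copies both parse.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            start, end = int(fields[3]), int(fields[4])
            totals[match.group(1)] += end - start + 1
    return dict(totals)


def fasta_lengths(fasta_text):
    """Return the sequence length of each FASTA record, keyed by the first header token."""
    lengths = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else None
            if current is not None:
                lengths[current] = 0
        elif current is not None and line:
            lengths[current] += len(line)
    return lengths


def length_mismatches(gtf_text, fasta_text):
    """Map transcript_id -> (GTF CDS length, FASTA length) wherever the two differ."""
    gtf = cds_lengths_from_gtf(gtf_text)
    fasta = fasta_lengths(fasta_text)
    return {
        tid: (gtf[tid], fasta[tid])
        for tid in gtf
        if tid in fasta and gtf[tid] != fasta[tid]
    }
```

An empty result from `length_mismatches(open("genes.gtf").read(), open("cds.fa").read())` would rule out a length difference; it would not by itself show that the sequences are identical.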

I am starting to wonder if there is a glitch in the most recent update of snpEff?

gtf snpEff CDS



cds.fa? You should provide a whole-genome FASTA, not a "CDS FASTA". What was the exact command line, please?

12 months ago by Pierre Lindenbaum


Sorry, I also included the appropriate whole-genome FASTA file (renamed sequences.fa as required by snpEff; the original can be found here). From my understanding, cds.fa is the additional coding-sequence file required for building the database and running the CDS checks. I was following this file structure for a new database build:

snpEff/
|-- data/
|   |-- GCA_010090195.1_BGI_Ppap.V1_genomic/
|       |-- genes.gtf
|       |-- cds.fa
|       |-- sequences.fa
|-- snpEff.config
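
The layout above can be sanity-checked before running the build. This is a small Python sketch under stated assumptions: `missing_build_inputs` is a hypothetical helper, not part of snpEff, and it only checks for the three files listed in the tree. The genome name passed in should be spelled exactly like the data directory, including any `.` versus `_` difference.

```python
from pathlib import Path

# File names taken from the directory tree in this post.
REQUIRED_FILES = ("genes.gtf", "cds.fa", "sequences.fa")


def missing_build_inputs(snpeff_home, genome_name):
    """Return the expected input files that are absent from data/<genome_name>/."""
    genome_dir = Path(snpeff_home) / "data" / genome_name
    return [name for name in REQUIRED_FILES if not (genome_dir / name).is_file()]
```

For example, `missing_build_inputs("snpEff", "GCA_010090195.1_BGI_Ppap.V1_genomic")` returns an empty list when all three files are in place, and all three names when the directory name does not match.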

This is the command line I used.

java -jar ./snpEff.jar build -gtf22 -v GCA_010090195.1_BGI_Ppap.V1_genomic

Originally, when I attempted to build the new database for this species, the CDS check reported an error percentage of 100% across all CDSs (over 18,000). The transcript IDs in the cds.fa and GTF files from the NCBI archive did not match at all.

FATAL ERROR: No CDS checked. This is might be caused by differences in FASTA file transcript IDs respect to database's transcript's IDs.
Transcript IDs from database (sample):
'gene-FQV07_0011329'
'rna-gnl|WGS:VUKY|FQV07_0007911'
'rna-gnl|WGS:VUKY|FQV07_0007910'
'rna-gnl|WGS:VUKY|FQV07_0007913'
'rna-gnl|WGS:VUKY|FQV07_0007912'
'rna-gnl|WGS:VUKY|FQV07_0007915'
'rna-gnl|WGS:VUKY|FQV07_0007914'
'rna-gnl|WGS:VUKY|FQV07_0007917'
'rna-gnl|WGS:VUKY|FQV07_0007916'
'rna-gnl|WGS:VUKY|FQV07_0007919'
'rna-gnl|WGS:VUKY|FQV07_0007918'
'rna-gnl|WGS:VUKY|FQV07_0007900'
'rna-gnl|WGS:VUKY|FQV07_0007902'
'rna-gnl|WGS:VUKY|FQV07_0007901'
'rna-gnl|WGS:VUKY|FQV07_0007904'
'rna-gnl|WGS:VUKY|FQV07_0007903'
'rna-gnl|WGS:VUKY|FQV07_0007906'
'rna-gnl|WGS:VUKY|FQV07_0007905'
'rna-gnl|WGS:VUKY|FQV07_0007908'
'rna-gnl|WGS:VUKY|FQV07_0007907'
'rna-gnl|WGS:VUKY|FQV07_0007909'
'rna-gnl|WGS:VUKY|FQV07_0007973'
Transcript IDs from database (fasta file):
'lcl|VUKY01006637.1_cds_KAF1671816.1_5875'
'lcl|VUKY01007247.1_cds_KAF1671237.1_6521'
'lcl|VUKY01013002.1_cds_KAF1448495.1_11157'
'lcl|VUKY01001420.1_cds_KAF1675875.1_1365'
'lcl|VUKY01011188.1_cds_KAF1452788.1_9808'
'lcl|VUKY01017391.1_cds_KAF1438531.1_15510'
'lcl|VUKY01016899.1_cds_FQV07_0000072_14966'
'lcl|VUKY01016062.1_cds_KAF1441525.1_14180'
'lcl|VUKY01003178.1_cds_KAF1674661.1_2711'
'1_cds_KAF1441745'
'1_cds_KAF1441742'
'1_cds_KAF1441743'
'lcl|VUKY01017137.1_cds_KAF1439062.1_15304'
'lcl|VUKY01005612.1_cds_KAF1672553.1_5072'
'1_cds_KAF1441736'
'1_cds_KAF1441740'
'lcl|VUKY01000809.1_cds_FQV07_0011924_630'
'1_cds_KAF1441741'
'lcl|VUKY01014090.1_cds_KAF1445753.1_12390'
'lcl|VUKY01007596.1_cds_KAF1670968.1_6814'
'lcl|VUKY01002008.1_cds_KAF1675533.1_1734'
'lcl|VUKY01000406.1_cds_KAF1676896.1_252'
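
Since the CDS lines in the GTF carry a protein_id attribute (for example "KAF1677113.1") and most FASTA headers embed a protein accession after `_cds_`, one possible way to reconcile the two files is to rename the FASTA headers through that accession. The sketch below is only an illustration in Python: the function names and the header regular expression are assumptions based on the IDs listed above, and headers that do not follow the `_cds_<accession>.<version>_<number>` pattern (such as the `FQV07_...` ones) are left unchanged and reported.

```python
import re

# Assumed header shape, e.g. lcl|VUKY01006637.1_cds_KAF1671816.1_5875
PROTEIN_IN_HEADER = re.compile(r"_cds_([A-Z]{3}\d+\.\d+)_\d+$")


def protein_to_transcript(gtf_text):
    """Map protein_id -> transcript_id using GTF lines that carry both attributes."""
    mapping = {}
    for line in gtf_text.splitlines():
        transcript = re.search(r'transcript_id "([^"]+)"', line)
        protein = re.search(r'protein_id "([^"]+)"', line)
        if transcript and protein:
            mapping[protein.group(1)] = transcript.group(1)
    return mapping


def rename_fasta_headers(fasta_text, mapping):
    """Rewrite FASTA headers to transcript IDs; return (new text, unmapped headers)."""
    out_lines = []
    unmapped = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            header = fields[0] if fields else ""
            match = PROTEIN_IN_HEADER.search(header)
            transcript_id = mapping.get(match.group(1)) if match else None
            if transcript_id is not None:
                line = ">" + transcript_id
            else:
                unmapped.append(header)
        out_lines.append(line)
    return "\n".join(out_lines) + "\n", unmapped
```

The list of unmapped headers is returned rather than silently dropped, so records that need manual attention stay visible.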

So, in my attempt to troubleshoot, I made minimal files and edited them to have matching transcript IDs, but the build still throws the same error.

12 months ago by rwherman13