snpEff: ERROR_CHROMOSOME_NOT_FOUND
2.9 years ago
ziziqolo ▴ 10

Hey BioStars

I'm trying to annotate my bacterial variants with SnpEff:

java -jar snpEff.jar -v Mycoplasma_hyopneumoniae_168_l /mnt/f/mycopn/variantcalling.vcf > /mnt/f/mycopn/res.ann.vcf

Both my reference genome and my snpEff database are Mycoplasma_hyopneumoniae_168_l, but I ran into trouble.

The annotated VCF file has an empty ID column, and snpEff reports ERROR_CHROMOSOME_NOT_FOUND (9986 errors). I read about the error but could not find any hint on how to fix it, as I'm a newbie.

Would you please help me? Best Regards...

variant annotation • 3.5k views

What is the output of

grep -v "^#" /mnt/f/mycopn/variantcalling.vcf | cut -f 1 | uniq | sort | uniq

How does it compare to the chromosome names in the Mycoplasma_hyopneumoniae_168_l database?
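A minimal sketch of that comparison, assuming the database's chromosome names have already been written to a plain text file, one per line (`db_chroms.txt` is a hypothetical file name). The toy VCF and the `Chromosome` entry are placeholders for illustration, not values confirmed for this genome.

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy VCF standing in for variantcalling.vcf (made-up records).
printf '##fileformat=VCFv4.2\n' > toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> toy.vcf
printf 'NC_021283.1\t100\t.\tA\tG\t50\tPASS\t.\n' >> toy.vcf
printf 'NC_021283.1\t250\t.\tC\tT\t60\tPASS\t.\n' >> toy.vcf

# Names the database knows about, one per line (placeholder value).
printf 'Chromosome\n' | sort -u > db_chroms.txt

# Unique contig names used by the VCF records.
grep -v '^#' toy.vcf | cut -f 1 | sort -u > vcf_chroms.txt

# comm -23 keeps the lines that appear only in the first (sorted) file,
# i.e. contigs used in the VCF that the database list does not contain.
missing=$(comm -23 vcf_chroms.txt db_chroms.txt)
echo "Contigs in the VCF but not in the database list: $missing"
```

Any name printed here is one that snpEff would be expected to report as ERROR_CHROMOSOME_NOT_FOUND.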


The command you wrote returns: NC_021283.1

As for the second part, sorry, did you ask about the aligner? I used hisat2.


How did you build or obtain the snpEff database for Mycoplasma_hyopneumoniae_168_l?


For example, in https://www.ncbi.nlm.nih.gov/nuccore/NC_017509.1 the genome could be named NC_017509.1 in snpEff, and not NC_021283.1.


Oh sure, snpEff downloaded the database itself:

Downloading database for 'Mycoplasma_hyopneumoniae_168_l'
Database installed.

Hi. I have the same problem. In my case, when I ran the command:

grep -v "^#" sampleAX53.bcftools.filt.vcf| cut -f 1| uniq| sort| uniq

It returns:

NZ_CP092040.1 NZ_CP092041.1

Is this normal, and do I need to change these two names? Or is it an error in my downloaded reference?

Thanks

2.9 years ago

snpEff uses the data from Ensembl: http://ftp.ensemblgenomes.org/pub/current/bacteria/species_EnsemblBacteria.txt

>>> 2
$1                #name : Mycoplasma hyopneumoniae 168 (GCA_000183185)
$2              species : mycoplasma_hyopneumoniae_168_gca_000183185
$3             division : EnsemblBacteria
$4          taxonomy_id : 907287
$5             assembly : ASM18318v1
$6   assembly_accession : GCA_000183185.1
$7            genebuild : 2014-05-HuazhongAgriculturalUniversity
$8            variation : N
$9           microarray : N
$10         pan_compara : N
$11     peptide_compara : N
$12   genome_alignments : N
$13    other_alignments : Y
$14             core_db : bacteria_113_collection_core_52_105_1
$15          species_id : 196
$16                 ??? : 
<<< 2

There is a good chance that the chromosome in the snpEff database is not named "NC_021283.1" but CP003131.1 (https://www.ncbi.nlm.nih.gov/assembly/GCA_000400855.1), or something else.


Thank you so much. So do I now have to build my own database with the name NC_021283.1?


You have to find out the name of the chromosome in the snpEff database and change the name of the contig in the VCF to match it, using bcftools annotate --rename-chrs ...
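A rough sketch of that renaming step. `--rename-chrs` takes a mapping file with one whitespace-separated "old_name new_name" pair per line. The target name `Chromosome`, the output file name, and `chr_map.txt` are assumptions for illustration; use whatever name your snpEff database actually reports.

```shell
#!/bin/sh
# Placeholder names: old_name is what the VCF uses, new_name is what the
# snpEff database uses. "Chromosome" is an assumption, not confirmed here.
old_name="NC_021283.1"
new_name="Chromosome"
in_vcf="variantcalling.vcf"
out_vcf="variantcalling.renamed.vcf"   # hypothetical output name

# Mapping file for --rename-chrs: "old_name new_name", one pair per line.
printf '%s\t%s\n' "$old_name" "$new_name" > chr_map.txt

# Run the rename only when bcftools and the input VCF are both available.
if command -v bcftools >/dev/null 2>&1 && [ -f "$in_vcf" ]; then
    bcftools annotate --rename-chrs chr_map.txt -o "$out_vcf" "$in_vcf"
    # Check which contig names the renamed VCF now uses.
    grep -v '^#' "$out_vcf" | cut -f 1 | sort -u
else
    echo "bcftools or $in_vcf not found; mapping written to chr_map.txt"
fi
```

The renamed VCF can then be passed to snpEff in place of the original file.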


May I ask how to find out the name of the chromosome in the snpEff database? I downloaded my database from NCBI with the command below.

java -jar snpEff.jar download -v Escherichia_coli_str_k_12_substr_mg1655
00:00:05 Download finished. Total 1700163 bytes.
00:00:05 Extracting file 'data/Escherichia_coli_str_k_12_substr_mg1655/snpEffectPredictor.bin'
00:00:05 Unzip: OK
00:00:05 Deleted local file 'C:\Users\73605\AppData\Local\Temp\/snpEff_v5_0_Escherichia_coli_str_k_12_substr_mg1655.zip'
00:00:05 Done
00:00:05 Logging
00:00:06 Done. 

However, when I run the command, I have the same issue "ERROR_CHROMOSOME_NOT_FOUND". Do you think this is also caused by the wrong database?

PS C:\Users\73605\Bio\SnpEff> java -Xmx4g -jar snpEff.jar Escherichia_coli_str_k_12_substr_mg1655 -c .\snpEff.config -v .\variants.flt.vcf > ann.vcf
00:00:00 SnpEff version SnpEff 5.1d (build 2022-04-19 15:49), by Pablo Cingolani
00:00:00 Command: 'ann'
00:00:00 Reading configuration file '.\snpEff.config'. Genome: 'Escherichia_coli_str_k_12_substr_mg1655'
00:00:00 Reading config file: C:\Users\73605\Bio\SnpEff\snpEff.config
00:00:00 done
00:00:00 Reading database for genome version 'Escherichia_coli_str_k_12_substr_mg1655' from file 'C:\Users\73605\Bio\SnpEff/./data/Escherichia_coli_str_k_12_substr_mg1655/snpEffectPredictor.bin' (this might take a while)
00:00:01 done
00:00:01 Loading Motifs and PWMs
00:00:01 Building interval forest
00:00:01 done.
00:00:01 Genome stats :
#-----------------------------------------------
# Genome name                : 'Escherichia_coli_str_k_12_substr_mg1655'
# Genome version             : 'Escherichia_coli_str_k_12_substr_mg1655'
# Genome ID                  : 'Escherichia_coli_str_k_12_substr_mg1655[0]'
# Has protein coding info    : true
# Has Tr. Support Level info : true
# Genes                      : 4497
# Protein coding genes       : 4226
#-----------------------------------------------
# Transcripts                : 4497
# Avg. transcripts per gene  : 1.00
# TSL transcripts            : 0
#-----------------------------------------------
# Checked transcripts        :
#               AA sequences :   4140 ( 97.96% )
#              DNA sequences :   4322 ( 96.11% )
#-----------------------------------------------
# Protein coding transcripts : 4226
#              Length errors :     72 ( 1.70% )
#  STOP codons in CDS errors :     13 ( 0.31% )
#         START codon errors :    406 ( 9.61% )
#        STOP codon warnings :     14 ( 0.33% )
#              UTR sequences :      0 ( 0.00% )
#               Total Errors :    406 ( 9.61% )
# WARNING                    : No protein coding transcript has UTR
#-----------------------------------------------
# Cds                        : 4141
# Exons                      : 4564
# Exons with sequence        : 4564
# Exons without sequence     : 0
# Avg. exons per transcript  : 1.01
#-----------------------------------------------
# Number of chromosomes      : 1
# Chromosomes                : Format 'chromo_name size codon_table'
#               'Chromosome'    4641652 Standard
#-----------------------------------------------

00:00:01 Predicting variants
00:00:02        40000 variants (33277 variants per second), 39970 VCF entries
00:00:02        50000 variants (34818 variants per second), 49936 VCF entries
00:00:03        60000 variants (36057 variants per second), 59929 VCF entries
00:00:03        70000 variants (36978 variants per second), 69917 VCF entries
00:00:03        80000 variants (37700 variants per second), 79909 VCF entries
00:00:03        90000 variants (38314 variants per second), 89897 VCF entries
00:00:04        100000 variants (38804 variants per second), 99887 VCF entries

ERRORS: Some errors were detected
Error type      Number of errors
ERROR_CHROMOSOME_NOT_FOUND      106997
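The verbose log above already lists the database's chromosome names in its "Genome stats" block (here 'Chromosome'). A small sketch for pulling them out of a saved log, assuming the `-v` messages were redirected to a file; the file name `snpeff_sample.log` and the excerpt written into it are only for illustration.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Excerpt of the "Genome stats" block, copied from the log above.
cat > snpeff_sample.log <<'EOF'
# Number of chromosomes      : 1
# Chromosomes                : Format 'chromo_name size codon_table'
#               'Chromosome'    4641652 Standard
#-----------------------------------------------
EOF

# Split each line on single quotes; between the "# Chromosomes" header and
# the next dashed separator, the second field is the chromosome name.
db_chroms=$(awk -F"'" '
    /^# Chromosomes *:/   { in_list = 1; next }
    in_list && /^#-+$/    { in_list = 0 }
    in_list               { print $2 }
' snpeff_sample.log)

echo "$db_chroms"
```

If the names printed here differ from the first column of the VCF, every record on those contigs would be counted under ERROR_CHROMOSOME_NOT_FOUND.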

Try changing the chromosome name in the VCF to 'Chromosome'. It worked for me, because the 2nd line of the downloaded database file had the name 'Chromosome' right after the length (1042518), like this:

CHROMOSOME 2 1 0 1042518 Chromosome false true

For example, using awk '{gsub(/NC_000117\.1/, "Chromosome"); print;}' your_file.vcf > renamed.vcf
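A hedged sketch for reading that name straight from the database file. It assumes that `snpEffectPredictor.bin` is gzip-compressed text and that CHROMOSOME records carry the name in the sixth column, as in the line quoted above. Neither assumption is checked against the snpEff documentation, and `list_db_chromosomes` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the chromosome names stored in a snpEff database file.
# Assumption: the file is gzip-compressed text with whitespace-separated
# CHROMOSOME records whose sixth column is the chromosome name.
list_db_chromosomes() {
    gzip -dc "$1" | awk '$1 == "CHROMOSOME" { print $6 }' | sort -u
}

workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy stand-in for a real database file, built from the quoted record.
# The first line is a made-up placeholder, not real snpEff content.
printf 'HEADER_PLACEHOLDER\nCHROMOSOME\t2\t1\t0\t1042518\tChromosome\tfalse\ttrue\n' \
    | gzip -c > toy_predictor.bin

names=$(list_db_chromosomes toy_predictor.bin)
echo "$names"
```

On a real installation the helper would be pointed at the downloaded file, for example `list_db_chromosomes data/Escherichia_coli_str_k_12_substr_mg1655/snpEffectPredictor.bin`.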
