Question

snpEff: ERROR_CHROMOSOME_NOT_FOUND

0

3.4 years ago

ziziqolo ▴ 10

Hey BioStars

I'm trying to annotate my bacterial variants with SnpEff:

java -jar snpEff.jar -v Mycoplasma_hyopneumoniae_168_l /mnt/f/mycopn/variantcalling.vcf > /mnt/f/mycopn/res.ann.vcf

Both my reference and the snpEff database are Mycoplasma_hyopneumoniae_168_l, but I ran into trouble.

The annotated VCF file has an empty ID column, and snpEff reports ERROR_CHROMOSOME_NOT_FOUND 9986. I read about the error but, being a newbie, could not find any hint on how to fix it.

Would you please help me? Best Regards...

variant annotation • 4.2k views

ADD COMMENT • link updated 9 months ago by Eugenia • 0 • written 3.4 years ago by ziziqolo ▴ 10

0

What is the output of

grep -v "^#" /mnt/f/mycopn/variantcalling.vcf | cut -f 1 | uniq | sort | uniq

and how does it compare to the chromosome names in the Mycoplasma_hyopneumoniae_168_l database?
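A small variation on that command also counts the records per contig, which makes the comparison easier when a VCF holds several sequences. This is only a sketch: the demo VCF content below is made up so the snippet runs on its own, and you would point it at your own file instead.

```shell
# Demo VCF (hypothetical content) so the sketch runs on its own.
count_dir=$(mktemp -d)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nNC_021283.1\t100\t.\tA\tG\t50\tPASS\t.\nNC_021283.1\t200\t.\tC\tT\t50\tPASS\t.\n' > "$count_dir/demo.vcf"

# Skip header lines, take column 1 (CHROM), and count records per contig name.
contig_counts=$(grep -v '^#' "$count_dir/demo.vcf" | cut -f 1 | sort | uniq -c | awk '{print $2 "\t" $1}')
printf '%s\n' "$contig_counts"
```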

ADD REPLY • link 3.4 years ago by Pierre Lindenbaum 166k

0

The command you wrote returns: NC_021283.1

As for the second part, sorry, were you asking about the aligner? I used hisat2.

ADD REPLY • link 3.4 years ago by ziziqolo ▴ 10

0

How did you build or obtain the snpEff database for Mycoplasma_hyopneumoniae_168_l?

ADD REPLY • link 3.4 years ago by Pierre Lindenbaum 166k

1

For example, going by https://www.ncbi.nlm.nih.gov/nuccore/NC_017509.1, the genome could be named NC_017509.1 in snpEff rather than NC_021283.1.

ADD REPLY • link 3.4 years ago by Pierre Lindenbaum 166k

0

Oh sure, snpEff downloaded the database itself:

Downloading database for 'Mycoplasma_hyopneumoniae_168_l'
Database installed.

ADD REPLY • link 3.4 years ago by ziziqolo ▴ 10

0

Hi. I have the same problem. In my case, when I ran the command:

grep -v "^#" sampleAX53.bcftools.filt.vcf| cut -f 1| uniq| sort| uniq

It returns:

NZ_CP092040.1 NZ_CP092041.1

Is this normal, and do I need to change these two names? Or is it an error in the reference I downloaded?

Thanks

ADD REPLY • link 20 months ago by Marco Antonio • 0

score 1 · Answer 1 · 2022-01-15

1

3.4 years ago

Pierre Lindenbaum 166k

snpeff uses the data from ensembl: http://ftp.ensemblgenomes.org/pub/current/bacteria/species_EnsemblBacteria.txt

>>> 2
$1                #name : Mycoplasma hyopneumoniae 168 (GCA_000183185)
$2              species : mycoplasma_hyopneumoniae_168_gca_000183185
$3             division : EnsemblBacteria
$4          taxonomy_id : 907287
$5             assembly : ASM18318v1
$6   assembly_accession : GCA_000183185.1
$7            genebuild : 2014-05-HuazhongAgriculturalUniversity
$8            variation : N
$9           microarray : N
$10         pan_compara : N
$11     peptide_compara : N
$12   genome_alignments : N
$13    other_alignments : Y
$14             core_db : bacteria_113_collection_core_52_105_1
$15          species_id : 196
$16                 ??? : 
<<< 2

There is a good chance that the chromosome in the snpEff database is not named "NC_021283.1" but CP003131.1 (https://www.ncbi.nlm.nih.gov/assembly/GCA_000400855.1), or something else.
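One way to make the mismatch explicit is to put both sets of names in files and list the VCF contigs that the database lacks. This is a minimal sketch: NC_021283.1 comes from the thread, while CP003131.1 is only the guess above and stands in for whatever name snpEff actually reports.

```shell
cmp_dir=$(mktemp -d)

# Names seen in the VCF, and names in the database.
# CP003131.1 is an assumed value; substitute what snpEff reports.
printf 'NC_021283.1\n' | sort -u > "$cmp_dir/vcf_names.txt"
printf 'CP003131.1\n'  | sort -u > "$cmp_dir/db_names.txt"

# comm -23 keeps lines unique to the first file: VCF contigs missing from the database.
missing=$(comm -23 "$cmp_dir/vcf_names.txt" "$cmp_dir/db_names.txt")
printf 'Not in database: %s\n' "$missing"
```

Any name printed here would produce ERROR_CHROMOSOME_NOT_FOUND for every variant on that contig.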

ADD COMMENT • link 3.4 years ago by Pierre Lindenbaum 166k

0

Thank you so much. So now I have to build my own database with the NC_021283.1 name?

ADD REPLY • link 3.4 years ago by ziziqolo ▴ 10

0

You have to find out what the chromosome is called in the snpEff database, then rename the contig in the VCF to match, using bcftools annotate --rename-chrs ...
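A rough sketch of that rename, under some assumptions: --rename-chrs takes a file with one "old_name new_name" pair per line, CP003131.1 is only a placeholder for the real database name, and variantcalling.vcf is a hypothetical input path. The bcftools call is skipped if the tool or the file is missing.

```shell
map_dir=$(mktemp -d)

# Mapping file: old contig name, whitespace, new contig name.
# CP003131.1 is a placeholder; use the name found in the snpEff database.
printf 'NC_021283.1\tCP003131.1\n' > "$map_dir/chr_map.txt"

# Hypothetical input path; adjust to your own VCF.
in_vcf="variantcalling.vcf"
if command -v bcftools >/dev/null 2>&1 && [ -f "$in_vcf" ]; then
  # Write an uncompressed VCF with the contig renamed.
  bcftools annotate --rename-chrs "$map_dir/chr_map.txt" -O v -o "$map_dir/renamed.vcf" "$in_vcf"
fi
```

For a VCF with several contigs, add one line per contig to the mapping file.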

ADD REPLY • link 3.4 years ago by Pierre Lindenbaum 166k

0

May I ask how to find the name of the chromosome in the snpEff database? I downloaded my database from NCBI with the command below.

java -jar snpEff.jar download -v Escherichia_coli_str_k_12_substr_mg1655
00:00:05 Download finished. Total 1700163 bytes.
00:00:05 Extracting file 'data/Escherichia_coli_str_k_12_substr_mg1655/snpEffectPredictor.bin'
00:00:05 Unzip: OK
00:00:05 Deleted local file 'C:\Users\73605\AppData\Local\Temp\/snpEff_v5_0_Escherichia_coli_str_k_12_substr_mg1655.zip'
00:00:05 Done
00:00:05 Logging
00:00:06 Done.

However, when I run the annotation, I get the same "ERROR_CHROMOSOME_NOT_FOUND" error. Do you think this is also caused by the wrong database?

PS C:\Users\73605\Bio\SnpEff> java -Xmx4g -jar snpEff.jar Escherichia_coli_str_k_12_substr_mg1655 -c .\snpEff.config -v .\variants.flt.vcf > ann.vcf
00:00:00 SnpEff version SnpEff 5.1d (build 2022-04-19 15:49), by Pablo Cingolani
00:00:00 Command: 'ann'
00:00:00 Reading configuration file '.\snpEff.config'. Genome: 'Escherichia_coli_str_k_12_substr_mg1655'
00:00:00 Reading config file: C:\Users\73605\Bio\SnpEff\snpEff.config
00:00:00 done
00:00:00 Reading database for genome version 'Escherichia_coli_str_k_12_substr_mg1655' from file 'C:\Users\73605\Bio\SnpEff/./data/Escherichia_coli_str_k_12_substr_mg1655/snpEffectPredictor.bin' (this might take a while)
00:00:01 done
00:00:01 Loading Motifs and PWMs
00:00:01 Building interval forest
00:00:01 done.
00:00:01 Genome stats :
#-----------------------------------------------
# Genome name                : 'Escherichia_coli_str_k_12_substr_mg1655'
# Genome version             : 'Escherichia_coli_str_k_12_substr_mg1655'
# Genome ID                  : 'Escherichia_coli_str_k_12_substr_mg1655[0]'
# Has protein coding info    : true
# Has Tr. Support Level info : true
# Genes                      : 4497
# Protein coding genes       : 4226
#-----------------------------------------------
# Transcripts                : 4497
# Avg. transcripts per gene  : 1.00
# TSL transcripts            : 0
#-----------------------------------------------
# Checked transcripts        :
#               AA sequences :   4140 ( 97.96% )
#              DNA sequences :   4322 ( 96.11% )
#-----------------------------------------------
# Protein coding transcripts : 4226
#              Length errors :     72 ( 1.70% )
#  STOP codons in CDS errors :     13 ( 0.31% )
#         START codon errors :    406 ( 9.61% )
#        STOP codon warnings :     14 ( 0.33% )
#              UTR sequences :      0 ( 0.00% )
#               Total Errors :    406 ( 9.61% )
# WARNING                    : No protein coding transcript has UTR
#-----------------------------------------------
# Cds                        : 4141
# Exons                      : 4564
# Exons with sequence        : 4564
# Exons without sequence     : 0
# Avg. exons per transcript  : 1.01
#-----------------------------------------------
# Number of chromosomes      : 1
# Chromosomes                : Format 'chromo_name size codon_table'
#               'Chromosome'    4641652 Standard
#-----------------------------------------------

00:00:01 Predicting variants
00:00:02        40000 variants (33277 variants per second), 39970 VCF entries
00:00:02        50000 variants (34818 variants per second), 49936 VCF entries
00:00:03        60000 variants (36057 variants per second), 59929 VCF entries
00:00:03        70000 variants (36978 variants per second), 69917 VCF entries
00:00:03        80000 variants (37700 variants per second), 79909 VCF entries
00:00:03        90000 variants (38314 variants per second), 89897 VCF entries
00:00:04        100000 variants (38804 variants per second), 99887 VCF entries

ERRORS: Some errors were detected
Error type      Number of errors
ERROR_CHROMOSOME_NOT_FOUND      106997

ADD REPLY • link 2.2 years ago by Jane • 0
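The verbose log above already lists the database chromosome names in its "Chromosomes" table (here a single entry, 'Chromosome'). As a sketch, assuming the log was saved to a file (for example by adding 2> snpeff.log to the command), the names can be pulled out with awk. The demo log below copies only the relevant lines from the output above.

```shell
log_dir=$(mktemp -d)

# Demo log: the chromosome table copied from the verbose snpEff output.
cat > "$log_dir/snpeff.log" <<'EOF'
# Number of chromosomes      : 1
# Chromosomes                : Format 'chromo_name size codon_table'
#               'Chromosome'    4641652 Standard
#-----------------------------------------------
EOF

# Print the quoted name on each row between "# Chromosomes" and the next dashed rule.
db_names=$(awk -v q="'" '
  /^# Chromosomes/ { in_table = 1; next }   # skip the format line itself
  /^#-+$/          { in_table = 0 }         # a dashed rule ends the table
  in_table && split($0, parts, q) >= 3 { print parts[2] }
' "$log_dir/snpeff.log")
printf '%s\n' "$db_names"
```

The names printed here are what the CHROM column of the VCF has to match.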

0

Try changing the chromosome name in the VCF to 'Chromosome'. That worked for me because the second line of the downloaded database file has the name 'Chromosome' right after the length (1042518), like this:

CHROMOSOME 2 1 0 1042518 Chromosome false true

For example: awk '{gsub(/NC_000117\.1/, "Chromosome"); print}' your_file.vcf > renamed.vcf
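The gsub one-liner replaces the string wherever it appears on a line. A slightly more careful sketch renames only the CHROM column and the matching ##contig header, using plain string comparison instead of a regex. The demo VCF content is made up so the snippet runs on its own.

```shell
ren_dir=$(mktemp -d)

# Demo VCF (hypothetical content); the INFO field also mentions the old name.
printf '##fileformat=VCFv4.2\n##contig=<ID=NC_000117.1,length=1042518>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nNC_000117.1\t100\t.\tA\tG\t50\tPASS\tNOTE=NC_000117.1\n' > "$ren_dir/in.vcf"

awk -v old="NC_000117.1" -v new="Chromosome" '
  BEGIN { FS = OFS = "\t" }
  # Rename the matching ##contig header line (exact ID match only).
  index($0, "##contig=<ID=" old) == 1 {
    rest = substr($0, length("##contig=<ID=" old) + 1)
    if (rest ~ /^[,>]/) $0 = "##contig=<ID=" new rest
  }
  # Rename column 1 of data records; all other columns are left untouched.
  !/^#/ && $1 == old { $1 = new }
  { print }
' "$ren_dir/in.vcf" > "$ren_dir/renamed.vcf"

cat "$ren_dir/renamed.vcf"
```

In the output, CHROM and the ##contig header read Chromosome while the INFO field still holds the original text.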

ADD REPLY • link 9 months ago by Eugenia • 0