Question

snpEFF chr not found error bacterial genome

0

Entering edit mode

9.9 years ago

SemiQuant ▴ 80

HI,

I'm attempting to annotate my M. tuberculosis variant VCF files using SnpEff. The genome build I used for the variant calling was gi|559794742|gb|CP003248.2| and the error I get when running the code below is: EFF=(MODIFIER||||||||||1|ERROR_CHROMOSOME_NOT_FOUND). I attempted to change the name of the chromosome to match that of the genome build (H37RV) used by SnpEff but it did not work. here is the genome link from SnpEff.config ftp://ftp.sanger.ac.uk/pub/pathogens/Mycobacterium/tuberculosis.

I too attempted to change this to the link the the genome I used which returned the same error.

java -jar snpEff.jar -i vcf -o vcf m_tuberculosis ./test1.vcf

snpeff SNP annotation • 10k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by SemiQuant ▴ 80

0

Entering edit mode

I think snpEff documentation states that reference used should share the same name across alignment, variant calling and variant annotation. Just to address the obvious, are all three the same in your case? I see you mention variant calling and variant annotation, but what about alignment?

ADD REPLY • link 9.9 years ago by Ram 44k

1

Entering edit mode

HI RamRS, I did use the same reference across alignment (BWA), variant calling (GATK) and for the annotation.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by SemiQuant ▴ 80

0

Entering edit mode

Thank you for that clarification.

ADD REPLY • link 2.7 years ago by Ram 44k

0

Entering edit mode

Hey everyone.

I encountered a similar problem and thought I'd rather hijack this thread than create a new one.

I had the problem with chr not found and it was indeed the case of different names used for aligning and annotation in snpEff. The problem is though, that I am working with bacteria and I built the reference for snpEff from a GenBank file which uses a different name for the same exact thing in .fasta and .gbk formats.

fasta: gi|556503834|ref|NC_000913.3|

genbank: NC_000913

You can see from the names that it is indeed the same thing (the genbank one has the same 000913.3 version noted elsewhere), but as in the alignment the whole header is used as reference name it confuses snpEff.

Is it possible to change the name of the reference in snpEff? Or what line do I have to modify in .gbk to use the same name for building the reference as is used in the fasta file?

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.8 years ago by markus.lippus ▴ 10

0

Entering edit mode

See below answer

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.8 years ago by SemiQuant ▴ 80

0

Entering edit mode

Hi, ScienceBuff,

I got similar issue with my bacterial snp annotation, that is chromosome_not_found error. I am trying to follow your method to modify the vcf file, but couldn't figure out which line.

Here are several lines of my vcf file:

##fileformat=VCFv4.0
##Reference genome=GCA_000008865_1_ASM886v1_genomic
##INFO=<\ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<\ID=AF,Number=.,Type=Float,Description="Allele Frequency">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  GCA_000006665_1_ASM666v1_genomic    GCA_000008865_1_ASM886v1_genomic
1   494154  AAAAAAACCG.AGCGCAAATA_R G   A   .   .   NS=205;AF=0.003 GT  0   0
1   40998   AAAAAAAGCC.TCTTCGTCGC_F T   C   .   .   NS=196;AF=0.003 GT  0   0
1   4531974 AAAAAAAGCG.AAATCTGGCA_F C   T   .   .   NS=212;AF=0.003 GT  0   0
1   18983   AAAAAAATAG.GCTTCCAGGG_R G   T   .   .   NS=197;AF=0.003 GT  0   0
1   1477420 AAAAAAATCC.GCTGCCGATA_R T   C   .   .   NS=200;AF=0.006 GT  0   0
1   1013276 AAAAAACCCA.CAACCTTGAA_F T   C   .   .   NS=200;AF=0.003 GT  0   0
1   254058  AAAAAACCCA.GGCGGGCGTT_R T   C   .   .   NS=206;AF=0.003 GT  0   0
1   461873  AAAAAACCGG.AAACCGGACT_F A   C   .   .   NS=167;AF=0.003 GT  0   0
1   2363022 AAAAAACCGG.CAGTTTGAGC_R T   C   .   .   NS=181;AF=0.006 GT  0   0
1   494140  AAAAAACCGT.GCGTATTTGC_F G   A   .   .   NS=204;AF=0.003 GT  0   0

Here is my snp call reference headers:

[login-node04 databaseWorkingO17]$ grep ">" GCA_000008865.1_ASM886v1_genomic.fna
>BA000007.2 Escherichia coli O157:H7 str. Sakai DNA, complete genome
>AB011549.2 Escherichia coli O157:H7 str. Sakai plasmid pO157 DNA, complete sequence
>AB011548.2 Escherichia coli O157:H7 str. Sakai plasmid pOSAK1 DNA, complete sequence

Here is the snpeff data bin file (downloaded from snpeff database):

[login-node03 Escherichia_coli_o157_h7_str_sakai]$ gunzip -c snpEffectPredictor.bin|head -20
SnpEff  4.3
CHROMOSOME  2   1   0   5498449 Chromosome  false
CHROMOSOME  3   1   0   92720   pO157   false
CHROMOSOME  4   1   0   3305    pOSAK1  false
GENOME  1   -1  0   2147483647  Escherichia_coli_o157_h7_str_sakai  false   Escherichia_coli_o157_h7_str_sakai  Escherichia_coli_o157_h7_str_sakai  2,3,4
EXON    7   6   725366  725474  EBG00001089957-1    false   cccaaaagaaaaccctcaccgtcaggcggcgagggtttaactcacatgatgatactgactgttgctcactctttgaagtgatttgcgtcacattcagggaattcctcaa   -1  1   cccaaaagaaaaccctcaccgtcaggcggcgagggtttaactcacatgatgatactgactgttgctcactctttgaagtgatttgcgtcacattcagggaattcctcaa   RETAINED
TRANSCRIPT  6   5   725366  725474  EBT00001692105  false   7   lincRNA false   false   false   false   false       1   -1  -1  
GENE    5   2   725366  725474  EBG00001089957  false   6   rnk_leader  lincRNA
EXON    10  9   2143588 2143965 BAB35560-1  false   atggttaatcagaagaaagatcgtctgcttaacgagtatctgtctccgctggatattaccgcggcacagtttaaggtgctctgctctatccgctgcgcggcgtgtattactccggttgaactgaaaaaagtgttgtcggtcgacctgggagcactgacccgtatgctggatcgcctggtctgtaaaggctgggtagaaaggttgccgaacccgaatgataagcgcggcgtactggtaaaacttaccaccagcggcgcggcaatatgtgaacaatgccatcaattagttggccaggacctgcatcaagaattaacaaaaaacctgacggcggacgaagtggcaacacttgagcatttgcttaagaaagtcctgccgtaa  0   1   atggttaatcagaagaaagatcgtctgcttaacgagtatctgtctccgctggatattaccgcggcacagtttaaggtgctctgctctatccgctgcgcggcgtgtattactccggttgaactgaaaaaagtgttgtcggtcgacctgggagcactgacccgtatgctggatcgcctggtctgtaaaggctgggtagaaaggttgccgaacccgaatgataagcgcggcgtactggtaaaacttaccaccagcggcgcggcaatatgtgaacaatgccatcaattagttggccaggacctgcatcaagaattaacaaaaaacctgacggcggacgaagtggcaacacttgagcatttgcttaagaaagtcctgccgtaa  RETAINED
CDS 11  9   2143588 2143965 CDS_Chromosome_2143589_2143963  false   0
TRANSCRIPT  9   8   2143588 2143965 BAB35560    false   10  protein_coding  true    false   true    false   false       1   -1  -1      11
GENE    8   2   2143588 2143965 BAB35560    false   9   ECs2137 protein_coding

Could you please help me with it?

Thank you very much for your time.

C.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 7.7 years ago by capricygcapricyg • 0

0

Entering edit mode

Hi ScienceBuff,

I hope you managed to solve your query regarding snpEff/MTB. I am having the same problem. Can you suggest me to solve this one?Thanks

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 6.6 years ago by sanjay.bpkihs • 0

0

Entering edit mode

Hi Sanjay.

You can see my comment above A: snpEFF chr not found error bacterial genome. Basically, SNPeff uses the Ensembl genomes and the chromosome name for bacterial genomes is "Chromosome", so you either have to change the naming of the #CHROM if your vcf file to this as explained above or create your own database using the NCBI reference and the script in the SNPeff folder called buildDbNcbi.sh. Let me know if that's unclear.

ADD REPLY • link 6.6 years ago by SemiQuant ▴ 80

0

Entering edit mode

Hi ScienceBuff,

Can you please guide me how can I change the naming of CHROM to Chromosome?

I tried to follow http://snpeff.sourceforge.net/SnpEff_faq.html

I used cat variant.vcf | grep -v "^#" | cut -f 1 | uniq to view the chromosome name and came up with

Chromosome
Chromosomeosome

as an output.

Now I have to use cat input.vcf | sed "s/^INPUT_CHR_NAME/SNPEFF_CHR_NAME/" > input_updated_chr.vcf to change the chromosome name.

I am wondering which one is input chromosome name and which one is SNPEFF

Thank you

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 6.6 years ago by sanjay.bpkihs • 0

0

Entering edit mode

Hi Sanjay.

sed -i 's/CurrentName_inVCF/ReplacementName/g' "FileToEdit.vcf"

This will replace the CurrentName_inVCF with the ReplacementName (the name you want to change it to), and the -i flag will do it in place, you can remove the -i and pipe it to a file like you did. If you don't get it working you can send me a file and I can check it out.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 6.6 years ago by SemiQuant ▴ 80

0

Entering edit mode

Hi

Can you help me more? How can I know CurrentName in my vcf file ? I am pretty new here. I will give one or two more try and then if I don't succeed I will send it to you.

Thanks
Sanjay

Sanjay Gautam, B.Sc. MLT, M.Sc. Med. Microbiology
PhD Scholar/School of Medicine/ Breathe Well Center
University of Tasmania, 17 Liverpool Street, Hobart TAS 7000
M +61416181795 / E.mail: Sanjay.Gautam@utas.edu.au

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 6.6 years ago by sanjay.bpkihs • 0

0

Entering edit mode

You can use

awk '{print $1}' myFile.vcf | tail -n1

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 6.6 years ago by SemiQuant ▴ 80

0

Entering edit mode

Hi,

I finally fixed the problem of ERROR_CHROMOSOME_NOT_FOUND. Thank you for the suggestion.

But, the snpEff output shows SNP=0 INS=1508 Del=51 MIXED=6.

However in varscan, for the same file, there were 1484 SNP and 75 Indels out of 1550 SNV's.

I ran a command:

java -Xmx4g -jar snpEff.jar Mycobacterium_tuberculosis_h37rv_gca_000277735 variants.vcf > R_variants.ann.vcf -v

to annotate variants in vcf file that was produced using command:

java -jar VarScan.v2.3.9.jar mpileup2cns mydata.mpileup --min-coverage 5 --min-var-freq 0.90 --p-value 0.005 --variants > variants.vcf

Thank you

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 6.6 years ago by sanjay.bpkihs • 0

0

Entering edit mode

Not sure, maybe you aligned to a different reference?

java -jar snpEff.jar download Mycobacterium_tuberculosis_h37rv_gca_000277735
# download the reference you spoke of
cd "Mycobacterium_tuberculosis_h37rv_gca_000277735"
zcat snpEffectPredictor.bin | head -n 2
# Its chromosome name is "Chromosome"
# Make sure this is the exact same as the fasta you aligned to or you will get incorrect results.

#check the name of the CHROM in your vcf file
awk '{print $1}' test.vcf | tail -n1

#convert your chrom names to "Chromosome", i my case its NC_000962.3
sed 's/NC_000962.3/Chromosome/g' "test.vcf" > test.edited.vcf

#run snp eff
java -jar -v snpEff.jar "Mycobacterium_tuberculosis_h37rv_gca_000277735" test.edited.vcf > test.ann.vcf

# it annotates but I get the error "WARNING_REF_DOES_NOT_MATCH_GENOME" for some positions, this is because i aligned to a different reference "NC_000962.3" not Mycobacterium_tuberculosis_h37rv_gca_000277735

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 6.6 years ago by SemiQuant ▴ 80

0

Entering edit mode

Hi,

Thank you for your suggestions.

I did align the file with Mycobacterium_tuberculosis_h37rv_gca_000277735

I changed the name to "Chromosome" and I double checked it to confirm. Then I ran with snpEff using the same command:

java -Xmx4g -jar snpEff.jar Mycobacterium_tuberculosis_h37rv_gca_000277735  variants.vcf  > variants.ann.vcf -v

The output is SNP=0 INS=1428 Del=48 MIXED=6. I don't understand why it is putting all SNP in INS category. The numbers are different from previous result as I changed threshold settings in varscan.

Is it because I am using varscan to call variants that meet desired thresholds where I used Varscan output file (.vcf) using the command:

java -jar VarScan.v2.3.9.jar mpileup2cns mydata.mpileup --min-coverage 10 --min-var-freq 0.90 --p-value 0.005 --variants > variants.vcf.

This does not happen if I do not use varscan .vcf. If I use .vcf file converted from .bcf directly to snpEff then I can see the SNP numbers. I am using varscan just to define the threshold. Do you think there is next option for this?

I have added variants.vcf file and .bcf file in google drive. I wonder if you can please have a look at it and suggest me best possible options. Files here: https://drive.google.com/open?id=12thrtc8Jg3PWg5G_yqrL6yrcKma7yhv7

Thank you

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 6.6 years ago by sanjay.bpkihs • 0

0

Entering edit mode

Your vcf file isn't formatted correctly, it doesn't have headers. I don't use varscan but maybe look at its output options (see the manual: --output-vcf If set to 1, outputs in VCF format). Otherwise, I use vcftools for filtering which you could try.

So try

java -jar VarScan.v2.3.9.jar mpileup2cns mydata.mpileup --min-coverage 10 --min-var-freq 0.90 --p-value 0.005 --variants --output-vcf 1 > variants.vcf

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 6.6 years ago by SemiQuant ▴ 80

0

Entering edit mode

Hi,

Thanks for your help. It finally worked.

Just wondering what commands do you use to filter variants in vcftools?

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 6.6 years ago by sanjay.bpkihs • 0

0

Entering edit mode

Glad it finally worked. Depends what Im doing but generally for SNPs.

bcftools isec -i "MIN(DP)>10 & QUAL>30 & (MAF[0]==0 | MAF[0]>0.2)" -e "TYPE='indels'" \
  --threads 4 -n+2 -O z $vcfA $vcfB -p tmp bcftools filter --SnpGap 5 tmp/0000.vcf.gz > tmp.filt.vcf

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 6.6 years ago by SemiQuant ▴ 80

0

Entering edit mode

HI,

I have identified the SNP from draft genome and annotated the vcf variant file in snpEff. while annotating i found an error CHROMOSOME_NOT_FOUND and also in annotated file EFF=(MODIFIER||||||||||1|ERROR_CHROMOSOME_NOT_FOUND). The database for that strain was not available so created as mentioned in tutorial. I am unable to fix the problem. Please do suggest me where to change the name of chromosome because its in contigs form.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 4.2 years ago by Sbioinfo • 0

0

Entering edit mode

See below, you can just change it in your vcf file, or in the fasta you used to generate the database. (also see here)

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 4.2 years ago by SemiQuant ▴ 80

Ram · Accepted Answer · 2015-02-11

2

Entering edit mode

9.8 years ago

SemiQuant ▴ 80

Hi Markus,

They way I got around that is to check the SnpEff file for the chromosome name it uses. Just gunzip the .bin file in the SnpEff data folder which matches your reference (they use the Ensembl naming, I believe you are looking for GCA_000005845.2). In my case the name of the chromosome was Chromosome. Then you can just run the below as a shell script and it will replace you reference name with the one SnpEff uses, then you can run SnpEff. Following that you can use the same code with the reverse sed argument and rename the chromosome correctly again. Alternatively you can rename the chromosome in the SnpEff file. Just make sure you are using the correct reference.

cd "folder with VCFs"
for f in $(ls *.vcf)
do
  sed -i 's/gi|556503834|ref|NC_000913.3|/Chromosome/g' "$f"
  echo "Processing $f"
done

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.8 years ago by SemiQuant ▴ 80

0

Entering edit mode

Hi SemiQuant,

Thank you for your answer!

Today I have encountered the same problem. I could find your answer in SnpEff documentation: http://snpeff.sourceforge.net/SnpEff_faq.html

Damianos

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 5.5 years ago by Damianos P. Melidis ▴ 40

1

Entering edit mode

Great, glad it helped!

ADD REPLY • link 5.4 years ago by SemiQuant ▴ 80