I'm attempting to annotate my M. tuberculosis variant VCF files using SnpEff. The genome build I used for the variant calling was gi|559794742|gb|CP003248.2| and the error I get when running the code below is: EFF=(MODIFIER||||||||||1|ERROR_CHROMOSOME_NOT_FOUND). I attempted to change the name of the chromosome to match that of the genome build (H37RV) used by SnpEff but it did not work. here is the genome link from SnpEff.config ftp://ftp.sanger.ac.uk/pub/pathogens/Mycobacterium/tuberculosis.
I too attempted to change this to the link the the genome I used which returned the same error.
I think snpEff documentation states that reference used should share the same name across alignment, variant calling and variant annotation. Just to address the obvious, are all three the same in your case? I see you mention variant calling and variant annotation, but what about alignment?
I encountered a similar problem and thought I'd rather hijack this thread than create a new one.
I had the problem with chr not found and it was indeed the case of different names used for aligning and annotation in snpEff. The problem is though, that I am working with bacteria and I built the reference for snpEff from a GenBank file which uses a different name for the same exact thing in .fasta and .gbk formats.
fasta: gi|556503834|ref|NC_000913.3|
genbank: NC_000913
You can see from the names that it is indeed the same thing (the genbank one has the same 000913.3 version noted elsewhere), but as in the alignment the whole header is used as reference name it confuses snpEff.
Is it possible to change the name of the reference in snpEff? Or what line do I have to modify in .gbk to use the same name for building the reference as is used in the fasta file?
I got similar issue with my bacterial snp annotation, that is chromosome_not_found error. I am trying to follow your method to modify the vcf file, but couldn't figure out which line.
Here are several lines of my vcf file:
##fileformat=VCFv4.0
##Reference genome=GCA_000008865_1_ASM886v1_genomic
##INFO=<\ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<\ID=AF,Number=.,Type=Float,Description="Allele Frequency">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GCA_000006665_1_ASM666v1_genomic GCA_000008865_1_ASM886v1_genomic
1 494154 AAAAAAACCG.AGCGCAAATA_R G A . . NS=205;AF=0.003 GT 0 0
1 40998 AAAAAAAGCC.TCTTCGTCGC_F T C . . NS=196;AF=0.003 GT 0 0
1 4531974 AAAAAAAGCG.AAATCTGGCA_F C T . . NS=212;AF=0.003 GT 0 0
1 18983 AAAAAAATAG.GCTTCCAGGG_R G T . . NS=197;AF=0.003 GT 0 0
1 1477420 AAAAAAATCC.GCTGCCGATA_R T C . . NS=200;AF=0.006 GT 0 0
1 1013276 AAAAAACCCA.CAACCTTGAA_F T C . . NS=200;AF=0.003 GT 0 0
1 254058 AAAAAACCCA.GGCGGGCGTT_R T C . . NS=206;AF=0.003 GT 0 0
1 461873 AAAAAACCGG.AAACCGGACT_F A C . . NS=167;AF=0.003 GT 0 0
1 2363022 AAAAAACCGG.CAGTTTGAGC_R T C . . NS=181;AF=0.006 GT 0 0
1 494140 AAAAAACCGT.GCGTATTTGC_F G A . . NS=204;AF=0.003 GT 0 0
You can see my comment above A: snpEFF chr not found error bacterial genome.
Basically, SNPeff uses the Ensembl genomes and the chromosome name for bacterial genomes is "Chromosome", so you either have to change the naming of the #CHROM if your vcf file to this as explained above or create your own database using the NCBI reference and the script in the SNPeff folder called buildDbNcbi.sh. Let me know if that's unclear.
sed -i 's/CurrentName_inVCF/ReplacementName/g' "FileToEdit.vcf"
This will replace the CurrentName_inVCF with the ReplacementName (the name you want to change it to), and the -i flag will do it in place, you can remove the -i and pipe it to a file like you did. If you don't get it working you can send me a file and I can check it out.
Can you help me more? How can I know CurrentName in my vcf file ? I am pretty new here. I will give one or two more try and then if I don't succeed I will send it to you.
Thanks
Sanjay
Sanjay Gautam, B.Sc. MLT, M.Sc. Med. Microbiology
PhD Scholar/School of Medicine/ Breathe Well Center
University of Tasmania, 17 Liverpool Street, Hobart TAS 7000
M +61416181795 / E.mail: Sanjay.Gautam@utas.edu.au
Not sure, maybe you aligned to a different reference?
java -jar snpEff.jar download Mycobacterium_tuberculosis_h37rv_gca_000277735
# download the reference you spoke of
cd "Mycobacterium_tuberculosis_h37rv_gca_000277735"
zcat snpEffectPredictor.bin | head -n 2
# Its chromosome name is "Chromosome"
# Make sure this is the exact same as the fasta you aligned to or you will get incorrect results.
#check the name of the CHROM in your vcf file
awk '{print $1}' test.vcf | tail -n1
#convert your chrom names to "Chromosome", i my case its NC_000962.3
sed 's/NC_000962.3/Chromosome/g' "test.vcf" > test.edited.vcf
#run snp eff
java -jar -v snpEff.jar "Mycobacterium_tuberculosis_h37rv_gca_000277735" test.edited.vcf > test.ann.vcf
# it annotates but I get the error "WARNING_REF_DOES_NOT_MATCH_GENOME" for some positions, this is because i aligned to a different reference "NC_000962.3" not Mycobacterium_tuberculosis_h37rv_gca_000277735
The output is SNP=0 INS=1428 Del=48 MIXED=6. I don't understand why it is putting all SNP in INS category. The numbers are different from previous result as I changed threshold settings in varscan.
Is it because I am using varscan to call variants that meet desired thresholds where I used Varscan output file (.vcf) using the command:
This does not happen if I do not use varscan .vcf. If I use .vcf file converted from .bcf directly to snpEff then I can see the SNP numbers. I am using varscan just to define the threshold. Do you think there is next option for this?
Your vcf file isn't formatted correctly, it doesn't have headers. I don't use varscan but maybe look at its output options (see the manual: --output-vcf If set to 1, outputs in VCF format). Otherwise, I use vcftools for filtering which you could try.
I have identified the SNP from draft genome and annotated the vcf variant file in snpEff. while annotating i found an error CHROMOSOME_NOT_FOUND and also in annotated file EFF=(MODIFIER||||||||||1|ERROR_CHROMOSOME_NOT_FOUND). The database for that strain was not available so created as mentioned in tutorial. I am unable to fix the problem. Please do suggest me where to change the name of chromosome because its in contigs form.
They way I got around that is to check the SnpEff file for the chromosome name it uses. Just gunzip the .bin file in the SnpEff data folder which matches your reference (they use the Ensembl naming, I believe you are looking for GCA_000005845.2). In my case the name of the chromosome was Chromosome. Then you can just run the below as a shell script and it will replace you reference name with the one SnpEff uses, then you can run SnpEff. Following that you can use the same code with the reverse sed argument and rename the chromosome correctly again. Alternatively you can rename the chromosome in the SnpEff file. Just make sure you are using the correct reference.
cd "folder with VCFs"
for f in $(ls *.vcf)
do
sed -i 's/gi|556503834|ref|NC_000913.3|/Chromosome/g' "$f"
echo "Processing $f"
done
I think snpEff documentation states that reference used should share the same name across alignment, variant calling and variant annotation. Just to address the obvious, are all three the same in your case? I see you mention variant calling and variant annotation, but what about alignment?
HI RamRS, I did use the same reference across alignment (BWA), variant calling (GATK) and for the annotation.
Thank you for that clarification.
Hey everyone.
I encountered a similar problem and thought I'd rather hijack this thread than create a new one.
I had the problem with chr not found and it was indeed the case of different names used for aligning and annotation in snpEff. The problem is though, that I am working with bacteria and I built the reference for snpEff from a GenBank file which uses a different name for the same exact thing in .fasta and .gbk formats.
fasta:
gi|556503834|ref|NC_000913.3|
genbank:
NC_000913
You can see from the names that it is indeed the same thing (the genbank one has the same 000913.3 version noted elsewhere), but as in the alignment the whole header is used as reference name it confuses snpEff.
Is it possible to change the name of the reference in snpEff? Or what line do I have to modify in .gbk to use the same name for building the reference as is used in the fasta file?
See below answer
Hi, ScienceBuff,
I got similar issue with my bacterial snp annotation, that is chromosome_not_found error. I am trying to follow your method to modify the vcf file, but couldn't figure out which line.
Here are several lines of my vcf file:
Here is my snp call reference headers:
Here is the snpeff data bin file (downloaded from snpeff database):
Could you please help me with it?
Thank you very much for your time.
C.
Hi ScienceBuff,
I hope you managed to solve your query regarding snpEff/MTB. I am having the same problem. Can you suggest me to solve this one?Thanks
Hi Sanjay.
You can see my comment above A: snpEFF chr not found error bacterial genome. Basically, SNPeff uses the Ensembl genomes and the chromosome name for bacterial genomes is "Chromosome", so you either have to change the naming of the #CHROM if your vcf file to this as explained above or create your own database using the NCBI reference and the script in the SNPeff folder called buildDbNcbi.sh. Let me know if that's unclear.
Hi ScienceBuff,
Can you please guide me how can I change the naming of CHROM to Chromosome?
I tried to follow http://snpeff.sourceforge.net/SnpEff_faq.html
I used
cat variant.vcf | grep -v "^#" | cut -f 1 | uniq
to view the chromosome name and came up withas an output.
Now I have to use
cat input.vcf | sed "s/^INPUT_CHR_NAME/SNPEFF_CHR_NAME/" > input_updated_chr.vcf
to change the chromosome name.I am wondering which one is input chromosome name and which one is SNPEFF
Thank you
Hi Sanjay.
This will replace the
CurrentName_inVCF
with theReplacementName
(the name you want to change it to), and the-i
flag will do it in place, you can remove the-i
and pipe it to a file like you did. If you don't get it working you can send me a file and I can check it out.Hi
Can you help me more? How can I know CurrentName in my vcf file ? I am pretty new here. I will give one or two more try and then if I don't succeed I will send it to you.
Thanks
Sanjay
Sanjay Gautam, B.Sc. MLT, M.Sc. Med. Microbiology
PhD Scholar/School of Medicine/ Breathe Well Center
University of Tasmania, 17 Liverpool Street, Hobart TAS 7000
M +61416181795 / E.mail: Sanjay.Gautam@utas.edu.au
You can use
Hi,
I finally fixed the problem of
ERROR_CHROMOSOME_NOT_FOUND
. Thank you for the suggestion.But, the snpEff output shows
SNP=0 INS=1508 Del=51 MIXED=6
.However in varscan, for the same file, there were 1484 SNP and 75 Indels out of 1550 SNV's.
I ran a command:
to annotate variants in vcf file that was produced using command:
Thank you
Not sure, maybe you aligned to a different reference?
Hi,
Thank you for your suggestions.
I did align the file with
Mycobacterium_tuberculosis_h37rv_gca_000277735
I changed the name to "Chromosome" and I double checked it to confirm. Then I ran with snpEff using the same command:
The output is
SNP=0 INS=1428 Del=48 MIXED=6
. I don't understand why it is putting all SNP in INS category. The numbers are different from previous result as I changed threshold settings in varscan.Is it because I am using varscan to call variants that meet desired thresholds where I used Varscan output file (.vcf) using the command:
This does not happen if I do not use varscan .vcf. If I use .vcf file converted from .bcf directly to snpEff then I can see the SNP numbers. I am using varscan just to define the threshold. Do you think there is next option for this?
I have added variants.vcf file and .bcf file in google drive. I wonder if you can please have a look at it and suggest me best possible options. Files here: https://drive.google.com/open?id=12thrtc8Jg3PWg5G_yqrL6yrcKma7yhv7
Thank you
Your vcf file isn't formatted correctly, it doesn't have headers. I don't use varscan but maybe look at its output options (see the manual: --output-vcf If set to 1, outputs in VCF format). Otherwise, I use vcftools for filtering which you could try.
So try
Hi,
Thanks for your help. It finally worked.
Just wondering what commands do you use to filter variants in vcftools?
Glad it finally worked. Depends what Im doing but generally for SNPs.
HI,
I have identified the SNP from draft genome and annotated the vcf variant file in snpEff. while annotating i found an error
CHROMOSOME_NOT_FOUND
and also in annotated fileEFF=(MODIFIER||||||||||1|ERROR_CHROMOSOME_NOT_FOUND)
. The database for that strain was not available so created as mentioned in tutorial. I am unable to fix the problem. Please do suggest me where to change the name of chromosome because its in contigs form.See below, you can just change it in your vcf file, or in the fasta you used to generate the database. (also see here)