Hi everyone, I was trying to build the SnpEff database for Human herpesvirus 5 strain Merlin (https://www.ncbi.nlm.nih.gov/nuccore/AY446894.2) using the script provided with SnpEff (buildDbNcbi.sh), and I got the error shown in the Error message section below. I suspect the GenBank file itself is the cause, perhaps a formatting problem in the gbk file. Has anyone encountered a similar problem? How did you overcome it? What do you suggest?
Note: Later, I tried to build the database manually and got the same error. I then updated SnpEff to version 5.1 and tried again, but the error persisted.
I really appreciate any help you can provide.
To Reproduce
SnpEff version: 5.0
Genome version: AY446894.2
SnpEff full command line: bash ~/path-to-script/buildDbNcbi.sh AY446894.2
Output / Error message: java.lang.RuntimeException: Error reading file '/path-to-data/data/AY446894.2/genes.gbk' java.lang.RuntimeException: Transcript 'HHV5wtgr002' is already in Gene 'HHV5wtgr002'
Expected behavior: Building database
It seems the annotation contains two genes (probably identical?) at different positions (6759..8458 and 8250..8393) but with the same name (RL9A) and locus_tag (HHV5wtgr002). My guess is that SnpEff wants unique names for the genes and transcripts.
Thank you for your input. I believe you guessed correctly. I deleted the redundant entries from the GenBank file. I am not sure that was the right approach, but it worked. Also, I was not interested in those regions anyway.
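In case it helps anyone who hits the same error, here is a minimal sketch of how one might list repeated locus_tag values before editing the GenBank file by hand. It is not part of SnpEff; the function name `find_duplicate_locus_tags` is made up for illustration, and the parser assumes the usual GenBank flat-file layout (feature key indented by five spaces, qualifiers on the following lines). It only reports duplicates and does not decide which entry to keep.

```python
import re
from collections import defaultdict

# Assumption: standard GenBank flat-file layout, where a feature line is
# five spaces, the feature key, padding, then the location.
FEATURE_RE = re.compile(r"^ {5}(\S+)\s+(\S.*)$")
# Qualifier lines are indented further and start with a slash.
LOCUS_TAG_RE = re.compile(r'^\s+/locus_tag="([^"]+)"')


def find_duplicate_locus_tags(gbk_text, feature_key="gene"):
    """Return {locus_tag: [locations]} for tags used by more than one feature.

    Hypothetical helper for illustration; only features whose key equals
    `feature_key` are considered.
    """
    locations = defaultdict(list)
    current_location = None
    for line in gbk_text.splitlines():
        feature = FEATURE_RE.match(line)
        if feature:
            # Track the location only while inside a feature of interest.
            if feature.group(1) == feature_key:
                current_location = feature.group(2).strip()
            else:
                current_location = None
            continue
        tag = LOCUS_TAG_RE.match(line)
        if tag and current_location is not None:
            locations[tag.group(1)].append(current_location)
            # Record at most one locus_tag per feature.
            current_location = None
    return {t: locs for t, locs in locations.items() if len(locs) > 1}
```

Calling it on the contents of genes.gbk (for example `find_duplicate_locus_tags(open("genes.gbk").read())`) should list every locus_tag shared by several gene features together with their positions, which makes it easier to see which entries would need to be removed or renamed.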