Hi everyone, I was trying to build the SnpEff database for Human herpesvirus 5 strain Merlin (https://www.ncbi.nlm.nih.gov/nuccore/AY446894.2) using the script provided by SnpEff (buildDbNcbi.sh), and I got the error shown in the Error message section below. I suspect the GenBank file itself causes it, perhaps a formatting problem in the .gbk file. Has anyone encountered a similar problem? How did you overcome it? What do you suggest?
Note: I later tried to build the database manually and got the same error. I then updated SnpEff to version 5.1 and tried again, with the same result.
I really appreciate any help you can provide.
To Reproduce
SnpEff version: 5.0
Genome version: AY446894.2
SnpEff full command line: bash ~/path-to-script/buildDbNcbi.sh AY446894.2
Output / Error message:
java.lang.RuntimeException: Error reading file '/path-to-data/data/AY446894.2/genes.gbk'
java.lang.RuntimeException: Transcript 'HHV5wtgr002' is already in Gene 'HHV5wtgr002'
Expected behavior: Building database
It seems the annotation contains two genes (probably identical?) at different positions (6759..8458 and 8250..8393) but with the same name (RL9A) and the same identifier (HHV5wtgr002). My guess is that SnpEff requires unique names for genes and transcripts.
Thank you for your input. I believe you guessed correctly. I deleted the redundant entries from the GenBank file. I am not sure that was the right approach, but it worked. Also, I was not interested in those regions anyway.
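For anyone hitting the same error, a quick way to spot such duplicates before building is to scan the GenBank feature table for identifiers that appear on more than one gene feature. The following is only a rough sketch, not SnpEff's own check: it assumes the clashing identifier is stored in a /locus_tag qualifier, the function name find_duplicate_locus_tags is made up for illustration, and the sample text is a mock excerpt built from the coordinates quoted above (the third gene, HHV5wtgr003, is invented), not copied from the real record.

```python
import re
from collections import defaultdict


def find_duplicate_locus_tags(gbk_text, feature_key="gene"):
    """Return {locus_tag: [locations]} for tags used by more than one feature.

    Hypothetical helper: relies on the standard GenBank layout, where the
    feature key starts at column 6 and the location at column 22.
    """
    locations = defaultdict(list)
    in_features = False
    current_location = None
    for line in gbk_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        # Any non-indented line (ORIGIN, CONTIG, //) ends the feature table.
        if in_features and line[:1] != " ":
            in_features = False
        if not in_features:
            continue
        key = line[5:20].strip()
        if key:
            # New feature: remember its location only if it is of interest.
            current_location = line[21:].strip() if key == feature_key else None
            continue
        match = re.match(r'\s*/locus_tag="([^"]+)"', line)
        if match and current_location is not None:
            locations[match.group(1)].append(current_location)
    return {tag: locs for tag, locs in locations.items() if len(locs) > 1}


def mock_feature(key, location, locus_tag):
    """Format one mock feature with GenBank column alignment (illustration only)."""
    return f"     {key:<16}{location}\n" + " " * 21 + f'/locus_tag="{locus_tag}"\n'


# Mock excerpt, assumed for illustration; not the real AY446894.2 record.
SAMPLE = (
    "FEATURES             Location/Qualifiers\n"
    + mock_feature("gene", "6759..8458", "HHV5wtgr002")
    + mock_feature("gene", "8250..8393", "HHV5wtgr002")
    + mock_feature("gene", "9000..9500", "HHV5wtgr003")
    + "ORIGIN\n"
)

duplicates = find_duplicate_locus_tags(SAMPLE)
for tag, locs in duplicates.items():
    print(tag, "appears at", ", ".join(locs))
```

To run it on a real file, read genes.gbk into a string and pass that instead of SAMPLE. Whether to delete or rename the reported entries is a judgment call; deleting them, as done above, also removes any annotation for those regions from the resulting database.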