Repbase has multiple releases a year, so I'm trying to build a script that reformats the concatenated .embl files into a format similar to the last database released for RepeatMasker. Every line starts with a two-letter code (couplet) that denotes the type of information it contains. The nucleotide sequence lines are not shown but are formatted identically.
Here is the format of the latest RepeatMasker .embl:
ID GYPSY68-LTR_AG repeatmasker; DNA; ANG; 108 BP.
CC GYPSY68-LTR_AG DNA
XX
XX
KW DNA/.
XX
CC consensus - See RepBase for additional annotations.
XX
CC RepeatMasker Annotations:
CC Type: DNA
CC SubType:
CC Species: root
CC SearchStages:
CC BufferStages:
And here is the formatting I have achieved so far. The ??? are not always present but denote absence of subfamily. There are several entries in the RepeatMasker library with missing fields for the entire annotation section.
ID IS1 repeatmasker; DNA; ???; 768 BP.
CC IS1 DNA
XX
KW ARTEFACT/.
XX
CC consensus - See RepBase for additional annotations.
XX
CC RepeatMasker Annotations:
CC Type: ARTEFACT
CC SubType:
CC Species: root
CC SearchStages: 10
CC BufferStages:
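To make the ID line layout concrete, here is a minimal sketch of pulling its fields apart. It assumes the layout of the two samples shown in this post (name and source, then three semicolon-separated fields) and treats ??? as a missing subfamily; `parse_id_line` is a hypothetical helper, and entries that deviate from this layout are rejected rather than guessed at.

```python
def parse_id_line(line):
    """Split an ID line into its fields.

    Assumed layout (taken from the samples in this post):
    ID <name> <source>; <type>; <subfamily or ???>; <length> BP.
    """
    if not line.startswith("ID"):
        raise ValueError("not an ID line: %r" % line)
    # Drop the leading "ID" code, then split on the semicolons.
    fields = [f.strip() for f in line[2:].split(";")]
    if len(fields) != 4:
        raise ValueError("unexpected ID layout: %r" % line)
    name, source = fields[0].split()
    third = fields[2]
    return {
        "name": name,
        "source": source,
        "type": fields[1],
        # "???" marks an absent subfamily in the samples above.
        "subfamily": None if third == "???" else third,
        "length": int(fields[3].split()[0]),
    }
```

For example, `parse_id_line("ID IS1 repeatmasker; DNA; ???; 768 BP.")["subfamily"]` is `None`, while the GYPSY68-LTR_AG line gives `"ANG"`.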
And here is my attempt at coding.
#!/usr/bin/python
input = open('Repbase.embl', 'r')
###concatenated files of new release
output = open('RepeatMaskerLib.embl','w')
statement="""CC ****************************************************************
CC *
CC RepeatMasker Database *
CC (C) 1997-2011 Genetic Information Research Institute *
CC All rights reserved *
CC *
CC Prepared by: Smit, A., Hubley, R. *
CC See accompanying README.html/README.txt for details. *
CC *
CC RELEASE YEARHERE; *
CC *
CC RepeatMasker software and database development and *
CC maintenance are currently funded by an NIH/NHGRI *
CC R01 grant HG02939-01 to Arian Smit. RepBase Update *
CC development and maintenance are funded by NIH/NLM grant *
CC No.2P41LM006252-07A1 to Jerzy Jurka. *
CC *
CC ****************************************************************
XX"""
output.write(statement + "\n")
badlines=('DT','DE','AC','RN','RP','RA', 'KW','RT','RL','DR', 'FH', 'FT', 'OS', 'OC', 'NM', 'CC', 'RX', 'RC')
###Comment lines start with character couplets, these are not used in the RepeatMasker .embl
def skip_badman(file):
    # Yield only the lines that do not start with one of the codes in badlines
    for line in file:
        if not line.startswith(badlines):
            yield line

for line in skip_badman(input):
    ####Here I'm hijacking the ID line as the place to jump in and reinsert only the comment lines used in the latest released RepeatMasker database file
    if line.startswith('ID'):
        new_line = line.split()
        output.write(line.replace('repbase', 'repeatmasker'))
        output.write("CC" + " " + new_line[1] + " " + new_line[3].replace(';', '') + "\n")
        output.write("XX" + "\n")
        output.write("DE" + " RepbaseID: " + new_line[1] + "\n")
        output.write("XX" + "\n")
        output.write("KW" + " " + new_line[3].replace(';', '') + "/." + "\n")
        output.write("XX" + "\n")
        output.write("CC" + " consensus - See RepBase for additional annotations." + "\n")
        output.write("XX" + "\n")
        output.write("CC" + " RepeatMasker Annotations:" + "\n")
        output.write("CC" + " Type: " + new_line[3].replace(';', '') + "\n")
        output.write("CC" + " SubType:" + "\n")
        output.write("CC" + " Species: root" + "\n")
        output.write("CC" + " SearchStages:" + "\n")
        output.write("CC" + " BufferStages:" + "\n")
        output.write("XX" + "\n")
        output.write("RC" + "\n")
    else:
        ###one-off correction for a 3-letter motif in satellite sequence
        output.write(line.replace('con','cnn'))
output.close()
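For comparison, here is a sketch of the same filtering and ID-line expansion written as two functions that work on any file-like object, which makes it easy to try on a few lines in memory before running it on a whole release. The names `annotation_block` and `convert` are made up for this sketch, the single-space separators follow the lines as displayed in this post (the real files may use wider padding), and the header banner and the one-off 'con' replacement are left out.

```python
import io

# Two-letter line codes dropped from the Repbase input (same tuple as badlines).
SKIPPED_CODES = ('DT', 'DE', 'AC', 'RN', 'RP', 'RA', 'KW', 'RT', 'RL', 'DR',
                 'FH', 'FT', 'OS', 'OC', 'NM', 'CC', 'RX', 'RC')


def annotation_block(name, rep_type):
    """Return the lines written after each ID line, in order."""
    return [
        "CC " + name + " " + rep_type,
        "XX",
        "DE RepbaseID: " + name,
        "XX",
        "KW " + rep_type + "/.",
        "XX",
        "CC consensus - See RepBase for additional annotations.",
        "XX",
        "CC RepeatMasker Annotations:",
        "CC Type: " + rep_type,
        "CC SubType:",
        "CC Species: root",
        "CC SearchStages:",
        "CC BufferStages:",
        "XX",
        "RC",
    ]


def convert(src, dst):
    """Copy src to dst, dropping skipped lines and expanding each ID line."""
    for line in src:
        if line.startswith(SKIPPED_CODES):
            continue
        if line.startswith('ID'):
            fields = line.split()
            name = fields[1]
            rep_type = fields[3].rstrip(';')
            dst.write(line.replace('repbase', 'repeatmasker'))
            dst.write("\n".join(annotation_block(name, rep_type)) + "\n")
        else:
            dst.write(line)


# Small in-memory trial run on three made-up input lines.
sample = io.StringIO("ID IS1 repbase; DNA; ???; 768 BP.\nDE example description\nXX\n")
result = io.StringIO()
convert(sample, result)
print(result.getvalue())
```

On real files, the same call would be wrapped in `with open('Repbase.embl') as src, open('RepeatMaskerLib.embl', 'w') as dst:` so that both handles are closed even if an entry fails to parse.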
I get this error when running RepeatMasker without -lib specified:
Checking for E. coli insertion elements
NCBIBlastSearchEngine::search: Error...compressed subject database (/home/hdd/4/RepeatMasker/Libraries//general/is.lib) does not exist!
at ./RepeatMasker line 2018.
WARNING: Retrying batch ( 1 ) [ 2,, 12131]...
RepeatMasker creates these .lib files in a temp directory during runtime, but I cannot figure out which field denotes a different .lib to be written to, or if other locations are being referenced.
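One quick thing to rule out is whether the file named in the error is present at all. The sketch below is a generic existence check, not anything RepeatMasker provides: `missing_files` is a hypothetical helper, and the only path known here is the general/is.lib entry from the message above. It checks the plain file only; whether RepeatMasker also expects derived index files next to it is not covered.

```python
import os


def missing_files(lib_dir, relative_paths):
    """Return the entries of relative_paths that are not files under lib_dir."""
    return [p for p in relative_paths
            if not os.path.isfile(os.path.join(lib_dir, p))]
```

Calling `missing_files('/home/hdd/4/RepeatMasker/Libraries', ['general/is.lib'])` returns an empty list if the file is there and `['general/is.lib']` if it is not.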
Thanks for answering! Really appreciate your work on the Transposome program. I tried the -nois and -no_low (forget the exact characters of that flag) and then it stalled out on creating the SINE lib.
I was unaware of the formatting script! Thanks!
So, did you get it working in the end? There should be an easy-to-use script to go from RepBase to RepeatMasker format. It must exist because they generate the files from RepBase (just not with every release). The RepeatMasker folks might have something like this if you ask.
Oh yeah, it worked great! I had no reason to stay with the embl format, I just had a case of tunnel vision.
Hi, @muppetleague - wondering if you remember how you managed to get "Repbase .embls to work as RepeatMasker .embl"...
Did you get it to support -species?
Did you get it to work without -no_is?
Thanks!