I am working on a genomic analysis involving the evolution of repeat sequences in multiple species. To do that, I'm looking at repeat annotations from Ensembl. Unfortunately, there are lots of repeat types. For example, in the maize genome there are a total of 2,528 repeat types. Here are the 20 most common ones (along with the number of features):
dust 1354712
trf 1011673
RLC_opie_AC198173-5898 99893
RLX_ruda_AC202870-7495 74644
RLC_opie_AC201793-7083 72631
RLX_osed_AC191084-2931 71885
RLC_giepum_AC211251-11074 45512
RLC_opie_AC187207-1792 45084
HUCK1-I_ZM 38139
Gypsy-127_ZM-I 37880
RLG_xilon-diguus_AC203313-7774 33678
RLC_opie_AC197201-5474 30489
PREM2_ZM-int 28503
RLC_ji_AC213834-12382 27229
PREM2_ZM-LTR 26621
RLC_ji_AC211489-11215 26263
PREM1_ZM 24026
PREM1A_ZM_LTR 22852
RLX_iwik_AC203371-7824 21615
HUCK1-LTR_ZM 20066
Can somebody help with suggestions on how to classify these repeat types into several broad categories? More specifically, my questions are:
- If you had to classify all repeats into 4-5 (or so) categories, what would they be? e.g. satellite, LTR, etc.
- How would you go about transforming 2500 feature types into these 4-5 categories?
- Are you aware of any previous work that did something similar? Does my suggested approach even make sense?
BTW, I am aware of this documentation page, but did not find it very informative or useful since the definitions are rather loose.
Thanks!
Hello, liorglic, and thank you for the interesting solution. The result is a bed file, correct? How do I generate a true .gtf file instead, perhaps you could help me out?
Hi there, and sorry for the late reply. It is not trivial to convert a bed file to gtf/gff as these formats contain additional information. I guess one could create some degenerate form though. If you provide an example of the expected output, I might be able to help.
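To illustrate what such a degenerate form might look like, here is a minimal Python sketch. It is not the method from the original solution, only one possible conversion under stated assumptions: the function name `bed_to_gtf_lines`, the `source` value `bed2gtf`, and the choice of `exon` as the feature type are all made up for illustration, and the BED name column is simply reused as both `gene_id` and `transcript_id`. The only real conversion logic is the coordinate shift, since BED starts are 0-based and GTF starts are 1-based.

```python
import csv
import io


def bed_to_gtf_lines(bed_handle, source="bed2gtf", feature="exon"):
    """Yield one minimal GTF line per BED record (hypothetical helper).

    Assumptions: tab-separated BED with at least 3 columns; name, score
    and strand are taken from columns 4-6 when they are present.
    """
    for row in csv.reader(bed_handle, delimiter="\t"):
        # Skip blank lines and common BED header lines.
        if not row or row[0].startswith(("#", "track", "browser")):
            continue
        chrom, start, end = row[0], int(row[1]), int(row[2])
        name = row[3] if len(row) > 3 else f"{chrom}:{start}-{end}"
        score = row[4] if len(row) > 4 else "."
        strand = row[5] if len(row) > 5 else "."
        # GTF requires gene_id and transcript_id; reusing the BED name
        # for both is an assumption, not something the formats define.
        attrs = f'gene_id "{name}"; transcript_id "{name}";'
        # BED is 0-based, half-open; GTF is 1-based, inclusive,
        # so only the start coordinate shifts by one.
        yield "\t".join(
            [chrom, source, feature, str(start + 1), str(end),
             score, strand, ".", attrs]
        )


# Example with a single made-up record.
example_bed = io.StringIO("chr1\t99\t200\tRLC_opie\t0\t+\n")
for line in bed_to_gtf_lines(example_bed):
    print(line)
```

Note that repeats sharing a name would end up with the same `gene_id` here, so a downstream tool that expects unique identifiers may need a suffix added per locus. Whether a particular tool accepts this layout is something to check against its own requirements.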
Thank you for reaching out, liorglic!
The structure of the gtf files (for the purposes of velocyto, at least) is as follows:
Not sure what this score is though...
Here is an example from the UCSC browser rmsk output: