Question

Classifying repeat features from Ensembl

2

Entering edit mode

2.4 years ago

liorglic ★ 1.5k

I am working on a genomic analysis involving the evolution of repeat sequences in multiple species. To do that, I'm looking at repeat annotations from Ensembl. Unfortunately, there are lots of repeat types. For example, in the maize genome there are a total of 2,528 repeat types. Here are the 20 most common ones (along with the number of features):

dust    1354712
trf     1011673
RLC_opie_AC198173-5898  99893
RLX_ruda_AC202870-7495  74644
RLC_opie_AC201793-7083  72631
RLX_osed_AC191084-2931  71885
RLC_giepum_AC211251-11074       45512
RLC_opie_AC187207-1792  45084
HUCK1-I_ZM      38139
Gypsy-127_ZM-I  37880
RLG_xilon-diguus_AC203313-7774  33678
RLC_opie_AC197201-5474  30489
PREM2_ZM-int    28503
RLC_ji_AC213834-12382   27229
PREM2_ZM-LTR    26621
RLC_ji_AC211489-11215   26263
PREM1_ZM        24026
PREM1A_ZM_LTR   22852
RLX_iwik_AC203371-7824  21615
HUCK1-LTR_ZM    20066

Can somebody help with suggestions on how to classify these repeat types into several broad categories? More specifically, my questions are:

If you had to classify all repeats into 4-5 (or so) categories, what would they be? e.g. satellite, LTR, etc.
How would you go about transforming 2500 feature types into these 4-5 categories?
Are you aware of any previous work that did something similar? Does my suggested approach even makes sense?

BTW, I am aware of this documentation page, but did not find it very informative or useful since the definitions are rather loose.

Thanks!

ensembl repeat • 1.9k views

ADD COMMENT • link updated 14 months ago by e.r.zakiev ▴ 260 • written 2.4 years ago by liorglic ★ 1.5k

1

Entering edit mode

2.4 years ago

Felix ▴ 90

It sounds like a good approach to me, but I would think you will need more than 4-5 categories.

trf and dust are obviously just tandem repeat and low-complexity regions and the classification is based on the software used to detect them.

For the specific repeat types: In case your institut has access to RepBase you could query the names against it in an semi-automated way.

Otherwise there might be software to help with the classification, e.g. I can see COSEG a "program which automatically identifies repeat subfamilies".

ADD COMMENT • link 2.4 years ago by Felix ▴ 90

score 3 · Accepted Answer · 2023-01-03

Update: just in case somebody gets here in the future.
Turns out the repeat class is available from Ensembl, but AFAIK it is only exposed through the Perl, but not the REST API. Each repeat feature has a related "repeat consensus" entry, which in turn has the attributes "repeat_class" and "repeat_type", which contain the class information. For instance, here are the counts of repeats by type in the maize annotation:

1368363 Dust
    392 Low complexity regions
 915503 LTRs
   1883 RNA repeats
  10986 Satellite repeats
  16069 Simple repeats
1021919 Tandem repeats
 202445 Type II Transposons
  24523 Type I Transposons/LINE
   2745 Type I Transposons/SINE
2080702 Unknown

To fetch this information, I used the following Perl script (pardon my Perl coding...):

use strict;
use Bio::EnsEMBL::Registry;
use List::Util qw(min);

my ($species, $group, $out_bed) = @ARGV;
open(FH, '>', $out_bed) or die $!;

my $registry = 'Bio::EnsEMBL::Registry';

if ($group eq "vertebrates"){
$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org', # alternatively 'useastdb.ensembl.org'
    -user => 'anonymous');
}
else{
$registry->load_registry_from_db(
    -host => 'mysql-eg-publicsql.ebi.ac.uk', # alternatively 'useastdb.ensembl.org'
    -port => 4157,
    -user => 'anonymous');
}

my $W = 10000000;

my $slice_adaptor = $registry->get_adaptor( $species, 'Core', 'Slice',  );
my @slices = @{ $slice_adaptor->fetch_all('toplevel') };
foreach my $slice (@slices) {
    my $chrom = $slice->seq_region_name();
    my $chrom_length = $slice->length();
    my $s = 1;
    my $e = min($W, $chrom_length);
    while ($e <= $chrom_length){
        my $subslice = $slice->sub_Slice($s,$e);
        my @repeats = @{ $subslice->get_all_RepeatFeatures() };
        foreach my $repeat (@repeats) {
            my $rc = $repeat->repeat_consensus();
            printf FH ( "%s\t%d\t%d\t%s\t%s\t%s\t%s\n",
                    $chrom, $repeat->start(), $repeat->end(), $repeat->display_id(), $rc->repeat_class , $rc->repeat_type, $rc->name);
        }
        last if ($e == $chrom_length);
        $s = $e + 1;
        $e = min($s + $W, $chrom_length);
    }
}
close(FH);

I hope somebody finds this useful some day.