Question

Repeat elements, SINEs, LINEs, LTRs in specific regions of gene

0

6.8 years ago

Kian ▴ 50

Hi, I have a list of more than 1000 genes. I want to calculate the frequency of repeat elements such as SINEs, LINEs and LTRs in several regions of these genes: exons, introns, 3' UTR, 5' UTR, upstream and downstream. For each specific region, I want to know how many LINEs, SINEs and LTRs there are.

chrom   strand  Start       End         LINE     SINE
chr4    +       5104898     524438      86.00    80
chr4    +       11912008    11924714    1        20

repeatmasker sine line repeat elements ucsc • 5.1k views

ADD COMMENT • link updated 6.8 years ago by Alex Reynolds 36k • written 6.8 years ago by Kian ▴ 50

score 3 · Answer 1 · 2018-04-29

3

6.8 years ago

GenoMax 149k

You can get the RepeatMasker track as a BED file and then intersect it with your list using BEDtools or BEDOPS.

ADD COMMENT • link 6.8 years ago by GenoMax 149k

0

Thanks, dear GenoMax. You mean I first get the RepeatMasker track from UCSC with BED as the output format, and then take this file to BEDOPS to get the specific repeats (SINE, LINE, ...) for each region? Is that right?

ADD REPLY • link 6.8 years ago by Kian ▴ 50

score 2 · Answer 2 · 2018-04-29

To do things entirely on the command line, one approach is to download the RepeatMasker analysis for your genome of interest directly from ISB.

For example, for hg38:

$ wget -qO- http://www.repeatmasker.org/genomes/hg38/RepeatMasker-rm405-db20140131/hg38.fa.out.gz | gunzip -c > hg38.fa.out

Then convert this RepeatMasker analysis to BED with BEDOPS convert2bed:

$ convert2bed --input=rmsk < hg38.fa.out > hg38.fa.out.bed

If you want broader repeat element category names as IDs in this file, use the following modification:

$ convert2bed --input=rmsk < hg38.fa.out | cut -f1-3,11 > hg38.fa.out.bed

This last conversion result puts the following keywords into the ID field of hg38.fa.out.bed:

DNA
DNA/Kolobok
DNA/MULE-MuDR
DNA/Merlin
DNA/PIF-Harbinger
DNA/PiggyBac
DNA/TcMar
DNA/TcMar-Mariner
DNA/TcMar-Pogo
DNA/TcMar-Tc1
DNA/TcMar-Tc2
DNA/TcMar-Tigger
DNA/TcMar?
DNA/hAT
DNA/hAT-Ac
DNA/hAT-Blackjack
DNA/hAT-Charlie
DNA/hAT-Tag1
DNA/hAT-Tip100
DNA/hAT-Tip100?
DNA/hAT?
DNA?
DNA?/PiggyBac?
DNA?/hAT-Tip100?
LINE/CR1
LINE/Dong-R4
LINE/Jockey
LINE/L1
LINE/L1-Tx1
LINE/L2
LINE/Penelope
LINE/RTE-BovB
LINE/RTE-X
LTR
LTR/ERV1
LTR/ERV1?
LTR/ERVK
LTR/ERVL
LTR/ERVL-MaLR
LTR/ERVL?
LTR/Gypsy
LTR/Gypsy?
LTR?
Low_complexity
RC/Helitron
RC?/Helitron?
RNA
Retroposon/SVA
SINE/5S-Deu-L2
SINE/Alu
SINE/MIR
SINE/tRNA
SINE/tRNA-Deu
SINE/tRNA-RTE
SINE?/tRNA
Satellite
Satellite/acro
Satellite/centr
Satellite/telo
Simple_repeat
Unknown
rRNA
scRNA
snRNA
srpRNA
tRNA
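If you only care about the broad class (LINE, SINE, LTR) rather than the family, one option is to strip everything after the slash in the ID column. This is a minimal sketch, assuming a four-column BED like the one produced above; the three sample rows and the file names example.rmsk.bed and example.rmsk.class.bed are made up for illustration:

```shell
# Hypothetical sample rows in the four-column layout (chrom, start, end, class/family).
printf 'chr1\t100\t200\tLINE/L1\nchr1\t300\t400\tSINE/Alu\nchr1\t500\t600\tLTR/ERVL-MaLR\n' > example.rmsk.bed

# Keep only the part of column 4 before the first slash, e.g. LINE/L1 -> LINE.
awk 'BEGIN { FS = OFS = "\t" } { sub(/\/.*/, "", $4); print }' example.rmsk.bed > example.rmsk.class.bed

cat example.rmsk.class.bed
```

Uncertain calls keep their question mark under this rule (LTR? stays LTR?, DNA?/PiggyBac? becomes DNA?), so you would need to decide separately whether to count those.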

If you want to do everything in one pass:

$ wget -qO- http://www.repeatmasker.org/genomes/hg38/RepeatMasker-rm405-db20140131/hg38.fa.out.gz \
    | gunzip -c \
    | convert2bed --input=rmsk \
    | cut -f1-3,11 \
    > hg38.fa.out.bed

Use these kinds of streams where you can! It's a huge timesaver.

Once you have your RepeatMasker analysis as a BED file, you can do set operations with BEDOPS bedmap and your regions-of-interest:

$ bedmap --echo --echo-map-id --delim '\t' regions.bed hg38.fa.out.bed > answer.bed

The regions.bed file would be a BED file containing regions-of-interest, sorted (for example, with BEDOPS sort-bed).

Regions-of-interest would be one of the subsets of regions you want to investigate: exons, introns, 3' UTRs, 5' UTRs, upstream or downstream windows, etc.
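As a rough illustration of building one such subset, here is a sketch that derives strand-aware upstream windows from gene coordinates with awk. Everything specific in it is an assumption: the BED6 input layout, the 1000-base window size, the three example genes and the example.* file names. Plain sort stands in for sort-bed only so the sketch runs without BEDOPS installed:

```shell
# Hypothetical gene coordinates in BED6 (chrom, start, end, name, score, strand).
printf 'chr4\t5000\t9000\tgeneA\t0\t+\nchr4\t20000\t24000\tgeneB\t0\t-\nchr4\t300\t900\tgeneC\t0\t+\n' > example.genes.bed

# Emit a window of W bases upstream of each gene's 5' end, respecting strand.
awk -v W=1000 'BEGIN { FS = OFS = "\t" }
{
    if ($6 == "+") { s = $2 - W; e = $2 }      # upstream lies before the start
    else           { s = $3;     e = $3 + W }  # upstream lies after the end
    if (s < 0) s = 0                           # clamp at the chromosome start
    print $1, s, e, $4, $5, $6
}' example.genes.bed | LC_ALL=C sort -k1,1 -k2,2n > example.upstream.bed

cat example.upstream.bed
```

Downstream windows follow the same pattern with the two branches swapped. The sketch does not clamp windows at the chromosome end, which would need a chromosome sizes file.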

The file answer.bed will contain each region-of-interest, followed in the last column by a semicolon-delimited list of the RepeatMasker repeat element categories that overlap that region.

In other words, you can pipe this answer.bed file into awk or other scripts to count the number of repeat element category hits you get for regions-of-interest, or do other downstream statistics.
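For instance, a per-region tally of LINE, SINE and LTR hits could look like the sketch below. It assumes the bedmap output layout described above (region columns, then one semicolon-delimited ID list in the last column); the sample rows and the example.* file names are invented so the snippet runs on its own:

```shell
# Hypothetical bedmap output: three region columns plus a semicolon-delimited ID list.
# The third region overlaps no repeats, so its last field is empty.
printf 'chr4\t100\t900\tLINE/L1;SINE/Alu;SINE/MIR\nchr4\t1000\t1900\tLTR/ERVK;LINE/L2\nchr4\t2000\t2900\t\n' > example.answer.bed

# Count LINE, SINE and LTR hits per region; the family after the slash is ignored.
# Note: prefix matching also counts uncertain calls such as LTR? or SINE?/tRNA.
awk 'BEGIN { FS = OFS = "\t"; print "chrom", "start", "end", "LINE", "SINE", "LTR" }
{
    line = sine = ltr = 0
    n = split($NF, ids, ";")
    for (i = 1; i <= n; i++) {
        if      (ids[i] ~ /^LINE/) line++
        else if (ids[i] ~ /^SINE/) sine++
        else if (ids[i] ~ /^LTR/)  ltr++
    }
    print $1, $2, $3, line, sine, ltr
}' example.answer.bed > example.counts.tsv

cat example.counts.tsv
```

These are counts of overlapping elements, not covered bases. If you want a frequency, you could divide each count by the region length (end minus start) in the same awk script.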