To do things entirely on the command line, one approach is to download the RepeatMasker analysis for your genome of interest directly from ISB.
For example, for hg38:
$ wget -qO- http://www.repeatmasker.org/genomes/hg38/RepeatMasker-rm405-db20140131/hg38.fa.out.gz | gunzip -c > hg38.fa.out
Then convert this RepeatMasker analysis to BED with BEDOPS convert2bed:
$ convert2bed --input=rmsk < hg38.fa.out > hg38.fa.out.bed
If you want broader repeat element category names as IDs in this file, use the following modification:
$ convert2bed --input=rmsk < hg38.fa.out | cut -f1-3,11 > hg38.fa.out.bed
This last conversion puts the following keywords into the ID field of hg38.fa.out.bed:
DNA
DNA/Kolobok
DNA/MULE-MuDR
DNA/Merlin
DNA/PIF-Harbinger
DNA/PiggyBac
DNA/TcMar
DNA/TcMar-Mariner
DNA/TcMar-Pogo
DNA/TcMar-Tc1
DNA/TcMar-Tc2
DNA/TcMar-Tigger
DNA/TcMar?
DNA/hAT
DNA/hAT-Ac
DNA/hAT-Blackjack
DNA/hAT-Charlie
DNA/hAT-Tag1
DNA/hAT-Tip100
DNA/hAT-Tip100?
DNA/hAT?
DNA?
DNA?/PiggyBac?
DNA?/hAT-Tip100?
LINE/CR1
LINE/Dong-R4
LINE/Jockey
LINE/L1
LINE/L1-Tx1
LINE/L2
LINE/Penelope
LINE/RTE-BovB
LINE/RTE-X
LTR
LTR/ERV1
LTR/ERV1?
LTR/ERVK
LTR/ERVL
LTR/ERVL-MaLR
LTR/ERVL?
LTR/Gypsy
LTR/Gypsy?
LTR?
Low_complexity
RC/Helitron
RC?/Helitron?
RNA
Retroposon/SVA
SINE/5S-Deu-L2
SINE/Alu
SINE/MIR
SINE/tRNA
SINE/tRNA-Deu
SINE/tRNA-RTE
SINE?/tRNA
Satellite
Satellite/acro
Satellite/centr
Satellite/telo
Simple_repeat
Unknown
rRNA
scRNA
snRNA
srpRNA
tRNA
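To check which category keywords ended up in your own file, you can list the distinct values of the ID column. This is a minimal sketch, assuming the four-column file produced by the cut step above; the function name list_bed_ids is made up for illustration.

```shell
# List the distinct values in the ID (fourth) column of a BED file.
# Assumes a tab-delimited file such as the hg38.fa.out.bed made above.
list_bed_ids() {
    cut -f4 "$1" | sort -u
}
```

Running list_bed_ids hg38.fa.out.bed should print one category per line.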
If you want to do everything in one pass:
$ wget -qO- http://www.repeatmasker.org/genomes/hg38/RepeatMasker-rm405-db20140131/hg38.fa.out.gz \
| gunzip -c \
| convert2bed --input=rmsk \
| cut -f1-3,11 \
> hg38.fa.out.bed
Use these kinds of streams where you can! It's a huge timesaver.
Once you have your RepeatMasker analysis as a BED file, you can do set operations with BEDOPS bedmap and your regions-of-interest:
$ bedmap --echo --echo-map-id --delim '\t' regions.bed hg38.fa.out.bed > answer.bed
The regions.bed file would be a sorted BED file containing regions-of-interest.
Regions-of-interest would be whichever subset of regions you want to investigate: exons, introns, 3'UTR, 5'UTR, upstream or downstream windows, etc.
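If your regions file is not sorted yet, something along these lines should get it into the order bedmap expects. This is a sketch: the function name sort_regions is made up, and the plain sort fallback is an assumption for machines where BEDOPS sort-bed is not on the PATH.

```shell
# Sort a BED file by chromosome (lexicographic), then start, then end.
# Uses BEDOPS sort-bed when available, otherwise falls back to sort.
sort_regions() {
    if command -v sort-bed >/dev/null 2>&1; then
        sort-bed "$1"
    else
        LC_ALL=C sort -k1,1 -k2,2n -k3,3n "$1"
    fi
}
```

For example, sort_regions unsorted_regions.bed > regions.bed, where unsorted_regions.bed stands in for whatever file holds your regions.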
The file answer.bed will contain each region-of-interest and, in the last column, the RepeatMasker repeat element categories that overlap that region (separated by semicolons when there is more than one overlapping element).
In other words, you can pipe this answer.bed file into awk or other scripts to count the number of repeat element category hits you get for regions-of-interest, or do other downstream statistics.
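As a rough sketch of that counting step, the awk below tallies how many regions-of-interest overlap each category. It assumes answer.bed is tab-delimited with the semicolon-separated bedmap ID list in the last column; the function name count_repeat_classes is made up for illustration.

```shell
# Count, per repeat category, the number of regions that overlap it.
# Each region counts a category once, even if several elements of that
# category overlap it; regions with no overlap (empty last column) are skipped.
count_repeat_classes() {
    awk -F'\t' '
        $NF != "" {
            n = split($NF, ids, ";")
            split("", seen)              # portable way to clear the array
            for (i = 1; i <= n; i++) {
                if (!(ids[i] in seen)) {
                    seen[ids[i]] = 1
                    count[ids[i]]++
                }
            }
        }
        END {
            for (c in count) printf "%s\t%d\n", c, count[c]
        }
    ' "$1" | sort -k1,1
}
```

Running count_repeat_classes answer.bed would print one category and its region count per line. If you would rather count every overlapping element instead of every region, drop the seen check.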
Thanks, genomax. Do you mean that I should first get the RepeatMasker track from UCSC in BED output format, and then take this file to BEDOPS to get the specific repeat class (SINE, LINE, ...) for each region? Is that right?