Question

How to pull out sequences with repeat elements from RepeatMasker output file?

8.7 years ago

san.san ▴ 190

Hi all,

I've successfully run RepeatMasker with hard mask and soft mask parameters and have been asked to pull out sequences which have masked repeat elements.

I'm new to the command line and can only use grep, awk, etc. in a very basic way.

Would anyone be able to help me with this?

Thanks!

repeatmasker sequence sorting filtering • 5.9k views

score 3 · Answer 1 · 2016-03-08

Since I'm working on a cluster and don't have bedtools installed, or the privileges to install it (either on the cluster or on my local machine), I came up with this workaround:

1. Change the space-separated .out file into a tab-delimited file:

cat FILE.out | tr -s ' ' | sed 's/^ *//g' | tr ' ' '\t' > FILE.out.tab

2. Drop the first three lines (the header of the .out file), extract the fifth column with the sequence names, and get rid of the duplicates. The header has to be removed before sorting, otherwise sort -u moves the header words in among the real names:

tail -n +4 FILE.out.tab | cut -f5 | sort -u > repeat.sequence.names.list

3. Put each sequence of your .masked file on a single line for easier manipulation (the command spans two lines, and the second line must start with a literal > sign, directly after the backslash and line break):

sed '/>/s/$/</g' < FILE.masked | tr -d '\n' | tr '<' '\n' | sed 's/>/\
>/g' | grep . > FILE.masked.1

4. Use the one-line .masked file to pull out sequences with repeats:

grep -A1 -f repeat.sequence.names.list FILE.masked.1 | grep -v "^--$" > masked.sequences.repeat.fasta

If you have GNU parallel installed, you can also run the grep step across multiple CPU cores.
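
A possible variant of the steps above that skips the tab conversion and the one-line reformatting. This is a minimal sketch, assuming the usual RepeatMasker .out layout (three header lines, query sequence name in column 5). The function names repeat_names and fasta_subset are made up for illustration. Note that grep -f treats each list entry as a pattern that can match anywhere in a line, so seq1 would also pull in seq10; the awk filter below compares whole sequence IDs instead.

```shell
# Hypothetical helpers; FILE.out and FILE.masked follow the naming used above.

# Print the unique query-sequence names (column 5) of a RepeatMasker .out file.
# NR > 3 skips the three header lines; awk splits on runs of whitespace itself.
repeat_names() {
    awk 'NR > 3 && NF >= 5 { print $5 }' "$1" | sort -u
}

# Print only the FASTA records whose ID (first word after ">") is in the list.
# Arguments: names list, FASTA file. Multi-line sequences are kept as they are.
fasta_subset() {
    awk '
        NR == FNR { keep[$1] = 1; next }   # first file: load the wanted IDs
        /^>/      { id = substr($1, 2); show = (id in keep) }
        show                               # print header and sequence lines
    ' "$1" "$2"
}

# Usage:
# repeat_names FILE.out > repeat.sequence.names.list
# fasta_subset repeat.sequence.names.list FILE.masked > masked.sequences.repeat.fasta
```

Because the filter works record by record, the .masked file does not need to be converted to one line per sequence first.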

score 1 · Answer 2 · 2016-03-07

8.7 years ago

Devon Ryan 104k

Since you mention being familiar with awk:

1. Convert the RepeatMasker text output to a BED file (something like awk 'NR>3 {OFS="\t"; print $5, $6-1, $7}', which skips the three header lines and takes the sequence name, start and end, though you should check the columns and probably include the strand).
2. Use bedtools getfasta with a FASTA file and the BED file from step one.
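
The conversion can be sketched as follows, with the strand included. This assumes the standard .out column order (1 score, 5 query sequence, 6 start, 7 end, 9 strand given as + or C, 10 repeat name); rm_out_to_bed and genome.fa are placeholder names, not anything from RepeatMasker or bedtools.

```shell
# Hypothetical helper: convert a RepeatMasker .out file to 6-column BED.
# BED starts are 0-based, so the 1-based start in column 6 is reduced by one.
# RepeatMasker writes "C" for the complementary strand; BED expects "-".
rm_out_to_bed() {
    awk 'BEGIN { OFS = "\t" }
         NR > 3 && NF >= 11 {
             strand = ($9 == "C") ? "-" : "+"
             # name = repeat name, score = raw SW score (not clamped to 0-1000)
             print $5, $6 - 1, $7, $10, $1, strand
         }' "$1"
}

# Usage (genome.fa stands for whichever FASTA file RepeatMasker was run on):
# rm_out_to_bed FILE.out > repeats.bed
# bedtools getfasta -s -fi genome.fa -bed repeats.bed -fo repeats.fa
```

With -s, bedtools getfasta reverse-complements the intervals marked "-". Keep in mind that this extracts the repeat regions themselves, not the whole sequences that contain them.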

I'm sorry, what do you mean by checking the strand? Thanks!

Reply • 8.7 years ago by san.san ▴ 190

Look at the columns, one of them has a strand, which you might want to include.

Reply • 8.7 years ago by Devon Ryan 104k

By the way, RepeatMasker's .out file isn't a tab-delimited file, unfortunately. Otherwise I reckon I could cut and sort -u the .out file that contains the names of sequences with repeat elements and grep those from my .masked file :/

Reply • 8.7 years ago by san.san ▴ 190

That's the benefit of awk: it handles the space-padded, aligned columns of the file without any conversion (granted, you could fix this with sed too).
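
To illustrate the difference, here is a small sketch on one made-up line in the style of a .out record (the values are invented for the example):

```shell
# One invented RepeatMasker-style line with leading and repeated spaces.
line='   239  29.4  1.9  1.0  seq1   100  200  (300)  +  L1  LINE/L1'

# cut treats every single space as a delimiter, so field 5 is empty here.
cut_field=$(printf '%s\n' "$line" | cut -d' ' -f5)

# awk's default splitting skips leading blanks and collapses runs of whitespace.
awk_field=$(printf '%s\n' "$line" | awk '{ print $5 }')

printf 'cut: [%s]\nawk: [%s]\n' "$cut_field" "$awk_field"
# prints:
# cut: []
# awk: [seq1]
```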

Reply • 8.7 years ago by Devon Ryan 104k

I used your kindly provided awk command on my .out file and ended up with this:

http://postimg.org/image/k0kt25hg7/

Which doesn't have the info I need :/ But I came up with another workaround, so it's all good.

Reply • 8.7 years ago by san.san ▴ 190