Since I'm working on a cluster and don't have bedtools installed, or the privileges to install it (either on the cluster or on my local machine), I came up with this workaround:
1. Change the space-separated .out file into a tab-delimited file:
cat FILE.out | tr -s ' ' | sed 's/^ *//g' | tr ' ' '\t' > FILE.out.tab
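2. Skip the three header lines of the .out file (two header rows and a blank line), extract the fifth column with the sequence names, and get rid of the duplicates. The header has to be dropped before sorting, because after sort -u its entries are no longer guaranteed to be the first three lines:
tail -n +4 FILE.out.tab | cut -f5 | sort -u > repeat.sequence.names.list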
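3. Put each sequence in your .masked file on a single line (a header line followed by one sequence line) for easier manipulation. The command spans two lines, and you do have to type the > sign at the start of the second line: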
sed '/>/s/$/</g' < FILE.masked | tr -d '\n' | tr '<' '\n'| sed 's/>/\
>/g' | grep . > FILE.masked.1
4. Use the FILE.masked.1 file from step 3 and the name list from step 2 to pull out the sequences with repeats:
grep -A1 -f repeat.sequence.names.list FILE.masked.1 | grep -v "^--$" > masked.sequences.repeat.fasta
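For reference, the same four steps can be sketched as three small shell functions. This is only a sketch under a few assumptions: the .out file has the usual three header lines, the function names are made up for illustration, and awk is swapped in for tr/cut/grep. awk splits on any whitespace, so the tab conversion isn't needed, and the lookup compares whole sequence IDs, so seq1 doesn't also pull out seq10.

```shell
# Hypothetical helper names; none of these are part of RepeatMasker.

# Print the unique query sequence names (column 5) of a RepeatMasker .out file.
# Assumes three header lines (two header rows and a blank line).
repeat_names() {
    tail -n +4 "$1" | awk 'NF >= 5 { print $5 }' | sort -u
}

# Rewrite a FASTA file so every record is a header line plus ONE sequence line.
linearize_fasta() {
    awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
         { seq = seq $0 }
         END { if (seq != "") print seq }' "$1"
}

# Keep only the records whose ID (first word after ">") is in the name list.
# $1 = name list, $2 = linearized FASTA
extract_repeat_records() {
    awk 'FILENAME == ARGV[1] { keep[$1] = 1; next }
         /^>/ { id = substr($1, 2); hit = (id in keep) }
         hit' "$1" "$2"
}

# Usage, with the file names from the steps above:
#   repeat_names FILE.out > repeat.sequence.names.list
#   linearize_fasta FILE.masked > FILE.masked.1
#   extract_repeat_records repeat.sequence.names.list FILE.masked.1 > masked.sequences.repeat.fasta
```

The exact-ID lookup is a design choice rather than a requirement. The plain grep -f version is fine as long as no sequence name is a substring of another.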
There's also the option of running grep on multiple CPU cores if you have parallel installed. See how to here.
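One way the multi-core variant might look, as a rough sketch rather than a tested recipe: split the name list into chunks and let GNU parallel give each chunk to its own grep as the pattern file. The function name grep_repeats is made up, the -F and -w flags (fixed strings, whole words) are my additions to the grep from step 4, and the function falls back to a single grep when GNU parallel isn't found.

```shell
# Hypothetical helper: grep_repeats NAME_LIST LINEARIZED_FASTA
# Uses GNU parallel when it is available, otherwise a single grep.
grep_repeats() {
    names=$1
    fasta=$2
    if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
        # Cut the name list into chunks of up to 1000 lines and pass each
        # chunk on stdin to its own grep, which reads it as patterns ("-f -").
        parallel --pipe -L1000 --round-robin \
            grep -A1 -F -w -f - "$fasta" < "$names"
    else
        grep -A1 -F -w -f "$names" "$fasta"
    fi | grep -v '^--$'   # drop the "--" separators that grep -A1 prints
}

# Usage, with the file names from the steps above:
#   grep_repeats repeat.sequence.names.list FILE.masked.1 > masked.sequences.repeat.fasta
```

With parallel, the records may come out in a different order than in the input file, and for a short name list there is little to gain because everything fits into one chunk.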
FYI, you can always install software into your home directory (typically ~/bin). You don't need elevated privileges for that.
Ah, yes! Would have saved me a lot of time. Thanks!