Hi there,
I've been going round and round trying to figure out a way to redirect stdout to use with seqtk. I've read many posts that are similar, so I am confident the answer is out there, but since I'm having such a hard time I figured I might as well add my question and hopefully get some help!
I'm working with a bash script that extracts DNA sequences with a bed file and twoBitToFa . Using twoBitToFa to extract a sequence specified in a bed file and outputting the results to a fasta file is straightforward. I'd like to switch it up and instead redirect the sequence output back into the original bed file in a new column.
So, input bed file would look like this:
chr1 1125000 1128000
chr1 1130000 1131000
chr1 1137000 1138000
chr1 1142000 1143000
chr1 1144000 1145000
and the output would look like this:
chr1 1125000 1128000 GCGATCGATAGCTAGCTA
chr1 1130000 1131000 ATCTATATCCTAGATAGAT
chr1 1137000 1138000 GCGATCGATAGCTAGCTA
chr1 1142000 1143000 ATCTATATCCTAGATAGAT
chr1 1144000 1145000 GCGATCGATAGCTAGCTA
I am trying to skip the intermediate step of making a fasta file. So far, I have been able to output the sequence data in my terminal by using stdout
like so:
awk '{print $0" "$1":"$2"-"$3}' "input.bed" |
twoBitToFa -bed=stdin -udcDir=. "$twobit" stdout \
`
Stdout in terminal looks like:
>chr1:1125000_1128000
GCGATCGATAGCTAGCTA
ATCTATATCCTAGATAGAT
>chr1:1130000_1131000
ATCTATATCCTAGATAGAT
GCGATCGATAGCTAGCTA
>chr1:1137000_1138000
GCGATCGATAGCTAGCTA
ATCTATATCCTAGATAGAT
>chr1:1142000_1143000
ATCTATATCCTAGATAGAT
GCGATCGATAGCTAGCTA
>chr1:1144000_1145000
GCGATCGATAGCTAGCTA
ATCTATATCCTAGATAGAT
But I can't figure out how to redirect that stdout to a 4th column in the original input bed file without creating an intermediate fasta file. I know that seqkit accepts input data from standard input, but I can't find an example of it, nor can I get any of the similar solutions to work. Here are some things I've tried:
# saving sequence ids to a variable and matching them with twobit stdout
seq=$(awk '{print $0" "$1":"$2"-"$3}' "input.bed) |
twoBitToFa -bed=stdin -udcDir=. "$twobit" stdout \ |
for i in seq; do grep $i stdin; done
# trying to use stdin with seqkit seq to join bed data with sequence data
awk '{print $0" "$1":"$2"-"$3}' "input.bed" |
twoBitToFa -bed=stdin -udcDir=. "$twobit" stdout \ |
seqkit seq awk '{print $0" "$1":"$2"-"$3}' "input.bed" stdin
# Another variation I found using seqtk with subseq
seq=$(awk '{print $0" "$1":"$2"-"$3}' "input.bed") |
twoBitToFa -bed=stdin -udcDir=. "$twobit" stdout \ |
seqtk subseq stdin $seq > testbed.bed
# saving both things to variables to try and match
seq=$(awk '{print $0" "$1":"$2"-"$3}' "input.bed") |
myvar=$(twoBitToFa -bed=stdin -udcDir=. "$twobit" stdout) | \
seqkit grep -r -f $seq $myvar
This is quite a mess! How can I redirect output from twoBitToFa and add it to the original bed file without creating an intermediate fasta file?
Thanks in advance!
I would exploit the fact that
twoBitToFa
is giving you the intervals in the fasta name, and usetr
to swap a bunch of delimiters (newline -> tab;>
-> newline, etc.):Hi! This maybe "too simple" solution, but, if you are getting a sequence for each interval of the bed file, and the order of the output sequences is the same as in the input bed file, could you:
twoBitToFa -bed=input.bed -udcDir=. "$twobit" stdout | grep -v '>' | paste input.bed - > output.bed
I haven't tried myself, so leaving this as a comment.
This is so close! If the sequences were all on the same line, I think this would be good enough, but I failed to show that the sequence stdout is actually on multiple lines. I've updated the question to show this now. If I could delete the new line characters first, then this might work?