Redirect Stdout From twoBitToFa in Bash
1
0
Entering edit mode
23 months ago
Laura ▴ 50

Hi there,

I've been going round and round trying to figure out a way to redirect stdout to use with seqtk. I've read many posts that are similar, so I am confident the answer is out there, but since I'm having such a hard time I figured I might as well add my question and hopefully get some help!

I'm working with a bash script that extracts DNA sequences with a bed file and twoBitToFa . Using twoBitToFa to extract a sequence specified in a bed file and outputting the results to a fasta file is straightforward. I'd like to switch it up and instead redirect the sequence output back into the original bed file in a new column.

So, input bed file would look like this:

chr1    1125000 1128000
chr1    1130000 1131000
chr1    1137000 1138000
chr1    1142000 1143000
chr1    1144000 1145000

and the output would look like this:

chr1    1125000 1128000 GCGATCGATAGCTAGCTA
chr1    1130000 1131000 ATCTATATCCTAGATAGAT
chr1    1137000 1138000 GCGATCGATAGCTAGCTA
chr1    1142000 1143000 ATCTATATCCTAGATAGAT
chr1    1144000 1145000 GCGATCGATAGCTAGCTA

I am trying to skip the intermediate step of making a fasta file. So far, I have been able to output the sequence data in my terminal by using stdout like so:

awk '{print $0" "$1":"$2"-"$3}' "input.bed" | 
twoBitToFa -bed=stdin -udcDir=. "$twobit" stdout \
`

Stdout in terminal looks like:

>chr1:1125000_1128000 
GCGATCGATAGCTAGCTA
ATCTATATCCTAGATAGAT
>chr1:1130000_1131000 
ATCTATATCCTAGATAGAT
GCGATCGATAGCTAGCTA
>chr1:1137000_1138000 
GCGATCGATAGCTAGCTA
ATCTATATCCTAGATAGAT
>chr1:1142000_1143000 
ATCTATATCCTAGATAGAT
GCGATCGATAGCTAGCTA
>chr1:1144000_1145000 
GCGATCGATAGCTAGCTA
ATCTATATCCTAGATAGAT

But I can't figure out how to redirect that stdout to a 4th column in the original input bed file without creating an intermediate fasta file. I know that seqkit accepts input data from standard input, but I can't find an example of it, nor can I get any of the similar solutions to work. Here are some things I've tried:

# saving sequence ids to a variable and matching them with twobit stdout 

seq=$(awk '{print $0" "$1":"$2"-"$3}' "input.bed) | 
twoBitToFa -bed=stdin -udcDir=. "$twobit" stdout \ |
for i in seq; do grep $i stdin; done


# trying to use stdin with seqkit seq to join bed data with sequence data

awk '{print $0" "$1":"$2"-"$3}' "input.bed" | 
twoBitToFa -bed=stdin -udcDir=. "$twobit" stdout \ |
seqkit seq awk '{print $0" "$1":"$2"-"$3}' "input.bed" stdin


# Another variation I found using seqtk with subseq

seq=$(awk '{print $0" "$1":"$2"-"$3}' "input.bed") | 
twoBitToFa -bed=stdin -udcDir=. "$twobit" stdout \ |
seqtk subseq stdin $seq > testbed.bed


# saving both things to variables to try and match

seq=$(awk '{print $0" "$1":"$2"-"$3}' "input.bed") |
myvar=$(twoBitToFa -bed=stdin -udcDir=. "$twobit" stdout) | \
seqkit grep -r -f $seq $myvar

This is quite a mess! How can I redirect output from twoBitToFa and add it to the original bed file without creating an intermediate fasta file?

Thanks in advance!

stdout bed stdin twobittofa bash • 1.1k views
ADD COMMENT
0
Entering edit mode

I would exploit the fact that twoBitToFa is giving you the intervals in the fasta name, and use tr to swap a bunch of delimiters (newline -> tab; > -> newline, etc.):

awk '{print $0" "$1":"$2"-"$3}' "input.bed" | 
twoBitToFa -bed=stdin -udcDir=. "$twobit" stdout |
tr '\n' '\t' | tr '>' '\n' | tr ':' '\t' | tr '_' '\t'
ADD REPLY
0
Entering edit mode

Hi! This maybe "too simple" solution, but, if you are getting a sequence for each interval of the bed file, and the order of the output sequences is the same as in the input bed file, could you:

twoBitToFa -bed=input.bed -udcDir=. "$twobit" stdout | grep -v '>' | paste input.bed - > output.bed

I haven't tried myself, so leaving this as a comment.

ADD REPLY
0
Entering edit mode

This is so close! If the sequences were all on the same line, I think this would be good enough, but I failed to show that the sequence stdout is actually on multiple lines. I've updated the question to show this now. If I could delete the new line characters first, then this might work?

ADD REPLY
0
Entering edit mode
23 months ago
iraun 6.2k

Alright, so based on the new information, you have to linearize the fasta (from multi line to single line).

twoBitToFa -bed=input.bed -udcDir=. "$twobit" stdout | awk '/^>/ {printf("%s",(NR>1?"\n":""),$0);next;} {printf("%s",$0);} END {printf("\n")}' | paste input.bed - > output.bed

I don't have the option to test the command at the moment. I hope it works :).

ADD COMMENT

Login before adding your answer.

Traffic: 1973 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6