Question

Output fasta file with some sequences as the reverse complement

6.0 years ago

casey • 0

Hi all,

First time post and relatively new to bioinformatics but hoping to find a solution to my problem.

I am trying to write an awk script that takes a FASTA file containing a set of very similar sequences. Some of them are from the negative strand and others are from the positive strand, and I am hoping to output all of them in the same direction. I know a sequence is from the positive strand if its 9th position is "G"; when that matches, I want to replace the sequence with its reverse complement.

I don't have much yet. I thought I could pipe the output of awk to revseq, but I was unsure how to keep the headers:

awk -F '' '$9 =="G"' | revseq

As a basic example: (note the headers of each sequence do begin with a >)

seq1
ACT
seq2
ATG
seq3
ATT

If the 3rd position is "T", replace the sequence with its reverse complement, so the output would look like this:

Output:

seq1
AGT
seq2
ATG
seq3
AAT

genome sequence • 2.3k views

Just a side note: It's complement, not compliment.

Comment by Ram (45k), 6.0 years ago

score 2 · Answer 1 · 2019-11-06

6.0 years ago

Jianyu ▴ 580

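Try this. It assumes every sequence sits on a single line, so headers are the odd-numbered lines, and the ^ anchors the pattern to the start of the sequence so that only the 3rd base is tested:

awk '{if(NR%2) {print} else if(/^[ATCG]{2}T/) {system("echo "$0" | rev | tr ATCG TAGC")} else {print}}' test.fa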

If you want to use the 9th position to determine the direction, replace 2 with 8 (and T with G, since the question tests for "G" at that position):

awk '{if(NR%2) {print} else if(/^[ATCG]{8}G/) {system("echo "$0" | rev | tr ATCG TAGC")} else {print}}' test.fa
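
The same idea can probably be done inside awk alone, which avoids starting a shell pipeline for every sequence. What follows is only a sketch, not a tested replacement for revseq: `orient_fasta` is a hypothetical helper name, the position and base are passed as arguments, and it assumes uppercase ACGT sequences that each sit on a single line after a header starting with ">".

```shell
# Hypothetical helper (name and arguments are assumptions, not from the thread).
# Usage: orient_fasta POS BASE [FILE]
# Reverse-complements every sequence whose base at position POS equals BASE.
# Assumes one sequence line per header; multi-line records are not handled.
orient_fasta() {
    awk -v pos="$1" -v base="$2" '
        /^>/ { print; next }                      # header lines pass through
        substr($0, pos, 1) == base {              # test the single base at pos
            out = ""
            for (i = length($0); i >= 1; i--) {   # walk the sequence backwards
                c = substr($0, i, 1)
                if      (c == "A") c = "T"
                else if (c == "T") c = "A"
                else if (c == "C") c = "G"
                else if (c == "G") c = "C"
                out = out c                       # other characters are kept as-is
            }
            print out
            next
        }
        { print }                                 # non-matching sequences unchanged
    ' "${3:--}"                                   # read FILE, or stdin if omitted
}
```

With the toy example from the question, `orient_fasta 3 T test.fa` should turn ACT into AGT and ATT into AAT while leaving ATG alone; for the real data the call would be `orient_fasta 9 G test.fa`.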