Output fasta file with some sequences as the reverse complement
5.1 years ago
casey • 0

Hi all,

First time post and relatively new to bioinformatics but hoping to find a solution to my problem.

I am trying to write an awk script that takes a FASTA file containing a set of very similar sequences. Some of them are from the negative strand and others are from the positive strand, and I would like to output them all in the same direction. A sequence is from the positive strand if its 9th position is "G"; when that matches, I want to replace the sequence with its reverse complement.

I don't have much yet. I thought I could pipe the output of awk to revseq, but I was unsure how to keep the headers:

awk -F '' '$9 =="G"' | revseq

As a basic example (note that the header of each sequence does begin with a >):

seq1
ACT
seq2
ATG
seq3
ATT

If the 3rd position is "T", replace the sequence with its reverse complement, so the output would look like this:

Output:

seq1
AGT
seq2
ATG
seq3
AAT
genome sequence • 2.0k views

Just a side note: It's complement, not compliment.

5.1 years ago
Jianyu ▴ 580

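Try this (it assumes each sequence sits on a single line, so the headers are the odd-numbered lines):

awk '{if(NR%2) {print} else if(/^[ATCG]{2}T/) {system("echo "$0" | rev | tr ATCG TAGC")} else {print}}' test.fa

The ^ anchors the pattern to the start of the line, so only the 3rd position is tested. If you want to use the 9th position to determine the direction, replace 2 with 8 (and T with G, as in your question):

awk '{if(NR%2) {print} else if(/^[ATCG]{8}G/) {system("echo "$0" | rev | tr ATCG TAGC")} else {print}}' test.fa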
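
A possible alternative, in case it helps: the one-liner above starts a shell pipeline for every matching sequence, which can get slow on large files. The sketch below does the reverse complement inside awk itself. The function name revcomp_if and its three arguments (position, base, file) are made up for this example and are not part of any tool. It treats any line starting with > as a header rather than relying on odd/even line numbers, but it still assumes each sequence is on a single line and uses uppercase A, C, G, T (other characters are reversed but left as they are).

```shell
# revcomp_if POS BASE FILE
# Hypothetical helper: reverse-complement every sequence line whose
# character at position POS equals BASE; print everything else unchanged.
revcomp_if() {
  awk -v pos="$1" -v base="$2" '
    BEGIN {
      # complement lookup table (uppercase DNA only)
      comp["A"] = "T"; comp["T"] = "A"
      comp["C"] = "G"; comp["G"] = "C"
    }
    /^>/ { print; next }                  # header line: pass through
    substr($0, pos, 1) == base {          # base at POS matches
      out = ""
      # walk the sequence backwards, complementing each base
      for (i = length($0); i >= 1; i--) {
        c = substr($0, i, 1)
        out = out ((c in comp) ? comp[c] : c)
      }
      print out
      next
    }
    { print }                             # no match: print as is
  ' "$3"
}
```

For the toy example in the question this would be called as revcomp_if 3 T test.fa, and for the real data as revcomp_if 9 G test.fa. If the sequences are wrapped over several lines, they would need to be joined onto one line first.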

This is exactly what I was chasing, Thanks!
