fasta - reverse complement sequence
5
3
Entering edit mode
8.6 years ago
mbk0asis ▴ 700

Hi.

I have a fasta file with multiple headers and want to get reverse complement sequences. I usually use FASTX-TOOLKIT, but I want to learn how to do with linux commands.

I tried it using 'awk' and 'if' something like below, but I got a tedious result.

awk '{if($1 ~ />/) print $0; else gsub("[ATCG]","[TAGC]");print $0}' TEST.fa

As you can see, the file has the same string in the header.

How can I get reverse complement sequences without altering the headers?

> ATCG_1
agtcgacATCGATtataggta
> ATCG_2
cgactgcagtcgacATCGACT

Thank you!

fasta reverse complement bash • 17k views
ADD COMMENT
2
Entering edit mode

This is the kind of thing one shouldn't bother doing with awk or similar tools. Sure, you can come up with a solution, but why bother?

ADD REPLY
0
Entering edit mode

I have this script for quick rev comp, but it assumes pure sequence as input (no headers).

#!/bin/bash
if [ ! -z "$1" ]; then
    echo "$1" | tr "[ATGCatgc]" "[TACGtacg]" | rev
else
    echo ""
    echo "usage: rev_comp_seq DNASEQUENCE"
    echo ""
fi
ADD REPLY
11
Entering edit mode
8.6 years ago

using 'tr', and 'rev', 2 lines per sequence

 cat input.fa | while read L; do  echo $L; read L; echo "$L" | rev | tr "ATGC" "TACG" ; done
ADD COMMENT
0
Entering edit mode

Thank you, Pierre! It works well. What exactly is the difference between $L and "$L"? How does it work?

ADD REPLY
0
Entering edit mode

$L and "$L" there is no difference while read L: the programs reads one line (the header), we print the line; we read the very next line into L; we print the last L, we reverse and translate it.

ADD REPLY
0
Entering edit mode

Thank you! I understood!

ADD REPLY
0
Entering edit mode

Thank! But I also have problem: that command can't work for sequencs with several lines.

ADD REPLY
0
Entering edit mode

linearize the sequence and make a 'two line' fasta

ADD REPLY
3
Entering edit mode
8.6 years ago
venu 7.1k

Assuming 2 lines per sequence (header and sequence), if not linearize and try something like following or modify more to get how you need

cat fasta.fa | paste - - | awk -F'\t' -vOFS='\t' '{gsub("A", "T", $2); gsub("T", "A", $2); gsub("G", "C", $2); gsub("C", "G", $2); print}' | tr '\t' '\n'
ADD COMMENT
1
Entering edit mode

Thank you for the answer! However, the code didn't work because gsubs serially change the sequences. Firstly. all A's are turned to T's, then second 'gsub' change them back to A's. So finally, all the sequences become a pool of A's and G's. Can I use gsub to change multiple strings at once?

ADD REPLY
1
Entering edit mode
8.6 years ago
iraun 6.2k

Perl?

perl -nle'BEGIN {
    @map{ A, a, C, c, G, g, T, t } = ( T, t, G, g, C, c, A, a )
    }
    print /^>/ ?
        $_ :
              join //, map $map{ $_ }, split //, scalar reverse
    ' file.fa
ADD COMMENT
0
Entering edit mode
8.6 years ago
chen ★ 2.5k

Try OpenGene (https://github.com/OpenGene/OpenGene.jl) to get reverse complement very easily with an operator ~

julia> using OpenGene

julia> seq = dna("AAATTTCCCGGGATCGATCGATCG")
dna:AAATTTCCCGGGATCGATCGATCG

julia> ~seq
dna:CGATCGATCGATCCCGGGAAATTT

ADD COMMENT
0
Entering edit mode
7.2 years ago
dugong • 0

Based on Pierre's answer here is a working solution for multi-line fasta.

Changes:

  • You do not lose undetermined sequences "N"-s
  • Works on multi line fasta: it escapes transformation on lines starting with '>'

As a 'one'-liner:

cat play.fa | while read L; do if [[ $L =~ ^'>' ]]; then echo $L; else echo $L | rev | tr "ATGC" "TACG" ; fi ; done

As a bash function

function _revcomplement.file { cat $1 | while read L; do if [[ $L =~ ^'>' ]]; then echo $L; else echo $L | rev | tr "ATGC" "TACG" ; fi ; done } ;

Call:

revcomplement.file play.fa

Tried on play.fa:

>MACHU
AGTCACCTTTACCCGGTTTCANNN
AGTCGCCTTTACCCGGTTTCA
CCCCCGGGGGGGGGGGGGGGG
AGTCACCTTTACCCGGTTTCA
AGTCACCTTTACCCGGTTTCA
AGTCGCCTTTACCCGGTTTCA
>PICCHU
ACTGCAGACACAACTACGGGGTTGTGGAGAGCTTCACAGTGCAGCGGCGAGGTGAGCGCGGCGCGGGGCGGGGCCTGAGTCCCTGTGAGCTGGGAATCTGAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACAGAGAGAGAGAGAGCGCCATGTGTG
ACTGCAGACACAACTACGGGGTTGTGGAGAGCTTCACAGTGCAGCGGCGAGGTGAGCGCGGCGCGGGGCGGGGCCTGAGTCCCTGTGAGCTGGGAATCTGAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGAGAGAGAGAGAGAGAGAGACAGAGAGACAGAGAGAGAGAGAGCGCCATCTGTGAGCATTTAGAATCCTCTCTATCCTGAGCAAGGA
AGTCGCCTTTACCCGGTTTCA

The .fa file has to end with a newline, otherwise the last line is not processed!

ADD COMMENT

Login before adding your answer.

Traffic: 2437 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6