How To Convert A Big Fasta File With Multi-Line Dna Sequences Into A Fasta File Of Reverse Complement Sequence
6
3
Entering edit mode
13.0 years ago
Hamilton ▴ 290

Hi,

I am trying to convert a big fasta file(as below seqID is started from 1 and upto ~20K) with multi-line DNA sequences into a fasta output format of reverse complement. how can i generate it? there are some available web tools but only one sequence or small file size is acceptable.

Any your input will be appreciated.

###
>1
GAGTTGTCCCATCCCTGGGAATGCATCAGAGGCAGGCAAGAGGATCAGATGGCAAAGTGA
TGAAATCAGAGACAAAGTGGGGTTAACTGAGAGCTCGCTCCCCCAGGCTTCTCTGACCCC
AGCTCCCTCCACTTAAAACACGGTGCCCATTGATGGCAGCTCTGTTTCATTTCCTGTTAT
TCC
>2
GCCACACTATGAGAGAGTTGCAGAGTAAAACGTCTATGTAAACATATCAGATGTATTACA
GTAATACCGTCTGTGCTTACTCAGTTCTCCTTGGCTCGGTTTTATGTACAAACTAAACTA
TTAGGATCCCTTTCATTGTTCCTTCTGGCAGCTCTGTCTAACTTCTG
>3
TTGGCAGTCATAGTGTGCGGCTGTGTCTGCCACTCCTCTGCAAACTCCCCAGCTAAGTAG
ATTGCTGTAAAGCGTGACTAATGAGGTCCTGTGAGCTGACCTAATGAAACCCAACCAGCA
GGGTCACCAGGGCCACCAAAGCGTGAGTCTGCATCTCCCTAGGCCGTGTAACCGGATCCG
CACTGGCTCTGGAGGGAGGGCCAGGAG
>4
TCCACAGGTCATCCATCACCTGGGAGGAAGCCATGCTAGCAGGTAACTGGACTCTCTGCT
CGGTGGCCCGCACACTCTCTGTCTTTCATCTCTGCCCAGCAAGCTGCGTGGGCTTTCAGT
GCCCAGTCATTAACTCTGGCCTT
>5
CAGAGTGTTAACTTACACATGTGGGCATATACATTACGCGGGGCGGATAACTCACTCCTA
ATTTACTGCTGTCTTCTTCTCCGTGTGGAGGATCTTGTTCCTGTCGACAAATTTGGGGAT
CTAATAACAAAGTAGCGGAAGGATTCATCAGGGCAGGTGTAATC
>6
GAGTAAAGTCGACCCCTCAAAGGCGAATACTCTACTGTTTACTTAACCCAACATGTAGGA
GGCACTAGCTGCCAGCAGAGCAGGGATGGAATAACCCTTGATCTGAAGGATTTACAGTGT
AGTGGGTGATACAGATATCCGCGTACACACCAGTGAGCTGACTG
>7
ATGTTTACCCGAGCCTTGAGTGGCAGGTAAATGGACGAGTGGGAGGCACGGCCAGTTTCT
GTTGTGTATCCAGTATAGAATGTCCTAGAGTGGCCAACTGTCCTCTGGATGTGTTATTGG
ATCCTTGAATTCTGATAGCTTCACCAG
>8
agtaagcagcaccctgcacggcctctgcacggcctctgcctcaggttgcaggcctgcctg
aggccctgtcctgccttcctttgaagatgacctgtcatgtggaactgagtgaaatcaatc
ctcttctccccaagctgctttgggtcatggtgttttatca

###

##

dna fasta • 18k views
ADD COMMENT
0
Entering edit mode

So I'm working on compiling a few programming exercises for people interested in bioinformatics ... and I like the solutions posted here ... I'm wondering if I could get a novice explanation on the motivation to get the reverse compliment sequence

ADD REPLY
6
Entering edit mode
13.0 years ago
Pablacious ▴ 630

You could either use BioPerl or BioPython for this task, rather than directly scripting the correspondence of bases and so on. It would probably be no more than 10 lines in any of these two.

For instance, from the BioPython tutorial (just change genbank for fasta for reading a fasta:

from Bio import SeqIO
for record in SeqIO.parse("ls_orchid.gbk", "genbank"):
    print record.id
    print record.seq.reverse_complement()

Now, if we want to save these reverse complements to a file, we’ll need to make SeqRecord objects. For this I think its most elegant to write our own function, where we can decide how to name our new records:

from Bio.SeqRecord import SeqRecord

def make_rc_record(record):
    """Returns a new SeqRecord with the reverse complement sequence."""
    return SeqRecord(seq = record.seq.reverse_complement(), \
                 id = "rc_" + record.id, \
                 description = "reverse complement")

We can then use this to turn the input records into reverse complement records ready for output. If you don’t mind about having all the records in memory at once, then the Python map() function is a very intuitive way of doing this:

from Bio import SeqIO
records = map(make_rc_record, SeqIO.parse("ls_orchid.fasta", "fasta"))
SeqIO.write(records, "rev_comp.fasta", "fasta")

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc63

ADD COMMENT
2
Entering edit mode
13.0 years ago
Neilfws 49k

When your data get too large for web-based tools, it's time to learn about locally-installed tools.

See EMBOSS revseq. Don't be confused by the documentation - it handles multiple sequence files.

ADD COMMENT
2
Entering edit mode
13.0 years ago

Biopieces are pretty ideal for stuff like this:

read_fasta -i data.fna | reverse_seq | complement_seq | write_fasta -o data_rc.fna -x

You can alias this to revcomp_seq as explained here.

ADD COMMENT
1
Entering edit mode
13.0 years ago
ADD COMMENT
0
Entering edit mode
13.0 years ago

Here is a quick and dirty python script for reverse complement:

import sys

inFile = open(sys.argv[1],'r')

nuc = {'A':'T','T':'A','G':'C','C':'G','K':'M','M':'K','R':'Y','Y':'R','S':'W','W':'W','B':'V','V':'B','H':'G','D':'C','X':'N','N':'N'}

def revComp(seq):
    rev = ''
    for i in range(len(seq) - 1,-1,-1):
        rev += nuc[seq[i]]

    return rev

header = ''
seq = ''
for line in inFile:
    if line[0] == ">":
        if header != '':
            print header
            print revComp(seq.upper())

        header = line.strip()
        seq = ''
    else:
        seq += line.strip()

print header
print revComp(seq.upper())

Save as yourName.py. Use by: python yourName.py yourSeq.fasta > revComp.fasta

edit*

Added ambiguity codes. It's probably better to go with the BioPython implementation by pablacious. Doesn't hurt to learn how these things work though.

ADD COMMENT
0
Entering edit mode

Doesn't handle ambiguity codes. Perhaps too quick and dirty :o)

ADD REPLY
0
Entering edit mode
8.1 years ago

Solution for Command line lovers. Considering the DNA sequences in single-line format in a multifasta file:

cat multifasta_file.txt | while IFS= read L; do if [[ $L == >* ]]; then echo "$L"; else echo $L | rev | tr "ATGCatgc" "TACGtacg"; fi; done > output_file.txt

If your multifasta file is not in single-line format, you can transform your file to single-line before using the command above, like this:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' <multifasta_file.txt &gt;multifasta_file_singleline.txt<="" p="">

Then,

cat multifasta_file_SingleLine.txt | while IFS= read L; do if [[ $L == >* ]]; then echo "$L"; else echo $L | rev | tr "ATGCatgc" "TACGtacg"; fi; done > output_file.txt

Hope it is useful for someone. It took me quite some time to come out with it.

ADD COMMENT
0
Entering edit mode

Hi Melendezmatias, I tried your first solution and got the following errors:

-bash: unexpected argument `>' to conditional binary operator
-bash: syntax error near `>*'

Any idea of how to overcome this? Thanks, -Rob

ADD REPLY

Login before adding your answer.

Traffic: 1615 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6