Question

How to reverse the order of **records** in a fasta file

1

Entering edit mode

3.6 years ago

Apex92 ▴ 300

This might be easy but I have not been able to solve it - I have a normal fasta file as below:

>seq1
ATCTAC
>seq2
AATCGCATCG
>seq3
ATATACAGC
>seq1
ATCGCGGGC
>seq4
ATTAATTTTAT

what I want o do is to reverse the order of lines in the file. So the desired output should be:

>seq4
ATTAATTTTAT
>seq1
ATCGCGGGC
>seq3
ATATACAGC
>seq2
AATCGCATCG
>seq1
ATCTAC

I tried using tail -r file.fa but then it does not work because ids get placed after the sequences.

How can I do this?

Thank you

sequence pipeline programming fasta • 2.6k views

ADD COMMENT • link updated 3.6 years ago by cpad0112 21k • written 3.6 years ago by Apex92 ▴ 300

2

Entering edit mode

This is not reverse the order of lines in a fasta file but rather reverse the order of **records** in a fasta file.

ADD REPLY • link 3.6 years ago by GenoMax 147k

0

Entering edit mode

Can you tell us why you want to do this?

ADD REPLY • link 3.6 years ago by Sej Modha 5.3k

0

Entering edit mode

I loop on the sequences and save their kmers in a list and if any of the reverse-complement of kmers exist in any of the sequences then that sequence should be rejected. With that being said- there is a very low chance for the bottom sequences to get past the filter since the number of kmers increases exponentially. With reversing the order I want to apply the filter on the bottom sequences first.

ADD REPLY • link 3.6 years ago by Apex92 ▴ 300

4

Entering edit mode

3.6 years ago

5heikki 11k

Reverse the order of records of a fasta file:

tac -b -r -s "^>" in.fna > out.fna

Reverse the order of records of a fastq file:

tac -b -r -s "^@" in.fq > out.fq

These work even if there are linebreaks in the sequences..

ADD COMMENT • link 3.6 years ago by 5heikki 11k

0

Entering edit mode

elegant solution.

ADD REPLY • link 3.6 years ago by cpad0112 21k

0

Entering edit mode

Nice those tac options are new to me ! it will work in most/all cases. But you should never trust '>' as a separator :-)

https://nsaunders.wordpress.com/2014/08/14/looking-for-in-all-the-wrong-places/

ADD REPLY • link 3.6 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Those examples show uses of > within the header line, but record separators appear to either use foo.startswith('>') or the regex ^> to look for the delimiter at the start of a line, no?

ADD REPLY • link 3.6 years ago by Alex Reynolds 35k

2

Entering edit mode

3.6 years ago

cpad0112 21k

$ paste - - < test.fa | tac | tr "\t" "\n"                                                                                                                                           

>seq4
ATTAATTTTAT
>seq1
ATCGCGGGC
>seq3
ATATACAGC
>seq2
AATCGCATCG
>seq1
ATCTAC

ADD COMMENT • link 3.6 years ago by cpad0112 21k

score 3 · Accepted Answer · 2021-03-30

3

Entering edit mode

3.6 years ago

Pierre Lindenbaum 164k

awk '/^>/ {printf("%s%d\t%s\t",(N>0?"\n":""),N,$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fasta  | sort -t $'\t' -k1,1nr | cut -f 2- | tr "\t" "\n"

ADD COMMENT • link 3.6 years ago by Pierre Lindenbaum 164k

score 2 · Accepted Answer · 2021-03-30

Another way this can be done is to read the FASTA file from its end to its start.

Reading the file backwards avoids having to use hash tables or sorting, which can be expensive on very large inputs. This approach will be fast and use very little memory:

#!/usr/bin/env python

import sys
import os

ifn = sys.argv[1] if len(sys.argv) == 2 else None

def reversed_lines(file):
    "Generate the lines of file in reverse order."
    part = ''
    for block in reversed_blocks(file):
        for c in reversed(block):
            if c == '\n' and part:
                yield part[::-1]
                part = ''
            part += c
    if part: yield part[::-1]

def reversed_blocks(file, blocksize=4096):
    "Generate blocks of file's contents in reverse order."
    file.seek(0, os.SEEK_END)
    here = file.tell()
    while 0 < here:
        delta = min(blocksize, here)
        here -= delta
        file.seek(here, os.SEEK_SET)
        yield file.read(delta)

def main():
    ifh = sys.stdin if not ifn else open(ifn, "r")
    lines = reversed_lines(ifh)
    sequence = ""
    header = ""
    for line in lines:
        if line.startswith('>'):
            header = line.rstrip()
            sys.stdout.write('{}\n{}\n'.format(header, sequence))
            sequence = ""
        else:
            sequence = line.rstrip() + sequence
    ifh.close()

if __name__ == "__main__":
    main()

Usage:

$ ./reverse_fasta_records.py < in.fa > out.fa

Or:

$ ./reverse_fasta_records.py in.fa > out.fa

Edit: Modified this script to support the input file either coming from standard input or a regular file.