Sort Sequences In A Fasta File According To The Sequence List In Another Fasta/Txt File
2
2
Entering edit mode
10.9 years ago
biostar ▴ 170

Hi Guys,

I have a fasta file with thousands of sequences and another text file for the sequence ids ( each id in a new line and they are the list of same sequences but in different order than the sequences in the fasta file). Could your guys please help me sort the sequences in fasta file according to the ID list in different text file? Thanks!!

.fasta file looks like this:

>Seq3
ACTTTTGATACAATTAACAGGACGAAAATAATAGAAAAGCTAAAGCATCTTAGAATCCCA
>Seq4
AATCCCAGACAAATTAAGACATATTCTAACAGTGAGTCTACAGAACACAGAACACTATAG
>Seq1
AGTTTTGCAATGGTAAATTATTTTGAAGAGTTTATAGGTCGTGTCTGGAACTGCAATTAT
>Seq2
TGGAATATTAGACGAATTCCATACACAGCACCTATTGTAATATTCATAGATTTCAAAAGC

The .txt file for Seq IDs looks like this:

Seq1
Seq2
Seq3
Seq4

The expected result should be:

>Seq1
AGTTTTGCAATGGTAAATTATTTTGAAGAGTTTATAGGTCGTGTCTGGAACTGCAATTAT
>Seq2
TGGAATATTAGACGAATTCCATACACAGCACCTATTGTAATATTCATAGATTTCAAAAGC
>Seq3
ACTTTTGATACAATTAACAGGACGAAAATAATAGAAAAGCTAAAGCATCTTAGAATCCCA
>Seq4
AATCCCAGACAAATTAAGACATATTCTAACAGTGAGTCTACAGAACACAGAACACTATAG
sequence fasta • 15k views
ADD COMMENT
0
Entering edit mode

Just to be clear: the IDs in the fasta file are integers, but in the related text file they are prefixed with "Seq"?

ADD REPLY
0
Entering edit mode

Sorry, the IDs are both alpha-numeric and I corrected that above.

ADD REPLY
1
Entering edit mode
10.9 years ago
Pavel Senin ★ 1.9k

That was answered before, here, but anyway, why not to use an index for thousands of sequences?

use strict;
use Bio::Index::Fasta;

# file names
#
my $In_Fasta_File_Name = "test.fa";
my $List_File_Name     = "list.txt";

#
# make index
#
my $Index_File_Name = "tmp.idx";
my $idx             = Bio::Index::Fasta->new(
 '-filename'   => $Index_File_Name,
 '-write_flag' => 1
);
$idx->make_index($In_Fasta_File_Name);

#
# open the list
#
open( my $list, $List_File_Name ) or die "Could not open $List_File_Name !";

#
# write to stdout using list and index
#
my $out = Bio::SeqIO->new( '-format' => 'Fasta', '-fh' => \*STDOUT );
while ( my $id = <$list> ) {
 chomp $id;
 my $seq = $idx->fetch($id); 
 $out->write_seq($seq);
}

and the output

>Seq1
AGTTTTGCAATGGTAAATTATTTTGAAGAGTTTATAGGTCGTGTCTGGAACTGCAATTAT
>Seq2
TGGAATATTAGACGAATTCCATACACAGCACCTATTGTAATATTCATAGATTTCAAAAGC
>Seq3
ACTTTTGATACAATTAACAGGACGAAAATAATAGAAAAGCTAAAGCATCTTAGAATCCCA
>Seq4
AATCCCAGACAAATTAAGACATATTCTAACAGTGAGTCTACAGAACACAGAACACTATAG
ADD COMMENT
0
Entering edit mode

Thanks Pavel! Thanks to everyone who contributed.

ADD REPLY
0
Entering edit mode
10.9 years ago
AndreiR ▴ 260

A bash for loop?

for i in `cat .txt|sed 's/Seq//'`; do cat .fasta| grep ">$i" -A1 >> .sorted_fasta; done
ADD COMMENT
2
Entering edit mode

It might be easier to

while read ID; do grep -A1 ">$ID" FASTA ; done < ID_FILE
ADD REPLY
2
Entering edit mode

Tip: You can substantially speed up grep by using LC_ALL=C grep instead. See for example here.

ADD REPLY
2
Entering edit mode

and by using -m 1 to avoid searching the rest of the file once the first (and only hit?) has been found. Notice also the ^ symbol to limit the search to the beginning of the line:

while read ID ; do grep -m 1 -A 1 "^>$ID" FASTA ; done < ID_FILE
ADD REPLY
1
Entering edit mode

A faster version:

cat $fasta | awk 'BEGIN{RS=">"; FS="\n"; curl=1;} NR>1{print $1 " " curl " " curl+NF-2; curl=curl+NF-1}' | sort -n -k 1,1 | while IFS=' ' read -ra TAB; do beg=${TAB[1]}; end=${TAB[2]}; sed -n $beg','$end'p' $fasta; done

It first generated lines like that : ID line_begin line_end. It sorts them by first column (ID) and for each of those sorted lines, it prints corresponding sequence with ID.

I also found out there was an amazingly fast tool named fastasort in exonerate suite. It's available in Debian/Ubuntu repositories or there : https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate

fastasort runs approximately 100x faster than my script.

ADD REPLY
0
Entering edit mode

There is a mistake in your grep:

"^>$ID$"

You need to do an exact match for the ID otherwise it will match with "22" when searching "2".

Moreover none of your solutions in bash work with multiline sequences. Here is a small working solution :

cat $fasta | grep "^>" | sort | while read ID ; do awk 'BEGIN{RS=">"; ORS="";} /^'${ID:1}'/{print ">" $0; exit(0);}' $fasta ; done

It sorts the ids and for each id, it extracts the sequence from fasta file.

ADD REPLY
0
Entering edit mode
ADD REPLY

Login before adding your answer.

Traffic: 1838 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6