Question

Sort Sequences In A Fasta File According To The Sequence List In Another Fasta/Txt File

2

Entering edit mode

10.9 years ago

biostar ▴ 170

Hi Guys,

I have a fasta file with thousands of sequences and another text file for the sequence ids ( each id in a new line and they are the list of same sequences but in different order than the sequences in the fasta file). Could your guys please help me sort the sequences in fasta file according to the ID list in different text file? Thanks!!

.fasta file looks like this:

>Seq3
ACTTTTGATACAATTAACAGGACGAAAATAATAGAAAAGCTAAAGCATCTTAGAATCCCA
>Seq4
AATCCCAGACAAATTAAGACATATTCTAACAGTGAGTCTACAGAACACAGAACACTATAG
>Seq1
AGTTTTGCAATGGTAAATTATTTTGAAGAGTTTATAGGTCGTGTCTGGAACTGCAATTAT
>Seq2
TGGAATATTAGACGAATTCCATACACAGCACCTATTGTAATATTCATAGATTTCAAAAGC

The .txt file for Seq IDs looks like this:

Seq1
Seq2
Seq3
Seq4

The expected result should be:

>Seq1
AGTTTTGCAATGGTAAATTATTTTGAAGAGTTTATAGGTCGTGTCTGGAACTGCAATTAT
>Seq2
TGGAATATTAGACGAATTCCATACACAGCACCTATTGTAATATTCATAGATTTCAAAAGC
>Seq3
ACTTTTGATACAATTAACAGGACGAAAATAATAGAAAAGCTAAAGCATCTTAGAATCCCA
>Seq4
AATCCCAGACAAATTAAGACATATTCTAACAGTGAGTCTACAGAACACAGAACACTATAG

sequence fasta • 15k views

ADD COMMENT • link updated 2.6 years ago by sunnykevin97 ▴ 990 • written 10.9 years ago by biostar ▴ 170

0

Entering edit mode

Just to be clear: the IDs in the fasta file are integers, but in the related text file they are prefixed with "Seq"?

ADD REPLY • link updated 3.7 years ago by Ram 44k • written 10.9 years ago by Neilfws 49k

0

Entering edit mode

Sorry, the IDs are both alpha-numeric and I corrected that above.

ADD REPLY • link updated 3.7 years ago by Ram 44k • written 10.9 years ago by biostar ▴ 170

Ram · Answer 1 · 2014-01-15

That was answered before, here, but anyway, why not to use an index for thousands of sequences?

use strict;
use Bio::Index::Fasta;

# file names
#
my $In_Fasta_File_Name = "test.fa";
my $List_File_Name     = "list.txt";

#
# make index
#
my $Index_File_Name = "tmp.idx";
my $idx             = Bio::Index::Fasta->new(
 '-filename'   => $Index_File_Name,
 '-write_flag' => 1
);
$idx->make_index($In_Fasta_File_Name);

#
# open the list
#
open( my $list, $List_File_Name ) or die "Could not open $List_File_Name !";

#
# write to stdout using list and index
#
my $out = Bio::SeqIO->new( '-format' => 'Fasta', '-fh' => \*STDOUT );
while ( my $id = <$list> ) {
 chomp $id;
 my $seq = $idx->fetch($id); 
 $out->write_seq($seq);
}

and the output

>Seq1
AGTTTTGCAATGGTAAATTATTTTGAAGAGTTTATAGGTCGTGTCTGGAACTGCAATTAT
>Seq2
TGGAATATTAGACGAATTCCATACACAGCACCTATTGTAATATTCATAGATTTCAAAAGC
>Seq3
ACTTTTGATACAATTAACAGGACGAAAATAATAGAAAAGCTAAAGCATCTTAGAATCCCA
>Seq4
AATCCCAGACAAATTAAGACATATTCTAACAGTGAGTCTACAGAACACAGAACACTATAG

Ram · Answer 2 · 2014-01-14

0

Entering edit mode

10.9 years ago

AndreiR ▴ 260

A bash for loop?

for i in `cat .txt|sed 's/Seq//'`; do cat .fasta| grep ">$i" -A1 >> .sorted_fasta; done

ADD COMMENT • link updated 3.7 years ago by Ram 44k • written 10.9 years ago by AndreiR ▴ 260

2

Entering edit mode

It might be easier to

while read ID; do grep -A1 ">$ID" FASTA ; done < ID_FILE

ADD REPLY • link updated 3.7 years ago by Ram 44k • written 10.9 years ago by PoGibas 5.1k

2

Entering edit mode

Tip: You can substantially speed up grep by using LC_ALL=C grep instead. See for example here.

ADD REPLY • link updated 3.7 years ago by Ram 44k • written 10.9 years ago by Philipp Bayer 8.7k

2

Entering edit mode

and by using -m 1 to avoid searching the rest of the file once the first (and only hit?) has been found. Notice also the ^ symbol to limit the search to the beginning of the line:

while read ID ; do grep -m 1 -A 1 "^>$ID" FASTA ; done < ID_FILE

ADD REPLY • link updated 3.7 years ago by Ram 44k • written 10.9 years ago by Frédéric Mahé ★ 3.2k

1

Entering edit mode

A faster version:

cat $fasta | awk 'BEGIN{RS=">"; FS="\n"; curl=1;} NR>1{print $1 " " curl " " curl+NF-2; curl=curl+NF-1}' | sort -n -k 1,1 | while IFS=' ' read -ra TAB; do beg=${TAB[1]}; end=${TAB[2]}; sed -n $beg','$end'p' $fasta; done

It first generated lines like that : ID line_begin line_end. It sorts them by first column (ID) and for each of those sorted lines, it prints corresponding sequence with ID.

I also found out there was an amazingly fast tool named fastasort in exonerate suite. It's available in Debian/Ubuntu repositories or there : https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate

fastasort runs approximately 100x faster than my script.

ADD REPLY • link updated 3.7 years ago by Ram 44k • written 6.7 years ago by veyssier ▴ 10

0

Entering edit mode

There is a mistake in your grep:

"^>$ID$"

You need to do an exact match for the ID otherwise it will match with "22" when searching "2".

Moreover none of your solutions in bash work with multiline sequences. Here is a small working solution :

cat $fasta | grep "^>" | sort | while read ID ; do awk 'BEGIN{RS=">"; ORS="";} /^'${ID:1}'/{print ">" $0; exit(0);}' $fasta ; done

It sorts the ids and for each id, it extracts the sequence from fasta file.

ADD REPLY • link updated 3.7 years ago by Ram 44k • written 6.7 years ago by veyssier ▴ 10

0

Entering edit mode

This works more efficiently. http://seqanswers.com/forums/showthread.php?t=29558.

ADD REPLY • link 2.6 years ago by sunnykevin97 ▴ 990