Question

Alphabetic Sort In Unix/Perl With Preference On Order Of Alphabets To Be Followed

2

14.4 years ago

Monzoor ▴ 300

Sorting DNA sequences in Unix is done in alphabetical order. Is it possible to sort DNA sequences using a specified ordering of the letters?

unix perl sort dna sequence • 5.0k views

ADD COMMENT • link updated 14.3 years ago by Rvosa ▴ 580 • written 14.4 years ago by Monzoor ▴ 300

Ram · Answer 1 · 2010-12-20

5

14.4 years ago

Pierre Lindenbaum 166k

I'm not sure I understand your question, but if you want to sort using, for example, the order C, A, T, G, I would use 'tr' to recode the letters of each sequence before sorting and then map them back afterwards. Something like:

cat onesequenceperline.txt |\
tr "C" "0" | tr "A" "1" | tr "T" "2" | tr "G" "3" |\
sort |\
tr "0" "C" | tr "1" "A" | tr "2" "T" | tr "3" "G" > result.txt

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 14.4 years ago by Pierre Lindenbaum 166k

3

You can cut the number of processes down from 10 to 3 by removing the 'useless use of cat' and the redundant tr's:

tr 'CATG' '0123' < onesequenceperline.txt | sort | tr '0123' 'CATG' > result.txt

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 14.4 years ago by biobot 0.0.77.a.1099 6.2k
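For anyone who wants to try the recoding trick on a small input first, here is a minimal sketch that wraps it in a shell function. The function name `sort_by_alphabet` and the sample sequences are made up for illustration; they are not from the posts above. It assumes one sequence per line, uppercase letters only, and an alphabet of at most 10 letters, because each letter is mapped to a single digit. `LC_ALL=C` is set so that `sort` compares bytes rather than using locale rules.

```shell
# Hypothetical helper: sort one-sequence-per-line input by a custom letter order.
# $1 is the alphabet in the desired order, e.g. CATG (at most 10 letters,
# because each letter is recoded to one of the digits 0-9).
sort_by_alphabet() {
    alphabet=$1
    # Take as many digits as there are letters in the alphabet
    digits=$(printf '%s' '0123456789' | cut -c "1-${#alphabet}")
    # Recode letters to digits, sort bytewise, then map the digits back
    tr "$alphabet" "$digits" | LC_ALL=C sort | tr "$digits" "$alphabet"
}

# Made-up sample input
printf 'GATC\nCATG\nTTAA\nACGT\n' | sort_by_alphabet CATG
```

With the order C, A, T, G this prints CATG, ACGT, TTAA, GATC, one per line. Letters outside the given alphabet (N, for instance) are left untouched and therefore sort after the recoded ones.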

2

Wow! This is a simple yet effective idea. Somehow it never struck me. I have to check how it scales to huge data sets. Anyway, thanks a lot, PL.

ADD REPLY • link 14.4 years ago by Monzoor ▴ 300

0

@Monzoot: very nice suggestion, thanks :-)

ADD REPLY • link 14.4 years ago by Pierre Lindenbaum 166k

0

@Keith, very nice suggestion! Thanks! :-)

ADD REPLY • link 14.4 years ago by Pierre Lindenbaum 166k

Ram · Answer 2 · 2010-12-20

If you are asking how to sort alphabetically, the letters within each line of a file like the one Pierre is imagining could be sorted alphabetically in Perl like this:

perl -lne 'print sort { $a cmp $b } split //' onesequenceperline.txt

To sort in reverse order, switch $a and $b around. (In normal order the first argument to sort, the comparison block, can be omitted, so you could golf it down some more.) An advantage of this approach is that it handles all IUPAC single-nucleotide codes; a disadvantage is that it doesn't let you define a custom ordering, as Pierre's solution does. If you want that, you will have to define a custom sort function, which won't fit neatly in a one-liner, or at the very least a custom mapping such as the %map hash below. It achieves the same ordering as Pierre's, but also converts all letters in the sequence to uppercase and checks for unexpected letters (it dies if it finds any):

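use strict;
use warnings;

# Rank of each letter in the custom order: C < A < T < G
my %map = (
    'C' => 0,
    'A' => 1,
    'T' => 2,
    'G' => 3,
);

while (<>) {
    chomp;
    # Uppercase each letter, die on anything not in %map, then sort by rank
    print sort { $map{$a} <=> $map{$b} }
          grep { exists $map{$_} or die "Unexpected letter '$_' on line $.\n" }
          map  { uc } split //;
    print "\n";
}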

score 0 · Answer 3 · 2010-12-20

0

14.4 years ago

Spitshine ▴ 660

I am not sure I understand your question either, but it sounds as if you could pass a custom comparison function to Perl's sort (http://perldoc.perl.org/functions/sort.html).

ADD COMMENT • link 14.4 years ago by Spitshine ▴ 660