In a previous question, "Code golf: mean length of fasta sequences", Eric asked for solutions to compute the average length of the sequences in a FASTA file.
I tried to answer this question using the following Erlang code:
-module(golf).
-export([test/0]).

%% Update the {Sequences,Total} counters for one input line.
line([],{Sequences,Total}) -> {Sequences,Total};
%% A header line starts a new sequence.
line(">" ++ _Rest,{Sequences,Total}) -> {Sequences+1,Total};
%% Any other line adds its length (after stripping blanks) to the total.
line(L,{Sequences,Total}) -> {Sequences,Total+string:len(string:strip(L))}.

%% Read lines from S until end of input, accumulating the counters.
scanLines(S,Sequences,Total) ->
    case io:get_line(S,'') of
        eof -> {Sequences,Total};
        {error,_} -> {Sequences,Total};
        Line ->
            {S2,T2}=line(Line,{Sequences,Total}),
            scanLines(S,S2,T2)
    end.

test() ->
    {Sequences,Total}=scanLines(standard_io,0,0),
    io:format("~p\n",[Total/(1.0*Sequences)]),
    halt().
Compilation/Execution:
erlc golf.erl
erl -noshell -s golf test < sequence.fasta
563.16
This code seems to work fine for a small FASTA file, but it takes hours to parse uniprot_sprot.fasta (in fact, I pressed Ctrl-C). Why? I'm an Erlang newbie; can you improve this code?
Pierre, in the meantime maybe you should post your question on Stack Overflow as well. Just a suggestion.
Fred, I'll do that if I don't get an answer here :-)
Posted on SO: http://stackoverflow.com/questions/3296855
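One direction that might be worth trying, offered as a sketch and not as a diagnosis: the code above reads standard_io one line at a time as Erlang strings (lists of characters). The variant below assumes that this per-line I/O is the bottleneck. It opens the file directly in raw binary mode with a read-ahead buffer and uses file:read_line/1. The module name golf2, the function names mean_length/1, scan/3 and residues/1, and the 64 KB buffer size are all hypothetical choices made for illustration. Unlike the original, it does not count line terminators, so its result may differ slightly from the 563.16 shown above.

```erlang
-module(golf2).
-export([mean_length/1]).

%% Hypothetical variant: takes a file path instead of reading standard input.
%% Fails with badarith if the file contains no header line (Sequences = 0).
mean_length(Path) ->
    %% raw + binary avoids building a character list for every line;
    %% read_ahead buffers the input so read_line does not hit the file per line.
    {ok, Fd} = file:open(Path, [read, raw, binary, {read_ahead, 65536}]),
    {Sequences, Total} = scan(Fd, 0, 0),
    ok = file:close(Fd),
    Total / Sequences.

%% Tail-recursive loop over the lines of the file.
scan(Fd, Sequences, Total) ->
    case file:read_line(Fd) of
        {ok, <<">", _/binary>>} -> scan(Fd, Sequences + 1, Total);
        {ok, Line}              -> scan(Fd, Sequences, Total + residues(Line));
        eof                     -> {Sequences, Total};
        {error, _}              -> {Sequences, Total}
    end.

%% Number of bytes in a line, ignoring trailing \n and \r.
residues(<<>>) -> 0;
residues(Line) ->
    case binary:last(Line) of
        C when C =:= $\n; C =:= $\r ->
            residues(binary:part(Line, 0, byte_size(Line) - 1));
        _ ->
            byte_size(Line)
    end.
```

Assuming the module is compiled with erlc golf2.erl, it could be run as:

erl -noshell -eval 'io:format("~p~n", [golf2:mean_length("uniprot_sprot.fasta")]), halt().'

Whether this is actually faster on a large file would need to be measured.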