Mean Length Of Fasta Sequences
31
38
Entering edit mode
14.5 years ago

Hi,

I was about to re-invent the wheel, again, when I thought that there are probably many among you who had already solved this problem.

I have a few huge fasta files (half a million sequences) and I would like to know the average length of the sequences for each of these files, one at a time.

So, let's call this a code golf. The goal is to make a very short code to accomplish the following:

Given a fasta file, return the average length of the sequences.

The correct answer will go to the answer with the most votes on friday around 16h in Quebec (Eastern Time).

You can use anything that can be run on a linux terminal (your favorite language, emboss, awk...), diversity will be appreciated :)

Cheers

CONCLUSION:

And the correct answer goes to the most voted question, as promised :) Thank you all for your great participation! I think I am not the only one to have appreciated the great diversity in the answers, as well as very interesting discussions and ideas.

EDIT:

Although the question is now closed, do not hesitate to vote for all the answers you found interesting, especially those at the bottom! They are not necessarily less interesting. They often have only arrived later :)

Thanks again!

code fasta sequence codegolf • 37k views
ADD COMMENT
5
Entering edit mode

A series of "code golf" tournaments might be a fun way to seed the proposed "Project Mendel"

ADD REPLY
3
Entering edit mode

I love the answers to this question! Amazing diversity: we get the perl, python, R scripts then .... whoa ... flex, clojure, erlang, haskell, ocaml, memory mapped C files.

ADD REPLY
2
Entering edit mode

Create a 'flat' fasta for the human genome that for each chromosome contains the entire sequence as a single line. Now run the tools below and see which one can still do it.

ADD REPLY
2
Entering edit mode

This is one of the best posts in the forum! Opened my eyes to the full universe of programming approaches. Can I +1 code golf as well as this post?

ADD REPLY
1
Entering edit mode

I would be incredibly impressed to see an answer in BF, Var'aq or Lolcode ... not for usability but just to see a "real-word" application.

ADD REPLY
1
Entering edit mode

Kind of what I had in mind :) Something like doing 1 golf-code per week and slowly building up a set of fun problems. Maybe there should be and off-biostars group discussing problem propositions and formulation. For example forming a google group... What do you think?

ADD REPLY
24
Entering edit mode
14.5 years ago
Cassj ★ 1.3k

Bioconductor?

library(Biostrings)
s <- read.DNAStringSet(filename,"fasta")
sum(width(s)) / length(s)

@Stefano: Yeah, I think the above does just read everything in. Seems to be able to cope with pretty big files, but actually maybe this is better:

library(Biostrings)
s <- fasta.info(filename)
sum(s)/length(s)
ADD COMMENT
4
Entering edit mode

Fair enough, although I would argue that's it's the opposite of hacky to use an existing, well supported, open source library. Sure, it's useful to see how to implement solutions to a problem in different languages and in different ways, but it's also fairly handy to know if someone has already done all the work for you somewhere.

ADD REPLY
4
Entering edit mode

The benefit of using high level languages like R is specifically that they have libraries to help make these kind of tasks easy. Why force folks to re-implement a Fasta parser when a robust solution exists? The interesting thing about this post is the wide variety of approaches; you would take that away by putting restrictions on folks creativity.

ADD REPLY
3
Entering edit mode

Am I wrong or this code would read the whole file in memory? with Huge fasta file it might be problem

ADD REPLY
3
Entering edit mode

I'd point out that the original question uses the phrase "reinvent the wheel" and specifies any command-line solution, including emboss. Libraries do not "do everything". You need to know where to find them, how to install them and how to write code around them. These are core bioinformatics skills. Writing code from first principles, on the other hand, is a computer science skill. Personally, I favour quick, elegant and practical solutions to problems over computational theory.

ADD REPLY
1
Entering edit mode

Hmm, I think you're right. It seems to be able to cope with pretty big files though. Maybe this instead...?

library(Biostrings)
s<-fasta.info(filename)
sum(s)/length(s)
ADD REPLY
1
Entering edit mode

There are no rules, but if someone wants to use a library that does everything, then we don't have different approaches or code, we just have libraries and hacks. That's my own perspective.

ADD REPLY
1
Entering edit mode

Indeed, maybe next time I propose a code golf, I'll ask people to use their favorite language natively, w/o supplementary libraries. Not because it is functionally better, but maybe didactically preferable :) It also shows better how different approaches/paradigms/philosophies of programming give birth to a lot of diversity! Cheers

ADD REPLY
1
Entering edit mode

@CassJ I personally find nothing wrong with any of the answers posted here. I just have some thoughts that, for further questions, and in the context that these could be used to fuel a maybe-to-be-called Project Mendel, I may be inclined to ask people to use only with native code from their favorite language! Thanks for your answer :)

ADD REPLY
1
Entering edit mode

One day we won't need to code, we will have only libraries. I dream of that day, when other people can do 100% of the work for me.

ADD REPLY
0
Entering edit mode

I always forget about R and bioconductor

ADD REPLY
0
Entering edit mode

and littleR could make this command-line-able

ADD REPLY
0
Entering edit mode

Newish R can do it natively with the Rscript utility.

ADD REPLY
0
Entering edit mode

This could be a winner. Mean = stats, stats = R.

ADD REPLY
0
Entering edit mode

Not to mention that the "code" itself is just a hack, as it depends on a library. The real code would be to see the actual library function that reads the file and parse the data. I could do this in any language as long as I had a library that would do everything for me.

ADD REPLY
0
Entering edit mode

Ah, maybe I missed the point - I thought the OP was just looking for a quick way to solve the problem. Are the rules that you can only use core features of a language? In that case, I guess I'd go with Perl.

ADD REPLY
0
Entering edit mode

@Eric - Project Mendel is a grand idea :) If it's mainly focusing on bioinformatics algorithms then maybe there's a similar need for a bioinformatics CookBook with JFDI implementations of common tasks in different languages?

ADD REPLY
0
Entering edit mode

@CassJ This whole discussion about native code vs. use of modules led me to the very same conclusion earlier today. Namely, that 2 different projects were probably needed. I had, not seriously, thought it could be the "Mendel CookBook" :P More seriously, these 2 projects MUST come into existence. They would be a great asset for the developpement of the bioinformatics community among less computer-oriented people among biologists (and maybe as well among less biology-oriented people among informaticians :)

ADD REPLY
0
Entering edit mode

This version is almost twice longer than mine perl version and is even longer than perl version including invocation. LOL, is it still code golf contest?

ADD REPLY
0
Entering edit mode

wrong thread? :P

ADD REPLY
15
Entering edit mode
14.5 years ago

How about the following very simple AWK script?

awk '{/>/&&++a||b+=length()}END{print b/a}' file.fa

43 characters only (not counting file name). AWK is not known for its speed, though.

ADD COMMENT
0
Entering edit mode

Indeed, this is the slowest I tried so far! 18secs for the uniprot_sprot.fasta file. Funny, when I asked the question, I thought this approach, being 'closer to linux', would be the fastest. We learn everyday :)

ADD REPLY
0
Entering edit mode

If you want speed, just replace awk with mawk ;)

BTW I always wandered if maybe using > as record separator of awk would make more sense (since that's what it is anyway) for parsing fasta. In case of unfolded fasta, you could then set field separator as newline and take lengths of 2nd field and be done. But it get's more tricky with folded (i.e. multiline) fasta and I never quite managed to nail that. :/

ADD REPLY
14
Entering edit mode
14.5 years ago
Will 4.6k

Since it uses itertools and never loads anything into memory this should scale well for even giant fasta files.

from itertools import groupby, imap

tot = 0
num = 0
with open(filename) as handle:
    for header, group in groupby(handle, lambda x:x.startswith('>')):
        if not header:
            num += 1
            tot += sum(imap(lambda x: len(x.strip()), group))
result = float(tot)/num
ADD COMMENT
0
Entering edit mode

2.55 seconds for 400,000 sequences. Not bad! I think there is a mistake with the imap function. It should be map :) (I change it)

ADD REPLY
0
Entering edit mode

Actually its supposed to be imap but also need another import form itertools ... the nice thing about imap is that it won't load the whole iterator into memory at once, it will process the iterator as requested. This is REALLY useful when dealing with fasta files of contigs from many genomes.

ADD REPLY
0
Entering edit mode

I also corrected that the result should be a float so I put float(tot) instead of tot in the result.

ADD REPLY
0
Entering edit mode

one of these days I'll remember that Python doesn't auto-convert on division ... that gotcha has bitten me in the ass many times before, thank goodness for unit-testing.

ADD REPLY
12
Entering edit mode
14.5 years ago
Neilfws 49k

And just for completeness, a version using the BioRuby library:

#!/usr/bin/ruby
require "rubygems"
require "bio"
seqLen = []

Bio::FlatFile.open(ARGV[0]) do |ff|
  ff.each do |seq|
    seqLen << seq.length
  end
end

puts (seqLen.inject(0) { |sum, element| sum + element }).to_f / seqLen.size
ADD COMMENT
0
Entering edit mode

You can get rid of the (0) after inject, I think.

ADD REPLY
0
Entering edit mode

You could. It just sets an initial value for sum.

ADD REPLY
0
Entering edit mode

I voted +1 (regretfully) for your anwser Neil ;-)

ADD REPLY
0
Entering edit mode

Very gracious, I may return the favour :-)

ADD REPLY
12
Entering edit mode
14.5 years ago

Common Lisp version using cl-genomic (shameless self-promotion)

(asdf:load-system :cl-genomic)
(in-package :bio-sequence-user)

(time (with-seq-input (seqi "uniprot_sprot.fasta" :fasta :alphabet :aa
                      :parser (make-instance 'virtual-sequence-parser))
  (loop
     for seq = (next seqi)
     while seq
     count seq into n
     sum (length-of seq) into m
     finally (return (/ m n)))))

Evaluation took:
  9.301 seconds of real time
  9.181603 seconds of total run time (8.967636 user, 0.213967 system)
  [ Run times consist of 0.791 seconds GC time, and 8.391 seconds non-GC time. ]
  98.72% CPU
  22,267,952,541 processor cycles
  2,128,118,720 bytes consed

60943088/172805

Reading the other comments, use of libraries seems controversial, so in plain Common Lisp

(defun mean-seq-len (file)
  (declare (optimize (speed 3) (safety 0)))
  (with-open-file (stream file :element-type 'base-char
                          :external-format :ascii)
    (flet ((headerp (line)
             (declare (type string line))
             (find #\> line))
           (line ()
             (read-line stream nil nil)))
      (do ((count 0)
           (total 0)
           (length 0)
           (line (line) (line)))
          ((null line) (/ (+ total length) count))
        (declare (type fixnum count total length))
        (cond ((headerp line)
               (incf count)
               (incf total length)
               (setf length 0))
              (t
               (incf length (length line))))))))

(time (mean-seq-len "uniprot_sprot.fasta"))

Evaluation took:
  3.979 seconds of real time
  3.959398 seconds of total run time (3.803422 user, 0.155976 system)
  [ Run times consist of 0.237 seconds GC time, and 3.723 seconds non-GC time. ]
  99.50% CPU
  9,525,101,733 processor cycles
  1,160,047,616 bytes consed

60943088/172805
ADD COMMENT
1
Entering edit mode

Thanks, I was only dreaming of seeing a Common Lisp answer to this question :)

ADD REPLY
11
Entering edit mode
14.5 years ago
Neilfws 49k

Here is another R example, this time using the excellent seqinR package:

library(seqinr)
fasta <- read.fasta("myfile.fa")
mean(sapply(fasta, length))

And here's a way to run it from the command line - save this code as meanLen.R:

library(seqinr)
fasta <- read.fasta(commandArgs(trailingOnly = T))
mean(sapply(fasta, length))

And run it using:

Rscript meanLen.R path/to/fasta/file
ADD COMMENT
1
Entering edit mode

Another R hack, show me the function in the library and I will upvote it.

ADD REPLY
3
Entering edit mode

R is open source. Feel free to examine the library code yourself :-)

ADD REPLY
0
Entering edit mode

Great answer!

Here is an alternative using ape package, which I use way more often:

library(ape)
fasta <- read.dna(commandArgs(trailingOnly = T), "fasta")
mean(sapply(fasta, length))

Run it the same way as your script ;)

ADD REPLY
10
Entering edit mode
14.5 years ago

This seem to do it. straight from the command line.

Obviously, being perl, it is compact but a bit obfuscate (also in the reasoning ;))

perl -e 'while(<>){if (/^>/){$c++}else{$a+=length($_)-1}} print $a/$c' myfile.fa

It basically counts all nucleotides (in $a) and the number of sequences (in $c). At the end divides the number of nucleotides by the number of sequences, getting the average length.

$a+=(length($_)-1)

-1 because there is the newline to take into account. use -2 if you use a MS window file. Or chomp.

Does not require lot of memory, and it only need to read the file once.

ADD COMMENT
2
Entering edit mode
perl -le 'while(<>){/^>/?$c++:{$a+=length}} print $a/$c' myfile.fa
ADD REPLY
2
Entering edit mode
perl -nle'if(/^>/){$n++}else{$l+=length}END{print$l/$n}' myfile.fa
ADD REPLY
0
Entering edit mode

I see it uses a similar logic as the one posted by Will. Didn't see it, but he posted first

ADD REPLY
0
Entering edit mode

I actually pilfered the idea from a blog post ... I use it for all of my production code since it deals well with giant (or functionally infinite) fasta files.

ADD REPLY
0
Entering edit mode

Same thing but a few chars shorter...

perl -le 'while(<>){/^>/?$c++:{$a+=length}} print $a/$c' myfile.fa
ADD REPLY
0
Entering edit mode

and I thought perl code properly indented was tough to read ... that might just as well be in Wingdings to me ;)

ADD REPLY
0
Entering edit mode
perl -nle'if(/^>/){$n++}else{$l+=length}END{print$l/$n}' myfile.fa

I personally like the implicit while loop and use of and END block in command line versions such as this.

ADD REPLY
0
Entering edit mode
perl -nle'if(/^>/){$n++}else{$l+=length}END{print$l/$n}' myfile.fa

I personally like the implicit while loop and use of and END block in command line hacks such as this.

ADD REPLY
0
Entering edit mode

I personally like the use of the -n option plus an END block for this kind of command line version.

ADD REPLY
10
Entering edit mode
14.5 years ago
Yang ▴ 190

The perl script might be not so short but quite simple:

my($n, $length) = (0, 0);
while(<>) {
       chomp;
       if(/^>/){
           $n++;
       }else {
           $length += length $_;
       }
   }
  print $length/$n;
ADD COMMENT
2
Entering edit mode

In fact, code like this one bring me closer to the day where I will want to learn Perl! Thanks!

ADD REPLY
1
Entering edit mode

I LIKE to see perl written on many lines like that :) My eyes hurt less!

ADD REPLY
0
Entering edit mode

Hi! I used this perl script and it solved the issue of estimating the average length. However I also need to estimate the standard deviation of the sequence lengths in the fasta file. Does anybody knows how to add this estimation to the aforementioned script? Thank you in advance! :)

ADD REPLY
10
Entering edit mode
14.5 years ago
Mitch Skinner ▴ 660

Haskell Version

I'm a relative Haskell newbie, and I'm sure this could be written more compactly or more elegantly. It's fast, though; even faster than the FLEX version, in my hands. I did see some timing strangeness that I haven't figured out, though--I thought the flex version ran faster the first time I tried it, but now it's consistently slower when I try to run it. Also, I seem to get a different mean value with the perl version than with the others I've tried. So please do reproduce.

EDIT: As Pierre Lindenbaum points out, I forgot the "-f" flag to flex; the times below are updated with that flag.

[?]


Some comparisons, on my machine:

[?]

Thanks for the nice clean problem to learn with!

ADD COMMENT
1
Entering edit mode

Interesting. I had a look at haskell before. By the way, I compiled my flex version with "-f": the lexer is larger but faster.

ADD REPLY
1
Entering edit mode

You could also use the Haskell bioinformatics library, which provides Fasta parsing and, among other things, a seqlength function.

ADD REPLY
0
Entering edit mode

B.interact takes stdin and reads it into a string. And it does that lazily, so this code doesn't ever have the entire file in memory. B.lines separates that string into a list of lines (eliminating newlines); foldl' fastaCount (0, 0) folds the "fastaCount" function over the list of lines, accumulating a count of sequences and a sum of their lengths. showMean takes the resulting count and length and returns a string with the mean, and B.pack does a bit of type conversion (turns the regular string from "show" into the bytestring that B.interact expects)

ADD REPLY
0
Entering edit mode

Ah, the "-f" is what I forgot. With -f, the flex version takes 0.986 seconds real time on my machine (about 0.24 seconds slower than the haskell version)

ADD REPLY
0
Entering edit mode
import Bio.Sequence
main = do
  ls <- map seqlength `fmap` readFasta "input.fasta"
  print (sum ss `div` length ss)
ADD REPLY
9
Entering edit mode
14.5 years ago

34 chars, 45 including invocation.

time perl -nle'/^>/?$c++:($b+=length)}{print$b/$c' uniprot_sprot.fasta
352.669702844246

real    0m2.169s
user    0m2.048s
sys     0m0.080s
ADD COMMENT
8
Entering edit mode
14.5 years ago

Challenge accepted!

5 lines of Python (only 3 of actual work) that print the mean, given the fasta file as a command line argument. It is a bit obfuscated, but I went for brevity:

import sys
from Bio import SeqIO
handle = open(sys.argv[1], 'rU')
lengths = map(lambda seq: len(seq.seq), SeqIO.parse(handle, 'fasta'))
print sum(lengths)/float(len(lengths))

Disclaimer: I have no idea how this will scale. Probably not great.

ADD COMMENT
1
Entering edit mode

Simon, you can replace the reduce(lambda x,y: x+y, lengths) bit with just sum(lengths) to make the last line more straightforward.

ADD REPLY
0
Entering edit mode

You could save a line by putting the open() in the SeqIO.parse() call, but my soul rebelled at that point.

ADD REPLY
0
Entering edit mode

Thanks :) I understand. I tend to write longer code these days in order to be clearer. After all, I'm going to have to re-read this code sometime!

ADD REPLY
0
Entering edit mode

It took 6.6 seconds for 400,000 sequences. Not uber fast, but usable :)

ADD REPLY
0
Entering edit mode

Brad, of course, I always forget about sum()... Changed for the increased readability.

ADD REPLY
0
Entering edit mode

If it depends on an external library, I can do in one line.

ADD REPLY
0
Entering edit mode

Since the question was about not re-inventing the wheel, I decided not to do it with the fasta parsing. But since that's where the biggest performance hit is, I guess I could have done better implementing it myself.

ADD REPLY
8
Entering edit mode
14.5 years ago

Having made the shortest and slowest program, allow me to also present a longer but much faster solution in C that makes use of UNIX memory mapping to read the input file:

#include "sys/types.h"
#include "sys/stat.h"
#include "sys/mman.h"
#include "fcntl.h"
#include "stdio.h"
#include "unistd.h"

void main(int ARGC, char *ARGV[]) {

  int fd;
  struct stat fs;
  char *p1, *p2, *p3;
  unsigned long c1, c2;

  // Map file to memory.
  fd = open(ARGV[1], O_RDWR);
  fstat(fd, &fs);
  p1 = mmap(NULL, fs.st_size, PROT_READ, MAP_SHARED, fd, 0);
  p2 = p1;
  p3 = p1+fs.st_size;

  // Do the actual counting.
  c1 = 0;
  c2 = 0;
  while (p2 < p3 && *p2 != '>') ++p2;
  while (*p2 == '>') {
    ++c1;
    while (p2 < p3 && *p2 != '\n') ++p2;
    while (p2 < p3 && *p2 != '>') {
      if (*p2 >= 'A' && *p2 <= 'Z') ++c2;
      ++p2;
    }
  }

  // Print result.
  printf("File contains %d sequences with average length %.0f\n", c1, (float)c2/c1);

  // Unmap and close file.
  munmap(p1, fs.st_size);
  close(fd);

}

On my server it takes only 0.43s wall time (0.37s CPU time) to process uniprot_sprot.fasta

ADD COMMENT
0
Entering edit mode

maybe you can put the code to compile this program?

ADD REPLY
0
Entering edit mode
gcc -O3 -o avglength avglength.c
ADD REPLY
0
Entering edit mode

It's C, but it's not ANSI :-)

ADD REPLY
0
Entering edit mode

please enlighten me ... which part of it is not ANSI? (ok, apart from the // that should be /* */ if I remember right.)

ADD REPLY
0
Entering edit mode

please enlighten me - which part is not ANSI-C? (ok, I know that the // comments should be /* */)

ADD REPLY
0
Entering edit mode

unistd.h , open() , mmap(), sys/*.h, fcntl.h are not defined in the ANSI-C spec.

ADD REPLY
0
Entering edit mode

but I'm just being picky :-)

ADD REPLY
7
Entering edit mode
14.5 years ago

something like a copy+paste from a previous question. It won't be the shortest code, but it may be the fastest. The following code is a FLEX lexer, with the following rules

  • increment the number of sequences for each sequence line ( /^>.*n/ )
  • increment the total size for each symbol of the sequence
  • ignore the spaces
  • raise an error for the other cases:

%{
#include <stdio.h>
#include <stdlib.h>
static long sequences=0;
static long size=0L;
%}
%option noyywrap

%%
^>.*\n   {
     sequences++;
     }

^[A-Za-z]+  {size+=yyleng;}

[ \t\n] ;

.   {
    fprintf(stderr,"Illegal character  \"%s\".", yytext);
    exit(EXIT_FAILURE);
    }

%%

int main(int argc,char** argv)
    {
    yylex();
    if(sequences==0) return EXIT_FAILURE;
    printf("%f\n",(double)size/(double)sequences);
    return EXIT_SUCCESS;
    }

Compilation:

flex -f golf.l
gcc -O3 -Wall lex.yy.c

Test:

wget "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"
gunzip uniprot_sprot.fasta.gz ##232Mo

time ./a.out < uniprot_sprot.fasta

352.669703

real    0m0.692s
user    0m0.572s
sys 0m0.120s
ADD COMMENT
1
Entering edit mode

Definitely faster:

time python mean_length.py uniprot_sprot.fasta

352.669702844
11.04s user 0.20s system 100% cpu 11.233 total
ADD REPLY
0
Entering edit mode

Thanks for testing Simon

ADD REPLY
7
Entering edit mode
14.5 years ago
Science_Robot ★ 1.1k

In Ruby!

a,b=0,[];File.new(ARGV[1]).each do |i|;if i[0]!=62;a+=i.length-1;else;b.push a;a=0;end;end;puts b.inject{|s,v|s+=v}/b.length.to_f

130 Chars, works w/ wrapped FASTA files.

Uncondensed:

a, b = 0, []
File.new(ARGV[0]).each do |i|
  if i[0]!=62
    a += i.length-1
  else
    b.push a
    a=0
  end
end;
puts b.inject{ |s, v| s+=v }/b.length.to_f

Just now learning Ruby so help me out.

ADD COMMENT
0
Entering edit mode

You're definitely not late, as the correct answer will be decided in 3 days and all solutions are welcomed! :) The assumption that the fasta files are not wrapped is probably not met most of the time, however. Cheers!

ADD REPLY
7
Entering edit mode
14.5 years ago

A Clojure version using BioJava to handle the Fasta parsing:

(import 'org.biojava.bio.seq.io SeqIOTools))
(use '[clojure.java.io])

(defn seq-lengths [seq-iter]
  "Produce a lazy collection of sequence lengths given a BioJava StreamReader"
  (lazy-seq
    (if (.hasNext seq-iter)
      (cons (.length (.nextSequence seq-iter)) (seq-lengths seq-iter)))))

(defn fasta-to-lengths [in-file seq-type]
  "Use BioJava to read a Fasta input file as a StreamReader of sequences"
  (seq-lengths (SeqIOTools/fileToBiojava "fasta" seq-type (reader in-file))))

(defn lazy-avg [coll]
  "Collect the count and number of values in a collection in one pass.

  1 define the function that increments the sum and counts.
  2 reduce the collection, updating the sum and count, and assign to variables
  3 divide the sum by count, keeping it safe in case of 0 division
  "
  (let [f (fn [[s c] val] [(+ s val) (inc c)])
        [sum cnt] (reduce f [0 0] coll)]
    (if (zero? cnt) 0 (/ sum cnt))))

(when *command-line-args*
  (println
    (lazy-avg (apply fasta-to-lengths *command-line-args*))))

and a more specific implementation without using external libraries for FASTA:

(use '[clojure.java.io])
(use '[clojure.contrib.str-utils2 :only (join)])

(defn fasta-lengths [in-file]
  "Generate collection of FASTA record lengths, splitting at '>' delimiters"
  (->> (line-seq (reader in-file))
    (partition-by #(.startsWith ^String % ">"))
    (filter #(not (.startsWith ^String (first %) ">")))
    (map #(join "" %))
    (map #(.length ^String %))))

(defn lazy-avg [coll]
  "Collect the count and number of values in a collection in one pass.

  1 define the function that increments the sum and counts.
  2 reduce the collection, updating the sum and count, and assign to variables
  3 divide the sum by count, keeping it safe in case of 0 division
  "
  (let [f (fn [[s c] val] [(+ s val) (inc c)])
        [sum cnt] (reduce f [0 0] coll)]
    (if (zero? cnt) 0 (/ sum cnt))))

(when *command-line-args*
  (println
    (lazy-avg (fasta-lengths (first *command-line-args*)))))

I've been iterating over a couple versions of this to improve performance. This thread on StackOverflow has more details for those interested.

% time cljr run fasta_counter.clj uniprot_sprot.fasta PROTEIN 
60943088/172805
11.84s
ADD COMMENT
7
Entering edit mode
14.5 years ago

Erlang special golfing 213 chars version:

-module(g).
-export([s/0]).
s()->open_port({fd,0,1},[in,binary,{line,256}]),r(0,0),halt().
r(C,L)->receive{_,{_,{_,<<$>:8,_/binary>>}}}->r(C+1,L);{_,{_,{_,Line}}}->r(C,L+size(Line));_->io:format("~p~n",[L/C])end.

Readable but reliable version:

-module(g).
-export([s/0]).
s()->
  P = open_port({fd, 0, 1}, [in, binary, {line, 256}]),
  r(P, 0, 0),
  halt().
r(P, C, L) ->
  receive
    {P, {data, {eol, <<$>:8, _/binary>>}}} ->
      r(P, C+1, L);
    {P, {data, {eol, Line}}} ->
      r(P, C, L + size(Line));
    {'EXIT', P, normal} ->
      io:format("~p~n",[L/C]);
    X -> io:format("Unexpected: ~p~n",[X]),exit(bad_data)
  end.

Compile:

$ erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files g.erl

Invocation:

$ time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta 
352.6697028442464

real    0m3.241s
user    0m3.060s
sys     0m0.124s
ADD COMMENT
7
Entering edit mode
14.5 years ago
Tom Walsh ▴ 550

OCaml:

[?]

To give you some idea of the performance, the natively compiled version takes about 1.4s (real time) on my laptop vs 0.6 s for the C version posted by Lars Juhl Jensen.

ADD COMMENT
6
Entering edit mode
14.5 years ago

[?]

$totalLength   = 0;
$nbSequences   = 0;
$averageLength = 0;

$fastaFile = fopen("./database/uniprot_sprot.fasta", "r");
if ($fastaFile) {
    while (!feof($fastaFile)) {
        $line = fgets($fastaFile, 4096); 
        if(preg_match('/^>/', $line)) {
          $nbSequences++;
        }
        else {
          $totalLength += strlen(trim($line));
        }

    }
    fclose($fastaFile);
}

$averageLength = ($totalLength / $nbSequences);

print "Total sequences : " . $nbSequences   . "\n";
print "Total length    : " . $totalLength   . "\n";
print "Average length  : " . $averageLength . "\n";

Output :

[?]

[?]

[?]

ADD COMMENT
6
Entering edit mode
14.5 years ago
Science_Robot ★ 1.1k

Can I post more than one?

grep/awk:

grep '^[GATC].*' f | awk '{sum+=length($0)}END{print sum/NR}'

where f is the filename ;)

62 chars, btw.

ADD COMMENT
0
Entering edit mode

dangit... this too only works for unwrapped fasta files. shoot me!

ADD REPLY
5
Entering edit mode
14.5 years ago

Another answer. Last time I used FLEX, I now use the GNU BISON parser generator. Here I've created a simple grammar to describe a Fasta File.Of course this solution is slower than Flex :-)

%{

#include <stdio.h>
#include <ctype.h>

static long sequences=0L;
static long total=0L;
%}
%error-verbose 
        %union {
 int count;
}

%token LT
%token<count> SYMBOL
%token CRLF
%token OTHER

%start file

%{
void yyerror(const char* message)
{
fprintf(stderr,"Error %s\n",message);
}

int yylex()
{
int c=fgetc(stdin);
if(c==-1) return EOF;
if(c=='>') return LT;
if(c=='\n') return CRLF;
if(isalpha(c)) return SYMBOL;
return OTHER;
}
%}
%%

file:  fasta | file fasta;
fasta: header body { ++sequences; };
header: LT noncrlfs crlfs;
body: line |  body line;
line: symbols crlfs;
symbols: symbol {++total;}| symbols symbol {++total;};
symbol: SYMBOL;
crlfs: crlf | crlfs crlf;
crlf: CRLF;
noncrlfs:noncrlf | noncrlfs noncrlf;
noncrlf: SYMBOL| OTHER | LT;
%%

int main(int argc,char **argv)
{
yyparse();
if(sequences>0L) fprintf(stdout,"%f\n",(total/(1.0*sequences)));
return 0;
}

Compilation:

bison golf.y
gcc -Wall -O3 golf.tab.c

Test:

time ./a.out  < uniprot_sprot.fasta 
        352.669703

real0m9.723s
user0m9.633s
sys0m0.080s
ADD COMMENT
5
Entering edit mode
14.5 years ago
David W 4.9k

And now, with awk:

$awk '$1!~ /^>/ {total += length; count++} END {print total/count}' uniprot_sprot.fasta

Still not smoking fast but down to one line (for certain values of "one" and "line")

EDIT

And,as Pierre points out below, I got this wrong again, as it doesn't account for wrapped lines. This one does work:

$ awk '{if (/^>/) {record ++} else {len += length}} END {print len/record}' uniprot_sprot.fasta
ADD COMMENT
4
Entering edit mode

It returns the mean length of the fasta lines. We need the mean length of the whole fasta sequences.

ADD REPLY
0
Entering edit mode

right, should have worked that out really (i'd tested it on an unwrapped fasta file)

ADD REPLY
5
Entering edit mode
14.5 years ago

Using MYSQL, yes we can:

create temporary table T1(seq varchar(255));
LOAD data local infile "uniprot_sprot.fasta" INTO TABLE T1 (seq);
select @SEQ:=count(*) from T1 where left(seq,1)=">";
select @TOTAL:=sum(length(trim(seq))) from T1 where left(seq,1)!=">";
select @TOTAL/@SEQ;

Result:

me@linux-zfgk:~> time mysql -u root -D test < golf.sql
@SEQ:=count(*)
518415
@TOTAL:=sum(length(trim(seq)))
182829264
@TOTAL/@SEQ
352.669702844000000000000000000000

real    0m44.224s
user    0m0.008s
sys     0m0.008s
ADD COMMENT
4
Entering edit mode
14.5 years ago
David W 4.9k

Here's another python one, it's really the same as Simon's but using list expressions instead of lamdas (I find them more readable, you might not)

import sys
from Bio import SeqIO
lengths = [len(seq) for seq in SeqIO.parse(sys.argv[1], "fasta")]
print sum(lengths)/float(len(lengths))

For SeqIO to accept filenames as strings (instead of handles) you need biopython 1.54. And I'm sure this will be slower than any of the other options already presented ;)

EDIT:

As Eric points out below this pumps all the lengths into one big list, which isn't exactly memory extensive ;)

I can hang on to my completely irrational fear of lamda/map and itertools with a generator expression:

lengths = (len(seq) for seq in SeqIO.parse(sys.argv[1], "fasta"))
total= 0
for index, length in enumerate(lengths):
    total += length
print total/float(index + 1)

Still pretty slow, about a minute on computer, so probably not a goer for the original challenge!

ADD COMMENT
0
Entering edit mode

@david. Slow, I think, is not the problem here. For very large fasta files, your code is going to create a list of many thousand hundreds (or millions) of integers, which is going to be memory intensive. The best thing to do I guess would be to test it with the data that Pierre used for his test, the uniprot_sprot.fasta file :) Cheers

ADD REPLY
0
Entering edit mode

Oh, yeah, Read the specficication ...

Just added a generator expression version, interestingly enough the uniprot_sprot file doesn't make too much of a footprint with ~500 000 integers. But both methods are really too slow for what you want to do ;(

ADD REPLY
0
Entering edit mode

I guess I should have tested the memory footprint before putting the comment in the first place. A list of 50,000,000 integers will take about 1 Gb or so (just saturated the memory of my old faithful laptop trying).

ADD REPLY
4
Entering edit mode
14.5 years ago

A third variation, this time using Javacc (the Java Compiler Compiler):

options { STATIC=false;}

PARSER_BEGIN(Golf)
public class Golf
    {
    private int sequences=0;
    private int total=0;
    public static void main(String args[]) throws Exception
        {
        new Golf(new java.io.BufferedInputStreamSystem.in)).input();
        }
    }
PARSER_END(Golf)

TOKEN: /*RESERVED TOKENS FOR UQL */
{
      <GT: "="">">
   |  <CRLF: ("\n")+="">
   |  <SYMBOLS: (["A"-"Z",="" "a"-"z"])+="">
   |  <OTHER: (~["A"-"Z",="" "a"-"z","\n","="">"]) >
}

void input():{} { (fasta() )+ {System.out.println(total/(double)sequences);} }
void fasta():{} { header() (line())*}
void header():{} {  <GT> (<SYMBOLS>|<OTHER>|<GT>)* <CRLF> {sequences++;}}
void line():{Token t;} {  t=<SYMBOLS> <CRLF> { total+=t.image.length(); }}

Compile

javacc Golf.javacc
javac Golf.java

Test

time java Golf < uniprot_sprot.fasta 
352.6697028442464

real    0m11.014s
user    0m11.529s
sys 0m0.180s
ADD COMMENT
4
Entering edit mode
14.5 years ago
Tom Walsh ▴ 550

Perl 6. Painfully slow, so I'm probably doing something dumb that isn't obvious to me.

[?]

ADD COMMENT
0
Entering edit mode

one suggestion: in the case of multi-line FASTA files, the more common scenario is that any given line will NOT be a header line. thus, one speedup would be to test for "ne '>'" first in the while loop. not sure what speeds you're seeing but you could also try to "slurp" the file into memory first.

ADD REPLY
0
Entering edit mode

OK I'll give it a go. It's still much slower than Perl5 doing the same thing, i.e. reading line by line.

ADD REPLY
0
Entering edit mode

Maybe a grammar would be the best option? It is Perl 6...

ADD REPLY
0
Entering edit mode

I might give a grammar a go but I don't expect it'll help much. Even basic I/O - reading a file line by line = is still an order of magnitude slower than in Perl 5.

ADD REPLY
3
Entering edit mode
14.5 years ago

another [?]fastest[?] solution in 'C', with a lot of micro-optimization

#include <stdio.h>
#define BUFFER_SIZE 10000000
int main(int argc,char** argv)
    {
    size_t i;
    char buffer[BUFFER_SIZE];
    long total=0L;
    long sequences=0L;
    size_t nRead=0;
    size_t inseq=1;
    while((nRead=fread(buffer,sizeof(char),BUFFER_SIZE,stdin))>0)
        {
        for(i=0;i<nRead;++i)
            {
            switch(buffer[i])
                {
                case '>': if(inseq==1) ++sequences; inseq=0;break;
                case '\n': inseq=1;break;
                case 'A': case 'B': case 'C': case 'D': case 'E': case 'F':
                case 'G': case 'H': case 'I': case 'J': case 'K': case 'L':
                case 'M': case 'N': case 'O': case 'P': case 'Q': case 'R':
                case 'S': case 'T': case 'U': case 'V': case 'W': case 'X':
                case 'Y': case 'Z':
                case 'a': case 'b': case 'c': case 'd': case 'e': case 'f':
                case 'g': case 'h': case 'i': case 'j': case 'k': case 'l':
                case 'm': case 'n': case 'o': case 'p': case 'q': case 'r':
                case 's': case 't': case 'u': case 'v': case 'w': case 'x':
                case 'y': case 'z': if(inseq==1) ++total;break;
                default: break;
                }
            }
        }
    printf("%f\n",(total/(1.0*sequences)));
    return 0;
    }

Compilation:

gcc -O3 -Wall golf.c

Result:

time ./a.out < uniprot_sprot.fasta 
352.669703

real    0m0.952s
user    0m0.808s
sys 0m0.140s
ADD COMMENT
1
Entering edit mode

Pierre, not to blow my own horn, but on my machine your C implementation is more than 2x slower than the one I posted ;)

ADD REPLY
1
Entering edit mode

but my solution is ANSI :-))

ADD REPLY
0
Entering edit mode

oh nooooooo :-)

ADD REPLY
0
Entering edit mode

Touché! I guess the price of standard compliance is 2-3x loss in speed? Sounds about right to me ;-)

ADD REPLY
0
Entering edit mode

Touché! I guess the price to pay for standards compliance is 2-3x loss in execution speed? Sounds about right to me ;-)

ADD REPLY
0
Entering edit mode

ab-so-lu-te-ly ! :-)

ADD REPLY
1
Entering edit mode
6.3 years ago

this is a nim solution

import strutils

var
  line: string
  seqCnt = 0
  charCnt = 0

proc main()=
  for line in stdin.lines:
    if line.startsWith('>'):
      inc seqCnt
    else:
      charCnt += len(line)
  echo (charCnt/seqCnt)
main()

compilation

nim compile golf.nim

result

time ./golf < uniprot_sprot.fasta
358.9318342665173

real    0m2.293s
user    0m0.892s
sys     0m0.147s

(uniprot is a big larger now so the answer is slightly different)

ADD COMMENT
1
Entering edit mode
3.8 years ago
jena ▴ 320

Using the seqmagick utility (an imagemagick-inspired interface to biopython):

seqmagick info *fasta

To get just the average lengths:

seqmagick info *fasta | awk '{OFS="\t"; print $5, $1}'
ADD COMMENT
1
Entering edit mode
3.2 years ago
jena ▴ 320

The (M)AWK way

Using a modified 2nd example from mawk manual:

awk '
    />/ {nseq++}
    !/>/ {chars += length($0)}
    END {print chars/nseq}
' input.fasta

The script counts all characters outside of header lines (i.e. the sequences) and divides this total by number of header lines (i.e. number of sequences) to get the mean length. The script works with multiline fasta (aka folded fasta).

ADD COMMENT
0
Entering edit mode

Initially, I tried to implement the same idea with grep and wc, but the number I got was off (higher by a few dozen). Then I remembered that the mawk manual shows how to reimplement wc in it and I realized that it can be easily modified to solve the problem here. I tried and it worked like a charm, giving the correct answer right away.

ADD REPLY
0
Entering edit mode

And now I noticed this solution uses basically the same logic as Lars' answer above :/

ADD REPLY
0
Entering edit mode
23 months ago
jena ▴ 320

My first Julia ever (no libs)

This is a rough translation of my mawk answer. I think this is the bare minimum Julia code to process some file:

# set globals with types (Julia 1.8+; else use `const`?)
nseq::Int = 0
len::Int = 0

# read STDIN
for line in eachline(stdin)
    if(occursin(">", line) == true)
        global nseq += 1
    else
        global len += length(line)
    end
end
println(len/nseq)

The script reads fasta line-by-line from standard input (stdin) and then checks for headers and accumulates counts and lengths. You call it like this:

julia meanfaslen.jl < file.fa

I've compared it against the awk solution, using human genome fastas for testing (I have hundreds of them on our cluster). I got these rough timings on a login node of the cluster (Intel Xeon, but lower spec than on compute nodes):

  • julia ~= 20 sec
  • gawk ~= 34 sec
  • mawk ~= 4 sec (!!)

I think it should be possible to optimize the code to beat the mawk, but I just started with Julia :)

ADD COMMENT
0
Entering edit mode
16 months ago
jena ▴ 320

bioawk

I realized there is no bioawk answer, so here we go:

bioawk -c fastx '{n++; l += length($seq)} END{print l/n}' input.fa.gz

Few notes:

  • bioawk uses kseq.h for parsing FASTA/Q files and is generally much faster than other awk versions (except mawk, which is really hard to beat!).
  • bioawk automatically parses the FASTA file into table with 3 columns (4 in case of FASTQ), so you can do all operations per line, without the need to handle headers vs sequences. It also means that for genomic DNA, you can have an entire chromosome in $seq - be careful with that! In my experience it was able to split chr1 of human into array of single characters within 32gb of RAM (seems to be a built-in limit of this awk).
  • bioawk supports compressed files, and not just gzip or bgzip - it handles even more exotic formats like rz (first-draft compression of bam files, before they switched to bgzip).
  • it supports several formats, like VCF, SAM, BED, GFF, FASTA/Q, as well as general tables with headers - columns can be referenced with $name, which makes the code look more readable. See format descriptons with bioawk -c help.
ADD COMMENT

Login before adding your answer.

Traffic: 2499 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6