Question

Digesting Fasta Sequences Into A Set Of Smaller Sequences

16

Entering edit mode

14.7 years ago

Eric Normandeau 11k

Hi,

I had to come up with a code last week to solve this small problem and thought it may be another nice idea for a code golf.

Here are the requirements:

Given a fasta file (potentially huge), return a fasta file containing smaller sequences of length 'n' that represent all non-overlaping subsequences of the original sequences. Rename the new sequences according to the scheme 'oldSeqName_00001" and up, where 'oldSeqName' is the original name for the sequence.

Please use this test file.

Example run :

digest_fasta(fasta_in, 60, fasta_out)

inputs:

a number of nucleotides for each sub-sequence (n)
a fasta file

EXAMPLE:

XYZ_gene CGTGAAGACGCAGTACCAGCAGAGATGGCTGGCCATCGACGGCAACGCTCGCCGCGAGAT CAAGAACTACGTTTTACAGACTCTGGGCACGGAGACGTACCGGCCCAGCTCGGCGTCGCA GTGCGTCGCCGGCATCGCCTGCGCTGAGATCCCCGTTAACCAGTGGCCCGAGCTGATCCC ACAGCTGGTGGCCAACGTCACGGACCCGTCCAGCACCGAACACATGAAGGAGTCCACGTT GGAGGCCATCGGGTACATCTGCCAGGACATCGACCCGGAGCAGCTGCAGGAGAACGCCAA CCAGATCCTGACGGCCATCATCCAGGGCATGAGGAAGGAGGAGCCCAGTAACAACGTGAA GCTGGCCGCGACTAACGCTCTGCTCAACTCGCTGGAGTTCACTAAAGCCAACTTTGACAA GGAGACGGAGAGACACTTCATCATGCAGG

outputs: - a fasta file

EXAMPLE (after digesting in 60pb long fragments):

XYZ_gene_00001 CGTGAAGACGCAGTACCAGCAGAGATGGCTGGCCATCGACGGCAACGCTCGCCGCGAGAT XYZ_gene_00002 CAAGAACTACGTTTTACAGACTCTGGGCACGGAGACGTACCGGCCCAGCTCGGCGTCGCA XYZ_gene_00003 GTGCGTCGCCGGCATCGCCTGCGCTGAGATCCCCGTTAACCAGTGGCCCGAGCTGATCCC XYZ_gene_00004 ACAGCTGGTGGCCAACGTCACGGACCCGTCCAGCACCGAACACATGAAGGAGTCCACGTT XYZ_gene_00005 GGAGGCCATCGGGTACATCTGCCAGGACATCGACCCGGAGCAGCTGCAGGAGAACGCCAA XYZ_gene_00006 CCAGATCCTGACGGCCATCATCCAGGGCATGAGGAAGGAGGAGCCCAGTAACAACGTGAA XYZ_gene_00007 GCTGGCCGCGACTAACGCTCTGCTCAACTCGCTGGAGTTCACTAAAGCCAACTTTGACAA XYZ_gene_00008 GGAGACGGAGAGACACTTCATCATGCAGG

And so on for all the sequences in the 'fasta_in' file!

For this week's requirement, I'll try to be clearer: The program should run under linux.

That's it!

Use whatever language, approach, library you favor! Submit multiple answers if they are significantly different!

I'll accept the most voted answer on thursday around 16h Eastern Canada. Do vote for any answer you find interesting, even if they come in late!

Just like last week, diversity will be very appreciated!

Can't wait to see you code ;)

CONCLUSION: Thanks to all who participated with their answers or discussions! It's a pleasure to see all the different approaches and learn new tricks :)

Cheers

code fasta sequence codegolf • 21k views

ADD COMMENT • link updated 6.6 years ago by zx8754 12k • written 14.7 years ago by Eric Normandeau 11k

2

Entering edit mode

Dear whoever's downvoting without saying why -- it'd be much more constructive to find out why you didn't like a particular solution

ADD REPLY • link 14.7 years ago by Andrew Clegg ▴ 310

1

Entering edit mode

Yes keep deleting comments that are addressed or obsolete.

ADD REPLY • link 14.7 years ago by Istvan Albert 102k

0

Entering edit mode

An example (and canonical) input file would probably help with this one.

ADD REPLY • link 14.7 years ago by Simon Cockell 7.4k

0

Entering edit mode

An example would indeed be appreciated. I'm not 100% sure what we are supposed to do this time.

ADD REPLY • link 14.7 years ago by Lars Juhl Jensen 11k

0

Entering edit mode

ok, on the way :)

ADD REPLY • link 14.7 years ago by Eric Normandeau 11k

0

Entering edit mode

I think it would be great to attach a bounty to these questions - a little more incentive to work hard for the best answer!

ADD REPLY • link 14.7 years ago by User 59 13k

0

Entering edit mode

Can we specify a common big fasta source? (This, perhaps?)

ADD REPLY • link updated 6.6 years ago by Ram 45k • written 14.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

That's a nice idea :)

ADD REPLY • link 14.7 years ago by Eric Normandeau 11k

0

Entering edit mode

Can we send the code on http://gist.github.com/ ?

ADD REPLY • link 14.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I suggest this as the "potentially huge fasta file"

ADD REPLY • link updated 6.6 years ago by Ram 45k • written 14.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Not sure to follow you @Pierre. You mean INSTEAD of here?

ADD REPLY • link 14.7 years ago by Eric Normandeau 11k

0

Entering edit mode

@Daniel Swan, I think the question has to be unanswered for a few days before you put a bounty...

ADD REPLY • link 14.7 years ago by Eric Normandeau 11k

0

Entering edit mode

@eric, I first glance I thought the code could be large. I was wrong, please ignore my comment about gist.

ADD REPLY • link 14.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

@eric I also removed dbsnp as the test file

ADD REPLY • link 14.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

@Pierre. Please edit the question to put your suggested test file :)

ADD REPLY • link 14.7 years ago by Eric Normandeau 11k

0

Entering edit mode

done. @Eric so ''stdin'' shouldn't be handled isn't it ? :-)

ADD REPLY • link 14.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Note: in the fasta id, should the length parameter be adjusted? As here:

XYZ_gene (length = 60)_00001

or should the id be unchanged? I would opt for the latter.

ADD REPLY • link updated 6.6 years ago by Ram 45k • written 14.7 years ago by Michael 55k

0

Entering edit mode

nope, that was only for example purpose, I'll delete it...

ADD REPLY • link 14.7 years ago by Eric Normandeau 11k

0

Entering edit mode

Just a random thought: Do you people mind if I delete some comments as issues get cleared and the comments become less usefull just to keep the list readable?

ADD REPLY • link 14.7 years ago by Eric Normandeau 11k

0

Entering edit mode

out of curiosity, do you have Matlab ands its associated bioinformatics toolbox?

ADD REPLY • link 14.7 years ago by Will 4.6k

0

Entering edit mode

@Will negative for Matlab. I haven't toyed with it since my 'engineer' in training days, about 10 years ago :) It seems to me that with so many viable open and free solutions out there, Matlab is not a first choice for many people doing research (thinking mainly of the grad students here). What is there that makes Matlab an interesting tool in you opinion?

ADD REPLY • link 14.7 years ago by Eric Normandeau 11k

0

Entering edit mode

Just out of interest, what are you using the n-mers in the output file for?

ADD REPLY • link 14.7 years ago by Cassj ★ 1.3k

0

Entering edit mode

@CassJ Just one example. I could have long contigs (10 t0 50 kpb) representing assembly of 454 data sequenced from a 100+ kbp insert in which I know that there is a gene of interest, maybe in multible copies, but also possibly other genes. I can use 500 pb sub-sequences and blast them against a db (nr, swissprot, a custom DNA db) and try to identify the position of that gene.

ADD REPLY • link 14.7 years ago by Eric Normandeau 11k

0

Entering edit mode

Code golf: it's learning, and advocacy, and showing off all rolled into one! What programmer wouldn't love that?

Last week you suggested discussing problems on a google group. I went to google groups and searched for "project mendel" and saw the existing thread on the biostar-central group. I thought maybe it wouldn't be neighborly (to the people participating in the current thread) to start a separate thread elsewhere, but on the other hand I don't know if biostar-central really wants a big question thread. Of course, that's assuming that we'd get enough people to make a big thread!

ADD REPLY • link 14.7 years ago by Mitch Skinner ▴ 660

0

Entering edit mode

Plus, I've been thinking about my own work and trying to imagine good questions, but so far I haven't thought of many problems that are the right size. Maybe one so far.

ADD REPLY • link 14.7 years ago by Mitch Skinner ▴ 660

0

Entering edit mode

how to install digest_fasta in linux ubuntu?? Please kindly suggest me.. thanking you

ADD REPLY • link 6.1 years ago by tamil05surya • 0

Ram · Answer 1 · 2010-07-26

15

Entering edit mode

14.7 years ago

Andrew Clegg ▴ 310

GNU Coreutils version, for single-sequence file only:

grep -v '^>' chr1.fa | tr -d '\n' | fold -w 60 | nl -n rz -s '
' | sed 's/^/>fragment_/;N' > output

The naming convention's not the same as the original question, but that's true of a lot of these :-)

Speed:

real    0m17.182s
user    0m19.255s
sys     0m1.634s

Memory usage: practically 0, it's all done via streams.

Multi-sequence version later, maybe.

EDITED to change first regexp from > to ^> which shaved about a second off.

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 14.7 years ago by Andrew Clegg ▴ 310

0

Entering edit mode

cool ! I didn't know the fold cmd !

ADD REPLY • link 14.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Neither did I, but I knew ONE of the coreutils must be able to do that ;-)

ADD REPLY • link 14.7 years ago by Andrew Clegg ▴ 310

Ram · Answer 2 · 2010-07-26

12

Entering edit mode

14.7 years ago

Lars Juhl Jensen 11k

Take 2 for a Perl command line script:

perl -lne '$s.=$_ unless /^>/; if(/^>/||eof){$i=0;$s=~s/(.{1,60})/printf("%s_%05d\n%s\n",$n,++$i,$1);/ge; $n=$_;}' in.fa > out.fa

In contrast to take 1, this one actually works. Runtime is 13 seconds for human chromosome

The number 60 inside the script is the length of subsequences to be extracted.

Edit: It just occurred to me that it can be done slightly less ugly. Take 3:

perl -ne 'BEGIN{$/=">"}{s/(.*)//;$n=$1;s/\n//g;$i=0;s/(.{1,60})/printf(">%s_%05d\n%s\n",$n,++$i,$1);/ge;}' in.fa > out.fa

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 14.7 years ago by Lars Juhl Jensen 11k

0

Entering edit mode

@Lars Thanks for the answer (even if I don't understand it yet... I only started Perl this week end ;) Please consider editing your posts rather than deleting them and posting a new one. Cheers!

ADD REPLY • link 14.7 years ago by Eric Normandeau 11k

0

Entering edit mode

Thanks - I thought that deleting would delete immediately. And when it didn't, I assumed that undelete could undelete it ... but no, I could only vote for undeleting. Not exactly intuitive if you ask me ;) Had I know that, I would obviously have edited instead.

ADD REPLY • link 14.7 years ago by Lars Juhl Jensen 11k

0

Entering edit mode

i have used this 3 one. but the header line is repeated in all reads. i want to remove all header lines except the first one. plz help me. thank you in advance

ADD REPLY • link 6.3 years ago by tamil05surya • 0

Ram · Answer 3 · 2010-07-26

11

Entering edit mode

14.7 years ago

brentp 24k

if you have pyfasta installed:

$ pyfasta split -n 1 -k 60 chr1.fa

does what you want, with headers like

>chr1_1
>chr1_61

you can also split into multiple files and with overlapping sequence like:

$ pyfasta split -n 3 -k 60 -o 20 chr1.fa

which will make 60 (-k) mers where each overlaps the previous by 20 (-o) bp and they will be split as evenly as possible among 3 (-n) files.

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 14.7 years ago by brentp 24k

1

Entering edit mode

Cool! The best code is always the one that is already written!

ADD REPLY • link 14.7 years ago by Istvan Albert 102k

0

Entering edit mode

I think this demonstrates the power of your brainchild :) I'll spend some time learning it and digging through the code. Thanks!

ADD REPLY • link 14.7 years ago by Eric Normandeau 11k

0

Entering edit mode

@eric, the code responsible for that functionality could use a lot of cleanup. if you look through it feel free to fix up and send a pull request via github: http://github.com/brentp/pyfasta :)

ADD REPLY • link 14.7 years ago by brentp 24k

0

Entering edit mode

@brentp, I guess you are refering only to the code in the split_fasta.py file? (this file's only import is Fasta from pyfasta.py, which I guess should be to your liking).

ADD REPLY • link 14.7 years ago by Eric Normandeau 11k

0

Entering edit mode

@eric yes, that's it.

ADD REPLY • link 14.7 years ago by brentp 24k

Ram · Answer 4 · 2010-07-26

8

Entering edit mode

14.7 years ago

Neilfws 49k

My last solution: the EMBOSS program named splitter:

splitter -sequence myfile.fa -size 60 -outseq outfile.fa

Again, header lines look like >XYZ_gene_1-60 (length = 449).

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 14.7 years ago by Neilfws 49k

0

Entering edit mode

Nice. Never knew about that one, thanks.

ADD REPLY • link 14.7 years ago by Aaronquinlan 12k

0

Entering edit mode

that is nice! i'm going to check out all the EMBOSS tools now.

ADD REPLY • link 14.7 years ago by brentp 24k

Ram · Answer 5 · 2010-07-26

5

Entering edit mode

14.7 years ago

Rajarshi Guha ▴ 880

BioPython based code - not particularly memory efficient (for that I'd process the fasta file directly, rather than go via a toolkit)

from Bio import SeqIO, Seq
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import sys

def chunk(s, n):
    for i in range(0, len(s), n): yield s[i:i+n]

def process(filename, chunkSize):
    seqs = SeqIO.parse(open(filename), 'fasta')
    ohandle = open('output.fasta', 'w')
    for seq in seqs:
        chunks = []
        n = 1
        for substr in chunk(seq.seq.tostring(), chunkSize):
            chunks.append(SeqRecord( Seq(substr), id='%s_%05d' % seq.id, n), description=seq.description ))
            n += 1
    SeqIO.write(chunks, ohandle, 'fasta')
    ohandle.close()

if __name__ == '__main__':
    process(sys.argv[1], int(sys.argv[2]))

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 14.7 years ago by Rajarshi Guha ▴ 880

0

Entering edit mode

+1 However, please make the length (here 60) sys.argv[2]) ;)

ADD REPLY • link updated 6.6 years ago by Ram 45k • written 14.7 years ago by Eric Normandeau 11k

0

Entering edit mode

+1 However, please make the length (here 60) sys.argv[2] and ohandle use sys.argv[3] ;)

ADD REPLY • link updated 6.6 years ago by Ram 45k • written 14.7 years ago by Eric Normandeau 11k

Ram · Answer 6 · 2010-07-26

A python itertools solution capable of dealing with virtually infinite sized fasta file.

import sys
from itertools import chain, groupby, imap, islice

def fasta_iter(handle):
    """Yields headers and sequences as generators."""
    line_iter = imap(lambda x: x.strip(), handle)
    for is_header, group in groupby(line_iter, lambda x: x.startswith('>')):
        if is_header:
            header = group.next()
        else:
            seq = chain.from_iterable(group)
            yield header, seq

def take(n, iterable):
    return list(islice(iterable, n))

def digest(in_file, width, out_file):

    for header, seq in fasta_iter(in_file):
        siter = iter(seq)       
        block = ''.join(take(width, siter))
        c = 0
        while len(block):
            c += 1
            print c
            out_file.write('%s_%06d\n%s\n' % (header, c, block))    
            block = ''.join(take(width, siter))

if __name__ == '__main__':
    args = sys.argv[1:]
    ifile = args[0]
    width = int(args[1])
    ofile = args[2]
    with open(ifile) as ihandle:
        with open(ofile, 'w') as ohandle:
            digest(ihandle, width, ohandle)

score 5 · Answer 7 · 2010-07-28

Haskell

I'm still pretty much a newbie to Haskell, although I have been learning a lot these past two weeks. The first version of this that I wrote was crazy slow, and used a lot of memory to boot. But this version runs in constant space--I ended up re-writing it to use explicit tail recursion, and I'm once again using lazy I/O to keep the core code purely functional.

Speed wise, since all of the running time measurements on different machines can't really be compared, we could instead compare the running times of our solutions to the speed of a program that we've all got installed. If I send the output of my program to /dev/null, then it has similar I/O characteristics to 'wc', so I'd like to suggest that as a measuring stick.

At a digest size of 60 on chr1.fa, this code (writing to /dev/null) runs in a little over 15 seconds on my machine, which is about 1.6 times the time it takes to run 'wc' on the chr1.fa file. It supports multiple-sequence files and appears to handle most of the corner cases right.

edit: There's something I still don't get about lazy IO with file handles. It works just fine if I use the built-in "interact" (convenience function for taking input from stdin and writing to stdout), though. My original posted version was losing part of the end of the output, but this one doesn't. This version's run-time is almost 1.7 times the wc time.

[?]

Example:

[?]

Thanks for the problem, Eric!

Ram · Answer 8 · 2010-07-26

3

Entering edit mode

14.7 years ago

Neilfws 49k

If you install the auxiliary scripts when you install Bioperl, you'll find a script for this purpose named bp_split_seq. Run it like this (assuming chunk size = 60):

bp_split_seq -c 60 -f myfile.fa

You can also specify overlaps and there's an option to index the output sequences. It does not quite do what's asked in the question: split sequences are stored separately in a directory named split, with names such as XYZ_gene_1-61.faa and headers like >XYZ_gene_121-181 (length = 449), but it's easy enough to concatenate them if required.

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 14.7 years ago by Neilfws 49k

0

Entering edit mode

NOTE: Please consider leaving a comment if you downvote an answer, cheers!

ADD REPLY • link 14.7 years ago by Eric Normandeau 11k

0

Entering edit mode

Someone is going after several of my answers in rapid succession; guess I offended them :-)

ADD REPLY • link 14.7 years ago by Neilfws 49k

0

Entering edit mode

Still very interesting and the output is almost similar to what pyfasta does in terms of the sub-sequences naming convention. Thanks!

ADD REPLY • link 14.7 years ago by Eric Normandeau 11k

Ram · Answer 9 · 2010-07-26

3

Entering edit mode

14.7 years ago

Pierre Lindenbaum 166k

Re-inventing the wheel is FUN :-)

My solution using GNU/Flex + C:

%{
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <assert.h>

#define MIN(a,b) (a<b?a:b)

static char* title=NULL;
static char* buffer=NULL;
static unsigned long fragment_size=0L;
static unsigned long buffer_size=0;
static int sequence_index=0;

static void overflow()
    {
    ++sequence_index;
    fprintf(stdout,"%s_%05d\n",title,sequence_index);
    fwrite(buffer,sizeof(char),buffer_size,stdout);
    fputc('\n',stdout);
    buffer_size=0;  
    }

%}
%option never-interactive
%option noyywrap
%%
^>.*\n   {
     if(buffer_size>0) overflow();
     title=realloc(title,sizeof(char)*yyleng);
     if(title==NULL)
    {
    fprintf(stderr,"Out of memory\n");
    exit(EXIT_FAILURE);
    }
     strncpy(title,yytext,yyleng-1);
     sequence_index=0;
     }

^[A-Za-z]+  {
    int nCopy=0;
    while(nCopy!=yyleng)
        {
        int len=MIN(fragment_size - buffer_size,yyleng-nCopy);
        memcpy(&amp;buffer[buffer_size],&amp;yytext[nCopy],len);
        buffer_size+=len;
        nCopy+=len;
        if(buffer_size==fragment_size)
            {
            overflow();
            }
        }
    }

[ \t\n] ;

.   {
    fprintf(stderr,"Illegal character  \"%s\".", yytext);
    exit(EXIT_FAILURE);
    }

<<EOF>> {
    if(buffer_size >0) overflow();
    free(title);
    title=NULL;
    buffer_size=0;
    return 0;
    }

%%

int main(int argc,char** argv)
    {
    int optind=1;

    while(optind<argc)
        {
        if(strcmp(argv[optind],"-h")==0)
            {
            fprintf(stderr,"Options:");
            fprintf(stderr," -n <int> chunk size");
            return EXIT_SUCCESS;
            }
        else if(strcmp(argv[optind],"-n")==0)
            {
            fragment_size=atol(argv[++optind]);
            }
        else if(strcmp(argv[optind],"--")==0)
            {
            ++optind;
            break;
            }
        else if(argv[optind][0]=='-')
            {
            fprintf(stderr,"bad options \"%s\".\n",argv[optind]);
            return  EXIT_FAILURE;
            }
        else
            {
            break;
            }
        ++optind;
        }
    if(fragment_size<=0UL)
        {
        fprintf(stderr,"bad size: %ld\n",fragment_size);
        exit(EXIT_FAILURE);
        }
    buffer=malloc(sizeof(char)*fragment_size);
    if(buffer==NULL)
        {
        fprintf(stderr,"Out of memory\n");
        exit(EXIT_FAILURE);
        }
    if(optind==argc)
        {
        yylex();
        }
    else
        {
        while(optind< argc)
            {
            char* filename=argv[optind++];
            errno=0;            
            yyin=fopen(filename,"r");
            if(yyin==NULL)
                {
                fprintf(stderr,"Cannot open %s. %s\n",filename,strerror(errno));
                exit(EXIT_FAILURE);
                }
            yylex();
            fclose(yyin);
            }
        }
    free(buffer);
    return 0;
    }

Compilation:

flex -f golf.l
gcc -Wall -O3 lex.yy.c

Test:

time ./a.out -n 60 chromosomes/chr1.fa > /dev/null 
real    0m2.280s
user    0m2.176s
sys 0m0.108s

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 14.7 years ago by Pierre Lindenbaum 166k

4

Entering edit mode

It's only re-inventing if it isn't better, or faster, or cooler, or shinier. If I've understood MTV correctly, I believe what you've done there is pimped the wheel ;)

ADD REPLY • link 14.7 years ago by Cassj ★ 1.3k

1

Entering edit mode

Ah, that's a markdown thing, right? I didn't realize--I've just been sticking [?] tags in my answers and manually escaping the angle brackets. I ran the gist and it takes just over 3 seconds on my machine, or a bit more than five times faster than my haskell version. Nice! I don't get why 'wc' takes three times longer than your flex code; I've been using it as a yarstick of something minimally-cpu-intensive that has to read the whole input but maybe it's not a good example. Maybe it's doing some kind of unicode or locale-based processing? I don't know.

ADD REPLY • link 14.7 years ago by Mitch Skinner ▴ 660

0

Entering edit mode

Crikey, speedy. Memory usage?

ADD REPLY • link 14.7 years ago by Andrew Clegg ▴ 310

0

Entering edit mode

@andrew memory usage = ~(buffered) size of the segment ("-n") + the size of the fasta title.

ADD REPLY • link 14.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Is this the whole thing? flex is giving me the error

golf.l:144: premature EOF

ADD REPLY • link updated 6.6 years ago by Ram 45k • written 14.7 years ago by Mitch Skinner ▴ 660

0

Entering edit mode

Mitch, did you remove the whitespaces on the left margin ?

ADD REPLY • link 14.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I put the code in GIST

ADD REPLY • link updated 6.6 years ago by Ram 45k • written 14.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Ah, that's a markdown thing, right? I didn't realize -- I've just been sticking <pre> tags in my answers and manually escaping the angle brackets. I ran the gist and it takes just over 3 seconds on my machine, or a bit more than five times faster than my haskell version. Nice! I don't get why 'wc' takes three times longer than your flex code; I've been using it as a yardstick of something minimally-cpu-intensive that has to read the whole input but maybe it's not a good example. Maybe it's doing some kind of unicode or locale-based processing? I don't know.

ADD REPLY • link updated 6.6 years ago by Ram 45k • written 14.7 years ago by Mitch Skinner ▴ 660

Ram · Answer 10 · 2010-07-26

And here the R & Bioconductor solution using the Biostrings package (edited after asking the BioC mailling list and thanks to getting helpful response).

The crucial point was actually to get an efficient solution for writing FASTA files working, so there are two solutions.

The "official" solution a bit simplified (only read 1. entry) to make the steps more clear:

require(Biostrings)
digest.fasta <-function(infas, size=60, outfas) {
  dnasts <- read.DNAStringSet(file=infas)
  dnaviews <- Views(dnasts[[1]], 
              successiveIRanges(
              rep(size, ceiling(length(dnasts[[1]]) / size))))

  write.XStringSet(as(dnaviews, "DNAStringSet"), file=outfas))
}

This is pretty much the same as my first attempt, but now I was told it works within R-devel (1.12.0) and the development version of Bioconductor, so this has been improved pretty much. Up to Biostrings_2.16.9 it's very slow though.

For the time being we need a replacement for the writeFASTA (as slow as the write.XStringSet) function. This time, made only from basic language ingredients:

writeFASTA.2 <- function(sequences, desc=names(sequences), width=80, file=stdout(), append=FALSE) {
  ## convert whatever sequence object to character vector
  sequences <- as.character(sequences)
  if(!is.null(desc) && length(sequences) != length(desc))
     stop("wrong length of 'desc'")
  ## if there are no names, make them
  desc <- if (is.null(desc))
    paste(">", seq(along=sequences))
  else
    paste(">", desc, "\n", sep="") 
  ## Adjust line width or add a \n 
  sequences <-  if (! is.null(width))
    gsub( sprintf("(.{1,%d})", width), "\\1\n", sequences )
  else
    paste(sequences, "\n", sep="")
  ## paste is faster than sprintf, print the output
  cat(paste(desc,sequences, sep="", collapse=""), sep="", file=file, append=append )
}

This function can also handle to adjust the lengths of the FASTA lines. However, if this is not needed, then this function is faster:

writeFASTA.4 <- function(dna,  desc=names(dna),  file=stdout()) {
  if (is.null(desc))
    desc <- paste(seq(along=dna))
  fasta = character(2 * length(dna))
  fasta[c(TRUE, FALSE)] <- paste(">", desc, sep="")
  fasta[c(FALSE, TRUE)] <- as.character(dna)
  writeLines(fasta, file)
}

With the whole thing looking like this:

digest.fasta <-function(infas, size=60, outfas) {
  dnasts <- read.DNAStringSet(file=infas)
  dnaviews <- Views(dnasts[[1]], 
               successiveIRanges(
               rep(size, ceiling(length(dnasts[[1]]) / size))))

  writeFASTA.4(dnaviews, file=outfas)
}

> system.time (digest.fasta(infas="chr1.fa", size=60, outfas=tempfile()) )
   user  system elapsed 
 73.345   1.396  74.815

Ram · Answer 11 · 2010-07-27

This is in line with brentp's "not-invent-wheels" mentality. In the past I have used faSplit from Kent source, it is quite fast. Usage:

faSplit - Split an fa file into several files.
usage:
   faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.
Files split by sequence will be broken at the nearest fa record boundary.
Files split by base will be broken at any base.
Files broken by size will be broken every count bases.

So for the question, the command might look like this:

faSplit size input.fa 60 outRoot

Ram · Answer 12 · 2010-07-26

A quick attempt in BioRuby just before bed - I may need to edit this in about 9 hours :-)

#!/usr/bin/ruby
require "rubygems"
require "bio"

n = ARGV[1].nil? ? 60 : ARGV[1].to_i

Bio::FlatFile.open(ARGV[0]) do |ff|
  ff.each do |seq|
    1.step(seq.length, n) { |s|
      puts ">#{seq.entry_id}_#{(s / n).to_s.rjust(5, "0")}", seq.seq[s..(s + (n - 1))]
    }
  end
end

Save and run as, e.g. for N = 70:

digest_fasta.rb myfile.fa 70 > outfile

It takes the input filename as first argument. Subsequence length is optionally the second argument or 60 by default. Then we read the fasta file, use step to get the subsequences and print out a new header + sequence, padding the sequence count with 5 zeros. Of course for a huge file, you'd need more than 5 zeros, but I guess you could calculate the required number.

There's a similar example on the BioRuby tutorial page (search for "Divide a genome sequence into sections of 10000bp") - it uses the window_search() method, but has to deal with remainders.

No idea how fast this runs for Chr1 - something for tomorrow. Oh - and as it stands, does not wrap long sequence lines.

Ram · Answer 13 · 2010-07-26

2

Entering edit mode

14.7 years ago

Aaronquinlan 12k

Assuming you have faToTab from kent src installed, you can combine it with awk.

Change the len variable in the BEGIN block accordingly.

faToTab chr1.fa stdout | awk 'BEGIN {len=60} { \
                            for (i=0; i<length($2); i+=len) \
                            { \
                              print ">"$1"_"(i/len)+1"\n"substr($2, i, len) \
                            } \
                         }' > out

Not terribly fast (19s), but simple as it can easily be dropped into a one line bash script that takes len as an argument.

Aaron

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 14.7 years ago by Aaronquinlan 12k

0

Entering edit mode

Could someone comment on why this was down-voted? Do you believe it to be incorrect?

ADD REPLY • link 14.7 years ago by Aaronquinlan 12k

0

Entering edit mode

I think someone is on a down-voting spree. I had 2 down here and 2 at another question in a short time span. Perhaps they are offended :-)

ADD REPLY • link 14.7 years ago by Neilfws 49k

Ram · Answer 14 · 2010-07-30

Here's a Clojure solution to the problem. Like Mitch, I appreciate the ideas; it's a fun chance to practice in a new language.

The tricky part with Clojure was making this fully lazy for memory efficiency while trying to remain general with the implementation. This uses the line based FASTA parser from the University of Tokyo Genome Browser Development Toolkit to handle the underlying IO and then is careful to only grab lines as needed, which scales to both big sequences and files with lots of little sequences.

(import '(org.utgenome.format.fasta FASTAPullParser))

(use '[clojure.java.io]
     '[clojure.contrib.duck-streams :only (with-out-writer)]
     '[clojure.contrib.str-utils2 :only (join)]
     '[clojure.contrib.seq :only (indexed)])

(defn seq-iterator [parser]
  "Lazily retrieve sequence lines one at a time from the FASTA parser."
  (lazy-seq
    (when-let [line (.nextSequenceLine parser)]
      (cons line (seq-iterator parser)))))

(defn partition-seq [parser piece-size]
  "Break up a sequence collection into indexed pieces, pulling lines as needed."
  (->> (seq-iterator parser)
    (mapcat seq)
    (partition-all piece-size)
    indexed))

(defn write-fasta-pieces [in-file piece-size out-file]
  (with-out-writer out-file
    (let [parser (new FASTAPullParser (file in-file))
          piece-size (. Integer parseInt piece-size)]
      (loop [work-parser parser]
        (when-let [cur-id (.nextDescriptionLine work-parser)]
          (doseq [[i s] (partition-seq work-parser piece-size)]
            (println (format ">%s_%s\n%s" cur-id (+ i 1) (join "" s)))
              )
          (recur work-parser))))))

(when *command-line-args*
  (apply write-fasta-pieces *command-line-args*))

Ram · Answer 15 · 2010-07-26

1

Entering edit mode

14.7 years ago

Rajarshi Guha ▴ 880

OK, another Python version, written for a fa file with a single seq (but still not memory efficient). Runs in 13s

import sys

def chunk(s, n):
    for i in range(0, len(s), n): yield s[i:i+n]

def process2(filename, chunkSize):
    f = open(filename, 'r')
    f.readline()
    d = f.read().replace("\n", "")
    n = 1
    for substr in chunk(d, chunkSize):
        print '>id_%05d\n%s' % (n, substr)
        n += 1

if __name__ == '__main__':
    process2(sys.argv[1], 60)

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 14.7 years ago by Rajarshi Guha ▴ 880

0

Entering edit mode

NOTE: Please consider leaving a comment if you downvote an answer, cheers!

ADD REPLY • link 14.7 years ago by Eric Normandeau 11k