Hi,
I had to come up with a code last week to solve this small problem and thought it may be another nice idea for a code golf.
Here are the requirements:
Given a fasta file (potentially huge), return a fasta file containing smaller sequences of length 'n' that represent all non-overlaping subsequences of the original sequences. Rename the new sequences according to the scheme 'oldSeqName_00001" and up, where 'oldSeqName' is the original name for the sequence.
Please use this test file.
Example run :
digest_fasta(fasta_in, 60, fasta_out)
inputs:
- a number of nucleotides for each sub-sequence (n)
- a fasta file
EXAMPLE:
XYZ_gene CGTGAAGACGCAGTACCAGCAGAGATGGCTGGCCATCGACGGCAACGCTCGCCGCGAGAT CAAGAACTACGTTTTACAGACTCTGGGCACGGAGACGTACCGGCCCAGCTCGGCGTCGCA GTGCGTCGCCGGCATCGCCTGCGCTGAGATCCCCGTTAACCAGTGGCCCGAGCTGATCCC ACAGCTGGTGGCCAACGTCACGGACCCGTCCAGCACCGAACACATGAAGGAGTCCACGTT GGAGGCCATCGGGTACATCTGCCAGGACATCGACCCGGAGCAGCTGCAGGAGAACGCCAA CCAGATCCTGACGGCCATCATCCAGGGCATGAGGAAGGAGGAGCCCAGTAACAACGTGAA GCTGGCCGCGACTAACGCTCTGCTCAACTCGCTGGAGTTCACTAAAGCCAACTTTGACAA GGAGACGGAGAGACACTTCATCATGCAGG
outputs: - a fasta file
EXAMPLE (after digesting in 60pb long fragments):
XYZ_gene_00001 CGTGAAGACGCAGTACCAGCAGAGATGGCTGGCCATCGACGGCAACGCTCGCCGCGAGAT XYZ_gene_00002 CAAGAACTACGTTTTACAGACTCTGGGCACGGAGACGTACCGGCCCAGCTCGGCGTCGCA XYZ_gene_00003 GTGCGTCGCCGGCATCGCCTGCGCTGAGATCCCCGTTAACCAGTGGCCCGAGCTGATCCC XYZ_gene_00004 ACAGCTGGTGGCCAACGTCACGGACCCGTCCAGCACCGAACACATGAAGGAGTCCACGTT XYZ_gene_00005 GGAGGCCATCGGGTACATCTGCCAGGACATCGACCCGGAGCAGCTGCAGGAGAACGCCAA XYZ_gene_00006 CCAGATCCTGACGGCCATCATCCAGGGCATGAGGAAGGAGGAGCCCAGTAACAACGTGAA XYZ_gene_00007 GCTGGCCGCGACTAACGCTCTGCTCAACTCGCTGGAGTTCACTAAAGCCAACTTTGACAA XYZ_gene_00008 GGAGACGGAGAGACACTTCATCATGCAGG
And so on for all the sequences in the 'fasta_in' file!
For this week's requirement, I'll try to be clearer: The program should run under linux.
That's it!
Use whatever language, approach, library you favor! Submit multiple answers if they are significantly different!
I'll accept the most voted answer on thursday around 16h Eastern Canada. Do vote for any answer you find interesting, even if they come in late!
Just like last week, diversity will be very appreciated!
Can't wait to see you code ;)
CONCLUSION: Thanks to all who participated with their answers or discussions! It's a pleasure to see all the different approaches and learn new tricks :)
Cheers
Dear whoever's downvoting without saying why -- it'd be much more constructive to find out why you didn't like a particular solution
Yes keep deleting comments that are addressed or obsolete.
An example (and canonical) input file would probably help with this one.
An example would indeed be appreciated. I'm not 100% sure what we are supposed to do this time.
ok, on the way :)
I think it would be great to attach a bounty to these questions - a little more incentive to work hard for the best answer!
Can we specify a common big fasta source? (This, perhaps?)
That's a nice idea :)
Can we send the code on http://gist.github.com/ ?
I suggest this as the "potentially huge fasta file"
Not sure to follow you @Pierre. You mean INSTEAD of here?
@Daniel Swan, I think the question has to be unanswered for a few days before you put a bounty...
@eric, I first glance I thought the code could be large. I was wrong, please ignore my comment about gist.
@eric I also removed dbsnp as the test file
@Pierre. Please edit the question to put your suggested test file :)
done. @Eric so ''stdin'' shouldn't be handled isn't it ? :-)
Note: in the fasta id, should the length parameter be adjusted? As here:
or should the id be unchanged? I would opt for the latter.
nope, that was only for example purpose, I'll delete it...
Just a random thought: Do you people mind if I delete some comments as issues get cleared and the comments become less usefull just to keep the list readable?
out of curiosity, do you have Matlab ands its associated bioinformatics toolbox?
@Will negative for Matlab. I haven't toyed with it since my 'engineer' in training days, about 10 years ago :) It seems to me that with so many viable open and free solutions out there, Matlab is not a first choice for many people doing research (thinking mainly of the grad students here). What is there that makes Matlab an interesting tool in you opinion?
Just out of interest, what are you using the n-mers in the output file for?
@CassJ Just one example. I could have long contigs (10 t0 50 kpb) representing assembly of 454 data sequenced from a 100+ kbp insert in which I know that there is a gene of interest, maybe in multible copies, but also possibly other genes. I can use 500 pb sub-sequences and blast them against a db (nr, swissprot, a custom DNA db) and try to identify the position of that gene.
Code golf: it's learning, and advocacy, and showing off all rolled into one! What programmer wouldn't love that?
Last week you suggested discussing problems on a google group. I went to google groups and searched for "project mendel" and saw the existing thread on the biostar-central group. I thought maybe it wouldn't be neighborly (to the people participating in the current thread) to start a separate thread elsewhere, but on the other hand I don't know if biostar-central really wants a big question thread. Of course, that's assuming that we'd get enough people to make a big thread!
Plus, I've been thinking about my own work and trying to imagine good questions, but so far I haven't thought of many problems that are the right size. Maybe one so far.
how to install digest_fasta in linux ubuntu?? Please kindly suggest me.. thanking you