10.0 years ago
alex.rubinsteyn
I'm interested in analyzing large FASTA files (like the human genome and proteome) in parallel using Spark or pydoop. Is there a library which implements FASTA parsing as a Hadoop InputFormat?
Have you looked at the "Hadoop FASTA reader" gist at gist.github.com/jflatow/45551?
This looks like it would work well for a FASTA file with many small records, since it seeks locally on each worker. For a FASTA file with large contigs (like the genome), however, it wouldn't perform very well.
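To make the trade-off concrete, here is a minimal Python sketch of the general split-aware reading idea. It is not the code from the gist and not a real Hadoop InputFormat; the function name `read_fasta_split` and the `(start, end)` byte-range convention are assumptions made for illustration. Each worker seeks to the start of its byte range, skips ahead to the next line boundary, and takes ownership of every record whose `>` header line begins inside its range.

```python
import io


def read_fasta_split(stream, start, end):
    """Yield (header, sequence) for records whose '>' line starts in [start, end).

    Hypothetical helper: `stream` is a seekable binary file-like object and
    `start`/`end` are byte offsets, loosely mimicking a Hadoop input split.
    """
    if start > 0:
        # Step back one byte and discard through the next newline, so that
        # reading resumes at the first line beginning at or after `start`.
        stream.seek(start - 1)
        stream.readline()
    else:
        stream.seek(0)

    header, chunks = None, []
    while True:
        pos = stream.tell()
        if header is None and pos >= end:
            break  # no record in progress and we are past our range
        line = stream.readline()
        if not line:
            break  # end of file
        if line.startswith(b">"):
            if header is not None:
                yield header, b"".join(chunks).decode("ascii")
                header = None
            if pos >= end:
                break  # this header belongs to the next split
            header, chunks = line[1:].strip().decode("ascii"), []
        elif header is not None:
            chunks.append(line.strip())
        # Lines seen before our first header belong to the previous split.

    if header is not None:
        yield header, b"".join(chunks).decode("ascii")


if __name__ == "__main__":
    data = b">a\nACGT\nAC\n>b\nGG\n>c\nTTTT\nTT\n"
    for lo in range(0, len(data), 10):
        hi = min(lo + 10, len(data))
        print((lo, hi), list(read_fasta_split(io.BytesIO(data), lo, hi)))
```

Note that in this sketch a record is always read in full by the worker whose range contains its header, even when the sequence runs far past `end`. With many small records that overrun is negligible, but with a chromosome-sized contig a single worker ends up reading the whole contig while workers assigned the later byte ranges find no header and emit nothing, which is one way to see why this approach parallelizes poorly for genome-scale FASTA files.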