Hadoop InputFormat for FASTA files?

10.0 years ago

alex.rubinsteyn ▴ 190

I'm interested in analyzing large FASTA files (like the human genome and proteome) in parallel using Spark or pydoop. Is there a library which implements FASTA parsing as a Hadoop InputFormat?

hadoop fasta • 2.7k views

updated 2.8 years ago by Ram 44k • written 10.0 years ago by alex.rubinsteyn ▴ 190

Have you looked at the "Hadoop FASTA reader" gist at gist.github.com/jflatow/45551?

written 10.0 years ago by Pierre Lindenbaum 164k

This looks like it works well for a FASTA file with many small records (since it seeks locally on each worker). However, for a FASTA file with large contigs (like the genome) this wouldn't perform very well.
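To make the concern concrete, here is a minimal sketch of split-aware FASTA reading. It is not the gist's code, and it uses plain Python file I/O as a stand-in for a Hadoop RecordReader. The function name `read_fasta_split` and the rule "a record belongs to the split that contains its `>` line" are assumptions made for illustration only.

```python
from typing import Iterator, Tuple


def read_fasta_split(path: str, start: int, end: int) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) for records whose '>' line begins in [start, end).

    Hypothetical helper: mimics what a record reader could do for one
    byte-range split of a FASTA file.
    """
    with open(path, "rb") as f:
        if start > 0:
            # Back up one byte and discard through the next newline, so we
            # land on the first line boundary at or after `start`.
            f.seek(start - 1)
            f.readline()
        header = None
        chunks = []
        while True:
            pos = f.tell()
            line = f.readline()
            if not line or line.startswith(b">"):
                # End of file or a new header closes the current record.
                if header is not None:
                    yield header, b"".join(chunks).decode("ascii")
                if not line or pos >= end:
                    # Headers at or past `end` belong to the next split.
                    return
                header = line[1:].strip().decode("ascii")
                chunks = []
            elif header is not None:
                # Sequence line of a record owned by this split; keep reading
                # even past `end` until the record is complete.
                chunks.append(line.strip())
            # Lines seen before the first header belong to the previous split.
```

Under this ownership rule, a record is read in full by whichever reader owns its header line, even when the sequence runs far past `end`. With many small records that overrun is negligible. With a chromosome-sized contig, one reader ends up scanning the entire contig while the readers for the splits it spans find no header and emit nothing, which is presumably why the per-record approach would not parallelize well on a genome.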

updated 2.8 years ago by Ram 44k • written 10.0 years ago by alex.rubinsteyn ▴ 190
