How to access BAM files directly via Hadoop?
12.2 years ago
jtal04

Has anyone used the 1000 genomes public data set available on Amazon s3?

Or rather, I should ask: has anyone used the BAM files directly via an AWS service such as Elastic MapReduce?

I can download the files to EBS, unpack them, and re-upload them to S3, but that is more expensive (and more work) than using the public/free copy.

Thank you for any insight, Justin

Edit: I am currently looking into Hadoop-BAM: http://sourceforge.net/projects/hadoop-bam/

bam 1000genomes
12.2 years ago

You already answered your own question - you could use Hadoop-BAM. You might also want to check out SeqPig, which lets you perform Pig queries against your BAM files (and other things).

12.2 years ago
JC

Did you read and try the tutorial? http://www.1000genomes.org/using-1000-genomes-data-amazon-web-service-cloud


Yes. It says to access the data the same generic way you access any S3 data. Next it describes how to start an EC2 instance from their AMI, and the rest is a tutorial that is not specific to AWS. I guess the title of my question is misleading: my real problem is using BAM files in Elastic MapReduce, which doesn't know how to split records in a BAM file.
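For what it's worth, the splitting problem is exactly what Hadoop-BAM addresses: a BAM file is a series of BGZF blocks (gzip members whose extra field carries a "BC" subfield holding the compressed block size, per the SAM/BAM spec), so a BGZF-aware input format can scan for block boundaries and hand each mapper a byte range that starts on a whole block. Here is a minimal, self-contained sketch of that idea in Python; the helper names are mine, not Hadoop-BAM's actual API:

```python
# Sketch: why BAM can be split at BGZF block boundaries. Each BGZF block is
# a gzip member whose FEXTRA field contains a "BC" subfield with the block
# size, so a splitter can scan for the magic bytes and validate the subfield.
import struct
import zlib

BGZF_MAGIC = b"\x1f\x8b\x08\x04"  # gzip magic + CM=8 + FLG.FEXTRA set

def bgzf_block(payload: bytes) -> bytes:
    """Build one BGZF block around `payload` (raw deflate + BC subfield)."""
    co = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no headers
    cdata = co.compress(payload) + co.flush()
    # total block size = 12-byte header + 6-byte extra + data + CRC32/ISIZE
    bsize = 12 + 6 + len(cdata) + 8
    header = BGZF_MAGIC + struct.pack("<IBBH", 0, 0, 255, 6)  # mtime, XFL, OS, XLEN
    extra = b"BC" + struct.pack("<HH", 2, bsize - 1)          # SLEN=2, BSIZE-1
    footer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + extra + cdata + footer

def block_offsets(data: bytes):
    """Yield byte offsets of BGZF block starts (candidate split points)."""
    i = data.find(BGZF_MAGIC)
    while i != -1:
        # confirm the "BC" extra subfield right after the 12-byte header
        if data[i + 12:i + 14] == b"BC":
            yield i
        i = data.find(BGZF_MAGIC, i + 1)

stream = bgzf_block(b"record one") + bgzf_block(b"record two")
# two split points: offset 0 and the length of the first block
print(list(block_offsets(stream)))
```

Hadoop-BAM does the real version of this inside its input format, then aligns each mapper's start to the first complete alignment record within its block range, so you never have to decompress the whole file on one node.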


Oh, I got it. Please edit your post; maybe someone here knows the answer.
