Tool for random access to indexed BAM files in S3?
9.4 years ago
donfreed ★ 1.6k

I have access to some indexed BAM files in S3. Using the AWS CLI, it is fairly easy to download an entire BAM file. That was fine for our initial analysis, but for validation we are interested in looking at a small region of ~10 kb in thousands of individuals.

The BAM files are indexed and S3 supports GET requests with range headers so this should be possible. Does anyone know of a tool that does this?
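To make the range-header idea concrete, here is a minimal sketch using curl. The `byte_range` helper is a made-up name, the offsets are arbitrary, and the URL is the public 1000 Genomes object; a real tool would take the offsets from the .bai index rather than hard-coding them.

```shell
# Hypothetical helper: format an inclusive byte range as an HTTP Range header value.
byte_range() {
    printf 'bytes=%d-%d' "$1" "$2"
}

# Fetch only the first 64 KiB of the object.
# Needs network access, so it is left commented out here.
# curl -s -H "Range: $(byte_range 0 65535)" -o first_block.bin \
#     "http://s3.amazonaws.com/1000genomes/phase1/data/NA12878/exome_alignment/NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam"

# Show the header value that would be sent.
byte_range 0 65535; echo
```

This only demonstrates the HTTP mechanism; it does not decode BGZF blocks or consult the index.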

EDIT: htslib 1.3 was recently released and supports random access to BAM files in S3.

BAM cloud • 12k views
ADD COMMENT
0
Entering edit mode

I have a private AWS bucket, along with an aws_access_key_id, an aws_secret_access_key, and a region.

How do I set up samtools 1.3 to work on BAM files in AWS S3 storage?

Following the instructions above, I can make this command work:

$ samtools view s3://1000genomes/phase1/data/NA12878/exome_alignment/NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam 20:1000-100000 # works

but I cannot make samtools work on my private bucket.

Any suggestions?

ADD REPLY
9
Entering edit mode
9.4 years ago

The upcoming samtools 1.3 release will support this. As well as the current way of accessing public buckets in donfreed's comment on another answer, samtools 1.3 will understand s3: pseudo-URLs like

s3://1000genomes/phase1/data/NA12878/exome_alignment/NA12878.blah.bam

For accessing private buckets, samtools will look for your AWS credentials in the usual configuration files and environment variables, or you can specify them on the command line as s3://id:secret@bucket/... though that's not particularly recommended.

This release will be fairly soon. In the meantime, you can try this out by building samtools with GitHub htslib's libcurl branch. The code in that branch only looks for credentials in $AWS_ACCESS_KEY_ID / $AWS_SECRET_ACCESS_KEY and as id:secret in the URL.
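As a rough sketch of the environment-variable route for a private bucket: the bucket, key, and credential values below are placeholders, and `have_aws_env` is a made-up helper name. The samtools command is only printed, not run.

```shell
# Hypothetical helper: succeed only if both credential variables are set and non-empty.
have_aws_env() {
    [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]
}

# Placeholder values; substitute your own credentials.
export AWS_ACCESS_KEY_ID="REPLACE_ME"
export AWS_SECRET_ACCESS_KEY="REPLACE_ME"

if have_aws_env; then
    # Prints the command instead of running it; remove "echo" to execute.
    # The bucket and key are made up for illustration.
    echo samtools view "s3://my-private-bucket/path/to/sample.bam" "20:1000-100000"
else
    echo "AWS credentials are not set in the environment" >&2
fi
```

Checking the variables first helps separate "credentials never reached samtools" from a genuine access problem on the bucket.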

ADD COMMENT
0
Entering edit mode

Thank you. This worked with only slight modification.

ADD REPLY
0
Entering edit mode

Could you please describe what you did? It would be very helpful.

ADD REPLY
2
Entering edit mode

No problem.

I am not sure how the official htslib implementation is coming along, but these steps work for me.

  1. Start and ssh into an EC2 instance.
  2. Install some software and libraries:

    sudo yum install gcc autoconf git zlib-devel libcurl-devel openssl-devel ncurses-devel
    
  3. Clone my fork of htslib's libcurl branch.

    git clone https://github.com/DonFreed/htslib.git -b libcurl
    
  4. Build and install htslib.

    cd htslib/
    autoconf
    ./configure --enable-libcurl
    sudo make install
    
  5. Clone and install samtools.

    cd ..
    git clone https://github.com/samtools/samtools.git -b 1.2
    cd samtools
    sudo make install LDLIBS+=-lcurl LDLIBS+=-lcrypto
    
  6. Configure AWS.

    aws configure
    AWS Access Key ID [None]: ******
    AWS Secret Access Key [None]:  ********
    Default region name [None]: us-east-1
    Default output format [None]:
    
  7. Set environment.

    export AWS_ACCESS_KEY_ID=*****
    export AWS_SECRET_ACCESS_KEY=*****
    
  8. Test.

    samtools                    # works
    aws s3 ls s3://1000genomes/ # works
    samtools view s3://1000genomes/phase1/data/NA12878/exome_alignment/NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam 20:1000-100000 # works
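For the original use case (one small region across many individuals), the test command can be wrapped in a loop. This is a sketch under assumptions: `region_cmds` is a made-up helper, the output naming is arbitrary, and the commands are printed rather than executed.

```shell
# Hypothetical helper: read BAM URLs from stdin (one per line) and print one
# "samtools view" command per URL. $1 is the region to extract.
region_cmds() {
    region=$1
    while IFS= read -r url; do
        [ -n "$url" ] || continue            # skip blank lines
        name=$(basename "$url" .bam)         # output name derived from the BAM name
        printf 'samtools view -b -o %s.region.bam %s %s\n' "$name" "$url" "$region"
    done
}

# Dry run on the public example from step 8.
printf '%s\n' \
    "s3://1000genomes/phase1/data/NA12878/exome_alignment/NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam" \
    | region_cmds "20:1000-100000"
```

With a file of URLs, `region_cmds 20:1000-100000 < bam_urls.txt | sh` would run the printed commands, assuming the URLs contain no characters that need shell quoting.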
    
ADD REPLY
1
Entering edit mode

Just tried this and it worked! Thanks!

Only problem I had was:

git clone https://github.com/samtools/samtools.git -b 1.2

For some unknown reason, I had to check out the 1.2 branch manually inside the directory.

ADD REPLY
1
Entering edit mode

Just noting here that if you have a bucket name with a "." (period) in it, samtools view will not work. Hope that saves you the hours it took me to figure it out!

ADD REPLY
0
Entering edit mode

This is true, but %2E is the URL percent-encoding of a period, and substituting %2E for each . (period) in the bucket name in your URL does seem to work.
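A small sketch of that substitution, in case it saves some typing. The helper name `encode_dots`, the bucket name, and the object key are all made up, and whether the encoded URL is accepted may depend on the samtools/htslib version.

```shell
# Hypothetical helper: replace every period in a bucket name with %2E.
encode_dots() {
    printf '%s\n' "$1" | sed 's/\./%2E/g'
}

bucket="my.bucket.name"          # made-up bucket name containing periods
echo "s3://$(encode_dots "$bucket")/path/to/sample.bam"
```

Only the bucket name is encoded here; periods in the object key (such as in ".bam") are left alone.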

ADD REPLY
0
Entering edit mode

@DonFreed, I saw you had support for using session tokens, is this only available in htsfile or is there a way to build samtools to include using AWS_SESSION_TOKEN? I work on the National Database for Autism Research (NDAR) project and I saw you referenced our use of temporary federated tokens to control access to s3.

I would be keen to show users this functionality baked into samtools; currently this can be done through the use of a proxy (https://github.com/obenshaindw/s3proxy) and writing the s3 urls in an http scheme that makes requests against the proxy.

ADD REPLY
1
Entering edit mode

Hi @david.obenshain, steps 1-5 above will build samtools with support for temporary session tokens. In addition to those steps, you need to set the AWS_SESSION_TOKEN environment variable.

Here's a detailed example of accessing NDAR on AWS EC2.

1. Perform steps 1-5 above.

$ sudo yum install gcc autoconf git zlib-devel libcurl-devel openssl-devel ncurses-devel
$ git clone https://github.com/DonFreed/htslib.git -b libcurl
$ cd htslib
$ autoconf
$ ./configure --enable-libcurl
$ sudo make install
$ cd ..
$ git clone https://github.com/samtools/samtools.git -b 1.2
$ cd samtools
$ sudo make install LDLIBS+=-lcurl LDLIBS+=-lcrypto

2. Get access keys from NDAR. See the NDAR cloud_page to download downloadmanager.jar. More information can be found from the ndar_tutorials.

$ cd ~/
$ unzip downloadmanager.zip
$ java -jar downloadmanager.jar -u $ndar_user_id -p $ndar_pswd -g aws_keys.txt
$ cat aws_keys.txt
accessKey=*****
secretKey=*****
sessionToken=*****...*****

3. Configure AWS. Add the token to the credential file.

$ aws configure
AWS Access Key ID [None]: ******
AWS Secret Access Key [None]:  ********
Default region name [None]: us-east-1
Default output format [None]:
$ echo "aws_session_token = *****...*****" >> .aws/credentials

4. Set environment variables.

$ export AWS_ACCESS_KEY_ID=*****
$ export AWS_SECRET_ACCESS_KEY=*****
$ export AWS_SESSION_TOKEN=******

5. Test.

$ samtools   # works
$ aws s3 ls s3://NDAR_Central_4/submission_10215/complete/11000/complete_bams/11000.fa.realigned.recal.bam     # works
$ samtools view s3://NDAR_Central_4/submission_10215/complete/11000/complete_bams/11000.fa.realigned.recal.bam 10:1000000-1010000 # works
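Steps 3 and 4 can also be scripted rather than copied by hand. The sketch below assumes aws_keys.txt contains exactly the accessKey / secretKey / sessionToken lines shown in step 2; `load_ndar_keys` is a made-up helper name and the demo values are fake.

```shell
# Hypothetical helper: export AWS_* variables from a file of key=value lines.
load_ndar_keys() {
    while IFS= read -r line || [ -n "$line" ]; do
        name=${line%%=*}     # text before the first "="
        value=${line#*=}     # text after the first "=" (may itself contain "=")
        case $name in
            accessKey)    export AWS_ACCESS_KEY_ID="$value" ;;
            secretKey)    export AWS_SECRET_ACCESS_KEY="$value" ;;
            sessionToken) export AWS_SESSION_TOKEN="$value" ;;
        esac
    done < "$1"
}

# Demo with a throwaway file and fake values.
keys=$(mktemp)
printf 'accessKey=AKIDEXAMPLE\nsecretKey=SECRETEXAMPLE\nsessionToken=TOKEN==\n' > "$keys"
load_ndar_keys "$keys"
rm -f "$keys"
echo "$AWS_SESSION_TOKEN"
```

The value is split on the first "=" only, so a token that ends in "=" padding is kept intact. In real use you would call `load_ndar_keys ~/aws_keys.txt` in the same shell that runs samtools.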
ADD REPLY
1
Entering edit mode

Hi Don,

Thank you, this does work. I was also able to do the following to have bcftools built to work with your libcurl branch. Completing the steps above first...

$ cd ~
$ git clone https://github.com/samtools/bcftools -b 1.2
$ cd bcftools
$ git checkout -b 1.2
$ sudo make install LDLIBS+=-lcurl LDLIBS+=-lcrypto
$ bcftools view s3://NDAR_LOCATION_FOR_VCF.vcf # Works

I would like to put a few gists on our GitHub account related to this, if you don't mind. You might also want to check out https://github.com/NDAR/nda_aws_token_generator to skip step 2. When support for the security token is added to the official samtools/htslib, we can update the gists accordingly.

I was really excited to stumble on this, and had the coincidence of giving a small demo of this functionality to Dr. Pevsner yesterday.

David

ADD REPLY
1
Entering edit mode

See also PR #303 which expands on Don's branch to also read credentials (including session tokens) from the usual configuration files (mostly ~/.aws/credentials). This will be landing in htslib shortly, but it would be good if some of you S3 users could do some testing first.

ADD REPLY
0
Entering edit mode

Unfortunately, this does not seem to work for Requester Pays buckets. It works for the other examples listed above.

ADD REPLY
1
Entering edit mode

Requester Pays buckets need an extra x-amz-request-payer: requester header that samtools does not currently set, so this will indeed not work.

Clearly it would not be appropriate for htslib/samtools to set it all the time (as it represents explicit acknowledgement from the user that they will be charged). So we could set it if some flag was present in the URL or perhaps via an extra config file key on the profile used. @donfreed or anyone else: are you aware of any existing practice in this area?

ADD REPLY
1
Entering edit mode

This is now HTSlib issue #346; hopefully we'll come up with a way to say "yes, charge me!" in time for the 1.4 release.

ADD REPLY
0
Entering edit mode

There is a workaround: you can use pre-signed URLs as described in the GitHub issue. However, we recognise the limitations of this workaround and will look into alternatives.

ADD REPLY
1
Entering edit mode
2.6 years ago
Mark ▴ 10

If you would like to use HTSJDK, I have recently open-sourced a Java NIO SPI for S3 that supports random reads of S3 objects. If the library's jar file is on your classpath, HTSJDK will automatically use it when an s3:// URI is detected.

https://github.com/awslabs/aws-java-nio-spi-for-s3

Disclosure: I am the author of the library, work for AWS Health AI genomics and have a financial interest in Amazon

ADD COMMENT
0
Entering edit mode
9.4 years ago
h.mon 35k

SAMtools view can download only chunks of a BAM file over the internet, using ftp or http. I am not sure if it can download over encrypted connections.

ADD COMMENT
2
Entering edit mode

Samtools view will work for 'public' buckets, but not for private buckets. For example, the command below will work on both my local machine and EC2, but a similar command will not work for the private buckets I am accessing.

samtools view http://s3.amazonaws.com/1000genomes/phase1/data/NA12878/exome_alignment/NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam 1:1000000-1001000
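If you have s3:// paths for a public bucket and want the http form used in the command above, the rewrite is mechanical. A small sketch; `s3_to_http` is a made-up helper name, and it does not check whether the bucket actually allows anonymous access.

```shell
# Hypothetical helper: rewrite s3://bucket/key as http://s3.amazonaws.com/bucket/key.
s3_to_http() {
    case $1 in
        s3://*) printf 'http://s3.amazonaws.com/%s\n' "${1#s3://}" ;;
        *)      printf '%s\n' "$1" ;;   # leave other URLs untouched
    esac
}

s3_to_http "s3://1000genomes/phase1/data/NA12878/exome_alignment/NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam"
```

The printed URL can then be passed to samtools view together with a region, as in the command above.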
ADD REPLY
1
Entering edit mode

Perhaps one could modify the SAMtools source code to add support for https and to submit the required XML payload to S3 in order to retrieve chunks of data.

ADD REPLY
