Question

Sampling with replacement from a bam file

0

Entering edit mode

6.9 years ago

harleen.sn ▴ 10

Hi,

I would like to randomly sample with replacement from bam files. One way to do that would be to convert the bam files to sam format and use python's np.random.choice. Is there a more direct way to do this that doesn't require bam to sam conversion? Something like samtools view -s but with replacement?

Thank you for your help!

bam sample with replacement • 3.2k views

ADD COMMENT • link updated 6.9 years ago by Alex Reynolds 36k • written 6.9 years ago by harleen.sn ▴ 10

2

Entering edit mode

You got me curious, why do you want to sample with replacement?

ADD REPLY • link 6.9 years ago by h.mon 35k

0

Entering edit mode

I am exploring my data and wanted to measure the shot noise or variance in counts (number of reads per genomic region) upon resampling with replacement. As expected, it followed a Poisson distribution, which was nice to see. I can now add this variance to the biological variance for downstream analyses.

ADD REPLY • link 6.8 years ago by harleen.sn ▴ 10

0

Entering edit mode

Could you explain what do you mean by replacement?

ADD REPLY • link 6.9 years ago by venu 7.1k

1

Entering edit mode

By sampling with replacement, I mean that one could draw the same read more than once by random chance, as opposed to sampling without replacement, where a read can only be drawn once from a population, which is what I think samtools view -s does.

For example, if I wanted to randomly sample 10 numbers between 1-10, without replacement I would get 1,2,3,4,5,6,7,8,9,10 whereas with replacement I might get, 1,1,2,4,4,5,7,7,8,9.

ADD REPLY • link 6.9 years ago by harleen.sn ▴ 10

0

Entering edit mode

Something like this

ADD REPLY • link 6.9 years ago by lakhujanivijay 5.9k

score 4 · Answer 1 · 2018-06-08

4

Entering edit mode

6.9 years ago

Devon Ryan 105k

In python:

import pysam
import numpy as np

bam = pysam.AlignmentFile("something.bam")
total = bam.mapped
selected = sorted(np.random.choice(total, size=1000))  # sample 1000 reads, with replacement
idxSelected = 0
idxBam = 0
for read in bam:
    if read.is_unmapped:
        continue
    if idxSelected > len(selected):
        break
    while idxBam = selected[idxSelected]:
        ... do something ...
        idxSelected += 1
    idxBam += 1

ADD COMMENT • link 6.9 years ago by Devon Ryan 105k

0

Entering edit mode

Thanks, Devon! I was able to build on your code with minor modifications. Using pysam was a good idea. I just had to remember to keep my bam index files in the same directory!

ADD REPLY • link 6.9 years ago by harleen.sn ▴ 10

score 3 · Answer 2 · 2018-06-08

Assuming that you can't find a tool that works natively with BAM without converting to SAM, I have a tool called sample that uses reservoir sampling to pull an ordered or randomly-shuffled uniformly random sample from an input text file delimited by newline characters. It can sample with or without replacement.

$ ./sample --help
sample
  version: 1.0.2
  author:  Alex Reynolds

Usage: sample [--sample-size=n] [--lines-per-offset=n] [--sample-without-replacement | --sample-with-replacement] [--shuffle | --preserve-order] [--hybrid | --mmap | --cstdio] [--rng-seed=n] <newline-delimited-file>

...
  --sample-with-replacement     | -r      Sample with replacement (optional)
...

You could run sample on a SAM file to sample with replacement from that file.

score 1 · Answer 3 · 2018-06-08

For fun, here's another generic approach that uses seq, shuf, and sed to sample a text file with replacement:

$ LINES=`wc -l reads.sam | awk '{print $1}'`
$ N=1234
$ for LINE in `seq ${LINES} | shuf -r -n ${N}`; do sed "${LINE}q;d" reads.sam >> sample.sam; done

The seq command generates a list of line numbers, shuf -r samples N of those line numbers with replacement, and sed prints out the line, which is appended to the output.

Jumping around the input file like this might be kinda slow. To get better performance, maybe pipe the list of line numbers to sort -n and then sed on that, to read out an in-order sample:

$ for LINE in `seq ${LINES} | shuf -r -n ${N} | sort -n`; do sed "${LINE}q;d" reads.sam >> sample.sam; done