Sampling with replacement from a bam file
3
0
Entering edit mode
6.6 years ago
harleen.sn ▴ 10

Hi,

I would like to randomly sample with replacement from bam files. One way to do that would be to convert the bam files to sam format and use python's np.random.choice. Is there a more direct way to do this that doesn't require bam to sam conversion? Something like samtools view -s but with replacement?

Thank you for your help!

bam sample with replacement • 2.9k views
ADD COMMENT
2
Entering edit mode

You got me curious, why do you want to sample with replacement?

ADD REPLY
0
Entering edit mode

I am exploring my data and wanted to measure the shot noise or variance in counts (number of reads per genomic region) upon resampling with replacement. As expected, it followed a Poisson distribution, which was nice to see. I can now add this variance to the biological variance for downstream analyses.

ADD REPLY
0
Entering edit mode

Could you explain what do you mean by replacement?

ADD REPLY
1
Entering edit mode

By sampling with replacement, I mean that one could draw the same read more than once by random chance, as opposed to sampling without replacement, where a read can only be drawn once from a population, which is what I think samtools view -s does.

For example, if I wanted to randomly sample 10 numbers between 1-10, without replacement I would get 1,2,3,4,5,6,7,8,9,10 whereas with replacement I might get, 1,1,2,4,4,5,7,7,8,9.

ADD REPLY
0
Entering edit mode

Something like this

ADD REPLY
4
Entering edit mode
6.6 years ago

In python:

import pysam
import numpy as np

bam = pysam.AlignmentFile("something.bam")
total = bam.mapped
selected = sorted(np.random.choice(total, size=1000))  # sample 1000 reads, with replacement
idxSelected = 0
idxBam = 0
for read in bam:
    if read.is_unmapped:
        continue
    if idxSelected > len(selected):
        break
    while idxBam = selected[idxSelected]:
        ... do something ...
        idxSelected += 1
    idxBam += 1
ADD COMMENT
0
Entering edit mode

Thanks, Devon! I was able to build on your code with minor modifications. Using pysam was a good idea. I just had to remember to keep my bam index files in the same directory!

ADD REPLY
3
Entering edit mode
6.6 years ago

Assuming that you can't find a tool that works natively with BAM without converting to SAM, I have a tool called sample that uses reservoir sampling to pull an ordered or randomly-shuffled uniformly random sample from an input text file delimited by newline characters. It can sample with or without replacement.

$ ./sample --help
sample
  version: 1.0.2
  author:  Alex Reynolds

Usage: sample [--sample-size=n] [--lines-per-offset=n] [--sample-without-replacement | --sample-with-replacement] [--shuffle | --preserve-order] [--hybrid | --mmap | --cstdio] [--rng-seed=n] <newline-delimited-file>

...
  --sample-with-replacement     | -r      Sample with replacement (optional)
...

You could run sample on a SAM file to sample with replacement from that file.

ADD COMMENT
1
Entering edit mode
6.6 years ago

For fun, here's another generic approach that uses seq, shuf, and sed to sample a text file with replacement:

$ LINES=`wc -l reads.sam | awk '{print $1}'`
$ N=1234
$ for LINE in `seq ${LINES} | shuf -r -n ${N}`; do sed "${LINE}q;d" reads.sam >> sample.sam; done

The seq command generates a list of line numbers, shuf -r samples N of those line numbers with replacement, and sed prints out the line, which is appended to the output.

Jumping around the input file like this might be kinda slow. To get better performance, maybe pipe the list of line numbers to sort -n and then sed on that, to read out an in-order sample:

$ for LINE in `seq ${LINES} | shuf -r -n ${N} | sort -n`; do sed "${LINE}q;d" reads.sam >> sample.sam; done
ADD COMMENT

Login before adding your answer.

Traffic: 2350 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6