I would like to randomly sample with replacement from bam files. One way to do that would be to convert the bam files to sam format and use python's np.random.choice. Is there a more direct way to do this that doesn't require bam to sam conversion? Something like samtools view -s but with replacement?
I am exploring my data and wanted to measure the shot noise or variance in counts (number of reads per genomic region) upon resampling with replacement. As expected, it followed a Poisson distribution, which was nice to see. I can now add this variance to the biological variance for downstream analyses.
By sampling with replacement, I mean that one could draw the same read more than once by random chance, as opposed to sampling without replacement, where a read can only be drawn once from a population, which is what I think samtools view -s does.
For example, if I wanted to randomly sample 10 numbers between 1-10, without replacement I would get 1,2,3,4,5,6,7,8,9,10 whereas with replacement I might get, 1,1,2,4,4,5,7,7,8,9.
Thanks, Devon! I was able to build on your code with minor modifications. Using pysam was a good idea. I just had to remember to keep my bam index files in the same directory!
Assuming that you can't find a tool that works natively with BAM without converting to SAM, I have a tool called sample that uses reservoir sampling to pull an ordered or randomly-shuffled uniformly random sample from an input text file delimited by newline characters. It can sample with or without replacement.
For fun, here's another generic approach that uses seq, shuf, and sed to sample a text file with replacement:
$ LINES=`wc -l reads.sam | awk '{print $1}'`
$ N=1234
$ for LINE in `seq ${LINES} | shuf -r -n ${N}`; do sed "${LINE}q;d" reads.sam >> sample.sam; done
The seq command generates a list of line numbers, shuf -r samples N of those line numbers with replacement, and sed prints out the line, which is appended to the output.
Jumping around the input file like this might be kinda slow. To get better performance, maybe pipe the list of line numbers to sort -n and then sed on that, to read out an in-order sample:
$ for LINE in `seq ${LINES} | shuf -r -n ${N} | sort -n`; do sed "${LINE}q;d" reads.sam >> sample.sam; done
You got me curious, why do you want to sample with replacement?
I am exploring my data and wanted to measure the shot noise or variance in counts (number of reads per genomic region) upon resampling with replacement. As expected, it followed a Poisson distribution, which was nice to see. I can now add this variance to the biological variance for downstream analyses.
Could you explain what do you mean by
replacement
?By sampling with replacement, I mean that one could draw the same read more than once by random chance, as opposed to sampling without replacement, where a read can only be drawn once from a population, which is what I think samtools view -s does.
For example, if I wanted to randomly sample 10 numbers between 1-10, without replacement I would get 1,2,3,4,5,6,7,8,9,10 whereas with replacement I might get, 1,1,2,4,4,5,7,7,8,9.
Something like this