Question

Understanding subsampling with Samtools

1

Entering edit mode

6.3 years ago

m98 ▴ 420

I would like to subsample many bam files. Ideally, I would like to subsample them to an exact number of reads (like seqtk does for fasta/fastq files) but this doesn't seem as straight forward for BAM files.

I am aiming to use samtools view -c but I have a few questions:

I am not very familiar with the concept of seeds (i.e. the 'INT' part in 'INT.FRAC' that samtools requires), or more specifically how to chose a seed? I have about 100 BAM files to subsample and I am only aiming on subsampling them once. So do I chose 0.FRAC for each samples? Or does it have to be a different seed number for each file? I don't really understand how the seeding works.
I have worked out a specific number of reads I would like to keep for each file and ideally. This number is essentially the number of reads in the smallest BAM file out of the 100.

As an example, let's say the smallest BAM file has exactly 22388955 reads. The first BAM I need to scale has 33285268 reads. I first did this command:

#!/usr/bin/bash

# 22388955 / 33285268 = 0.6726386 (according to R)

samtools view -s 0.67 -b data/in.bam > out.scaled.bam
samtools view -c -F 260 out.scaled.bam
# Output: 22295034 (22388955-22295034 = 93921 --> off by 93921 reads)

# Same as above but this time with 8 decimal places in the FRAC part of -s
samtools view -s 0.6726386 -b data/in.bam > out.scaled.bam
samtools view -c -F 260 out.scaled.bam
# Output: 22384016 (22388955-22384016 = 4939 --> off by 4939 reads)

# Same as above but this with with even more decimal places in the FRAC part of -s
samtools view -s 0.672638567909383 -b data/in.bam > out.scaled.bam
samtools view -c -F 260 out.scaled.bam
# Output: 22384008 (22388955 - 22384008 = 4947 --> off by 4947 reads)

# Going back to 8 decimal places but using seed 1 instead of seed 0
samtools view -s 1.6726386 -b data/in.bam > out.scaled.bam
samtools view -c -F 260 out.scaled.bam
# Output: 22384288 (22388955 - 22384288 = 4667 --> off by 4667 reads)

# Going back to 8 decimal places but using seed 2 instead of seed 1
samtools view -s 1.6726386 -b -b data/in.bam > out.scaled.bam
samtools view -c -F 260 out.scaled.bam
# Output: 22385420 (22388955-22385420 = 3535 --> off by 3535 reads)

So from I can see:

The more decimal numbers you have in your FRAC, the more accurate your scaling will be: the number of reads in the out.scaled.bam will get closer to the exact number I am looking for.
However, this only seems to work until a certain point - having 15 decimal numbers doesn't improve much compared to 8 decimal places.
Using a different seed can somehow give you different read numbers despite the FRAC being the same

I guess my questions are:

Is this a bug? Am I missing something?
Is there any way I can end up with the exact number of reads I want: 22388955
Am I missunderstanding something about the way that seeding works?

Thanks.

UPDATE

The reason I want to subsample my BAM files is that I want to convert these files to bigwig format and display them as tracks in UCSC genome browser. I suspect having exactly the same read counts may not matter that much in this context but I would still like to understand why I can't get the numbers I want.

RNA-Seq subsampling samtools seed reads • 8.1k views

ADD COMMENT • link updated 3.3 years ago by boczniak767 ▴ 870 • written 6.3 years ago by m98 ▴ 420

2

Entering edit mode

Not answering your question but if you need 22388955 reads then with reformat.sh from BBMap suite you say samplereadstarget=22388955.

Sampling parameters:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.
                        Important: srt/sbt flags should not be used with stdin, samplerate, qtrim, minlength, or minavgquality.
upsample=f              Allow srt/sbt to upsample (duplicate reads) when the target is greater than input.

ADD REPLY • link 6.3 years ago by GenoMax 147k

0

Entering edit mode

But does it really choose alignments randomly?

ADD REPLY • link 3.3 years ago by boczniak767 ▴ 870

1

Entering edit mode

From what I understand, the seed is useful for repeated sub-sampling of a single BAM file. What it does - and I'm speculating here - ~~is it partitions a BAM into groups, assigns each a seed number and sub-samples from a particular group. When the seed number changes, sub-sampling is done from a different partition/group.~~ EDIT it uses the seed as a parameter to a random number generator, so I can't really speculate on the algorithm behind how the seed affects the reads picked, except for the fact that in random number generation, different seed usually results in a different set of random numbers.

About your fraction, I would guess samtools cannot get any more precise than 8 decimal points. Like genomax points out, there are other tools that offer better precision.

ADD REPLY • link 6.3 years ago by Ram 44k

0

Entering edit mode

Can you re-post (this info may be in one of your previous questions as I recall you referring to it) your logic behind doing this? How is the sub-sampling going to help your end goal?