Sub-sampling a BAM to a fixed number of reads
1
0
Entering edit mode
18 months ago
matt81rd ▴ 10

Hi all,

I have recently been trying to write a simple python script that can produce contaminated synthetic paired end reads for a tool my group are creating. However, in writing this script I've been using samtools view -s to sample the generated BAM files (one BAM that acts as the contaminate and one that acts as the base). I was running into problems where the final merged BAM would have the wrong proportions of reads (so despite me saying I only wanted 40% of my reads to be from teh contaminate there were more like 46%) despite me using the above and I realised that samtools view -s doesn't actually sub sample to a fixed number of reads. Does anyone know of a better way of doing this? I have read around and Picard or reformat.sh from BBsuiite are mentioned quite a lot?.

The full workflow works something like below:

  • Generate synthetic reads for the Base and contaminate fasta (this includes most importantly the BAM file).
  • Count the reads of both BAM files.
  • Using the read counts I downsample the larger BAM file so they are both the same size (usually downsamples the contaminate BAM) using again samtools view -s
  • Now using samtools view -s subsample both BAM to the desiered percentages (e.g. 40% from the contam BAM and 60% from the Base BAM)
  • Merge the two sampled BAM files using samtools cat
  • Sort the merged BAM file sing samtools sort
  • Convert the sorted BAM file into paired end fastq reads using samtools fastq

To provide some more context here are the read counts for all the releveant BAM files through the above workflow:

  • Contaminate_bam = 1665618
  • base_bam = 1724414
  • downsampled_contam_bam = 1272664
  • sampled_downsampled_contam_40 = 665470 (this is the step producing the error as 40% of the downsampled reads sould be approx 50k reads but samtools view -s is produing a BAM with higher reads)
  • sampled_base_60 = 764702 (again 60% of the base_bam should be approx 100k reads but I'm getting considerably less)
  • merged_base_contam = 1430172

Any help would be great.

samtools BAM • 2.4k views
ADD COMMENT
2
Entering edit mode
18 months ago
GenoMax 147k

Unless this is an exercise you need to do with samtools, you could use reformat.sh from BBMap suite to do this. seqtk may also work.

reformat.sh -Xmx4g in=in.bam out=out.bam samplereadstarget=NN

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.
ADD COMMENT
0
Entering edit mode

Thanks for this! I guess there isn't anyway to specify a percentage of reads like with: samtools view -s 0.4 taking 40%? Instead you just specify the exact nuber of reads with reformat.sh?

ADD REPLY
1
Entering edit mode

You could use samplerate=0.4 for that. The number may not be absolutely exact though.

ADD REPLY
0
Entering edit mode

Ah yes! Sorry totally missed that in your first message! Will have a play round and see if i can swap it into my script.

ADD REPLY

Login before adding your answer.

Traffic: 2259 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6