Hi all,
I have recently been trying to write a simple python script that can produce contaminated synthetic paired end reads for a tool my group are creating. However, in writing this script I've been using samtools view -s
to sample the generated BAM files (one BAM that acts as the contaminate and one that acts as the base). I was running into problems where the final merged BAM would have the wrong proportions of reads (so despite me saying I only wanted 40% of my reads to be from teh contaminate there were more like 46%) despite me using the above and I realised that samtools view -s doesn't actually sub sample to a fixed number of reads. Does anyone know of a better way of doing this? I have read around and Picard or reformat.sh from BBsuiite are mentioned quite a lot?.
The full workflow works something like below:
- Generate synthetic reads for the Base and contaminate fasta (this includes most importantly the BAM file).
- Count the reads of both BAM files.
- Using the read counts I downsample the larger BAM file so they are both the same size (usually downsamples the contaminate BAM) using again
samtools view -s
- Now using
samtools view -s
subsample both BAM to the desiered percentages (e.g. 40% from the contam BAM and 60% from the Base BAM) - Merge the two sampled BAM files using
samtools cat
- Sort the merged BAM file sing
samtools sort
- Convert the sorted BAM file into paired end fastq reads using
samtools fastq
To provide some more context here are the read counts for all the releveant BAM files through the above workflow:
- Contaminate_bam = 1665618
- base_bam = 1724414
- downsampled_contam_bam = 1272664
- sampled_downsampled_contam_40 = 665470 (this is the step producing the error as 40% of the downsampled reads sould be approx 50k reads but samtools view -s is produing a BAM with higher reads)
- sampled_base_60 = 764702 (again 60% of the base_bam should be approx 100k reads but I'm getting considerably less)
- merged_base_contam = 1430172
Any help would be great.
Thanks for this! I guess there isn't anyway to specify a percentage of reads like with:
samtools view -s 0.4
taking 40%? Instead you just specify the exact nuber of reads with reformat.sh?You could use
samplerate=0.4
for that. The number may not be absolutely exact though.Ah yes! Sorry totally missed that in your first message! Will have a play round and see if i can swap it into my script.