How do I down-sample to, say, 13 million bases total?
Quick summary: I'm trying to get even depth of coverage, but each fastq file has a different read length and reference size.
More details: I have 60 samples, and each fastq file has a slightly different read length, so down-sampling by read count becomes complicated and each sample needs a different number of reads. I also have multiple bed files, and I'm trying to make sure each bed file ends up with exactly 13x coverage when I align the reads to the reference.
I could automate this: calculate the average read length of each fastq file, do the math, and then dynamically down-sample that file to the appropriate level. However, I'm hoping there is an easier way and I can simply down-sample to a specific base count.
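For reference, here is a minimal Python sketch of the per-sample math I'd be automating. The file name, the 1 Mb target size, and the hand-off to seqtk for the actual sampling are just placeholders for my setup, not part of any tool's required interface:

```python
def mean_read_length(fastq_path, sample_limit=100_000):
    """Estimate mean read length from the first `sample_limit` reads of an uncompressed FASTQ."""
    total, count = 0, 0
    with open(fastq_path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # sequence lines are every 4th line, offset 1
                total += len(line.strip())
                count += 1
                if count >= sample_limit:
                    break
    return total / count

def reads_for_target_depth(fastq_path, target_depth, target_size):
    """How many reads to keep so that roughly target_depth * target_size bases are sampled."""
    target_bases = target_depth * target_size  # e.g. 13x over a 1 Mb BED target = 13 million bases
    return int(target_bases / mean_read_length(fastq_path))

# Example: 13x over a 1,000,000 bp target region (placeholder values)
n_reads = reads_for_target_depth("sample01_R1.fastq", target_depth=13, target_size=1_000_000)
print(n_reads)  # then pass this count to something like: seqtk sample sample01_R1.fastq <n_reads>
```

This only guarantees the expected number of sampled bases, not per-position depth, which is why I'd rather have a tool that targets a base count directly.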
So... your plan is to map, look at per-base coverage across a set of targets, then downsample the reads so you have an absolute known coverage with zero variability? I'm confused by your calculations: are you just saying that sample 1 has X reads, so, given a mean length of Y bases/read, you can assume an average depth of 13x across all positions of interest? But you know there will be some variability due to, say, read QC, so you want to do the calculations per base rather than per read? I also don't know why you'd have a variable reference size here, so I'm lost on that detail as well.
Why not try something like BBTools' BBNorm?

I just want to guarantee 13x coverage.
I'm comparing the performance of multiple wet lab protocols. The following variables changed between samples:
The project was designed by a team of PhDs and I'm just the guy processing the data. I won't know what they are looking at/for until I see the final presentation.
Thanks for clarifying. I've never seen an absolute requirement this stringent, so I was just curious about the purpose. Genomax's solution below is a good starting place.