Question

Base count NOT read count - Downsampling to specific number of bases not reads?

0

Entering edit mode

5.1 years ago

jnowacki ▴ 110

How to I down-sample to say 13 million bases total?

Quick summary: I'm trying to get even depth of coverage and each fastq file has different read lengths & reference size.

More details: I have 60 samples and each fastq file has slightly different read length. So down-sampling by read count becomes complicated and each sample needs a different number of reads. I also have multiple bed files and I'm trying to make sure each bed file has exactly 13x coverage when I align the read to the reference.

I could automate this and calculate average read length of each fastq file, do the math and then dynamically downsample that fastq file to the appropriate levels. However, I would hope there was an easier way and I can simply just down-sample to a specific base count.

seqtk downsample Reads bases fastq • 1.6k views

ADD COMMENT • link updated 5.1 years ago by GenoMax 148k • written 5.1 years ago by jnowacki ▴ 110

0

Entering edit mode

So...your plan is to map, look at per-base coverage across a set of targets or something, then downsample the reads so you have an absolute known coverage with zero variability? I'm confused by your calculations - are you just saying that sample 1 has X reads so, given a mean length of Y bases/read, you can assume average depth of 13x across all positions of interest? But, you know there will be some variability due to, say, read QC or something, so you want to do the calculations per-base as opposed to per-read? I also don't know why you'd have a variable reference size here, so I'm lost with respect to that detail as well.

Why not try something like bbtools' BBNorm?

ADD REPLY • link 5.1 years ago by Brice Sarver ★ 3.8k

0

Entering edit mode

I just want to guarantee 13x coverage.

I'm comparing the performance of multiple wet lab protocols. The following variables changed between samples:

Reference: (hg19, hg38, etc)
Bed file:
multiple capture technologies
Read length
Wet lab protocols

The project was designed by a team of PhD's and I'm just the guy processing the data. I won't know what they are looking at/for until I see the final presentation.

ADD REPLY • link 5.1 years ago by jnowacki ▴ 110

0

Entering edit mode

Thanks for clarifying. I've never seen an absolute requirement this stringent, so I was just curious as to the purpose. Genomax' solution below is a good starting place.

ADD REPLY • link 5.1 years ago by Brice Sarver ★ 3.8k

score 1 · Answer 1 · 2019-10-24

1

Entering edit mode

5.1 years ago

GenoMax 148k

Use reformat.sh from BBMap suite with the following option. There are a ton of other options you can check via in-line help.

samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.

ADD COMMENT • link 5.1 years ago by GenoMax 148k

0

Entering edit mode

This seems to work. Thank you! The number seem to line up extremely well when confirmed with this method:

awk 'BEGIN{sum=0;}{if(NR%4==2){sum+=length($0);}}END{print sum;}' file.fq

And off by more than 1% when confirmed with this method:

cat test.fastq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c

As denoted here:

Counting Number Of Bases In A Fastq File

I may dig in deeper as the "inaccurate" method is the one that's the most up-voted. But thank you!

ADD REPLY • link 5.1 years ago by jnowacki ▴ 110