Question

How do I calculate read depth of sequencing?

0

Entering edit mode

5 weeks ago

Madde ▴ 20

I am a co-author on a paper, and they are asking me for read depth of sequencing of a bacterial organism. They did 2x150bp Illumina whole-genome sequencing. Each paired-end fastq for my sample contains 1.33M reads.

My Qualimap and FastQC analyses did not tell me "read depth" but they did tell me the number of sequences per fastq file, and I know the size of the reads (150bp) and the size of the reference genome (4.3Mb).

Given that there are two .fastq files per sample, each sample has 1.33 * 2 = 2.66 M reads. Each read is 150bp, and since there are two (fwd and reverse) runs per sample, there are 399 Mbp sequenced per sample. Divide that by the size of the reference genome at 4.3Mb, makes 93X read depth. Average depth per bp in the genome is sequenced 93 times.

Am I correct? Or, do I not sum the forward & reverse reads in the read count?

illumina read-depth WGS • 501 views

ADD COMMENT • link updated 5 weeks ago by mazegriff ▴ 100 • written 5 weeks ago by Madde ▴ 20

1

Entering edit mode

Hi Madde,

Yes, you're calculating the theoretical sequencing depth correctly, assuming even distribution of reads across the genome. On the other hand, empirical sequencing depth, which requires aligned reads (e.g., .bam, .cram), takes into account the actual number of reads covering different genomic loci. Empirical sequencing depth can be calculated using Pierre's recommendation of mosdepth, which seems to be more efficient than samtools depth. There's also samtools coverage. When calculating base coverage in-detail, mosdepth and samtools depth may be better.

With a sorted and indexed BAM files, mosdepth can be used to calculate the average sequencing depth and other depth statistics across 500 bp windows for all samples like this:

for sample in sample1.bam sample2.bam sample3.bam; do
    mosdepth -n --by 500 ${sample%.bam} $sample
done

Here's an example for calculating coverage of a specific region using samtools coverage:

 samtools coverage  -m -r 'chr1:24500000-25600000' --plot-depth -w 32 -A input.bam

Hope this is helpful.

Maze

ADD REPLY • link 5 weeks ago by mazegriff ▴ 100

1

Entering edit mode

Thank you! I will give them my calculations for the theoretical depth. I think my misunderstanding is if the forward and reverse read counts are additive.

I also have empirical coverage calculated by samtools & qualimap BamQC which gives me the mean coverage across the genome. I will communicate with the collaborators that the empirical coverage data is much more accurate and probably what they will want to report, because it calculates depth of coverage of only reads which align to the reference.

ADD REPLY • link 5 weeks ago by Madde ▴ 20

0

Entering edit mode

I think my misunderstanding is if the forward and reverse read counts are additive.

Forward and reverse reads represent the same library fragment. So you can't double count.

ADD REPLY • link 5 weeks ago by GenoMax 147k

0

Entering edit mode

Just for clarity: assuming uniform coverage of the genome (i.e., reads evenly distributed across all loci) and thus uniform depth in this theoretical scenario, here's how one can calculate theoretical sequencing depth for paired-end sequencing samples where each read is 150bp and there are 1.33e06 fragments total:

(2 x 1.33e06 x 150) / 4.3e06 = ~93X

Where

Total bases sequenced / reference genome size = theoretical coverage

If each paired-end sample file contains 1.33e06 reads in its forward file (R1) and 1.33e06 reads in its reverse file (R2), there would be 2.66e06 total reads sequenced, which is then multiplied by 150bp to get the total number of bases sequenced (3.99e08), followed by division by the size (in bp) of the reference genome, calculating to 93X coverage.

Maze

ADD REPLY • link 5 weeks ago by mazegriff ▴ 100

0

Entering edit mode

makes 93X read depth. Average depth per bp in the genome is sequenced 93 times.

That is theoretical X fold coverage for a genome of size 4.3 Mb considering the total number of bases sequenced.

It is possible that some bases in your genome may not be covered by a read. So there could be areas of the genome that have no or much lower than 93 reads. On the flip side there will be areas that would be covered > 93x.

ADD REPLY • link 5 weeks ago by GenoMax 147k

score 0 · Answer 1 · 2024-10-16

0

Entering edit mode

5 weeks ago

Pierre Lindenbaum 164k

Given that there are two .fastq files per sample

but you cannot be sure that your reads are mapped, that there is no clipping, no indel, etc...

use a tool like mosdepth: https://github.com/brentp/mosdepth

ADD COMMENT • link 5 weeks ago by Pierre Lindenbaum 164k