I am a co-author on a paper, and they are asking me for read depth of sequencing of a bacterial organism. They did 2x150bp Illumina whole-genome sequencing. Each paired-end fastq for my sample contains 1.33M reads.
My Qualimap and FastQC analyses did not tell me "read depth" but they did tell me the number of sequences per fastq file, and I know the size of the reads (150bp) and the size of the reference genome (4.3Mb).
Given that there are two .fastq files per sample, each sample has 1.33 * 2 = 2.66 M reads. Each read is 150bp, and since there are two (fwd and reverse) runs per sample, there are 399 Mbp sequenced per sample. Divide that by the size of the reference genome at 4.3Mb, makes 93X read depth. Average depth per bp in the genome is sequenced 93 times.
Am I correct? Or, do I not sum the forward & reverse reads in the read count?
Hi Madde,
Yes, you're calculating the theoretical sequencing depth correctly, assuming even distribution of reads across the genome. On the other hand, empirical sequencing depth, which requires aligned reads (e.g., .bam, .cram), takes into account the actual number of reads covering different genomic loci. Empirical sequencing depth can be calculated using Pierre's recommendation of mosdepth, which seems to be more efficient than samtools depth. There's also samtools coverage. When calculating base coverage in-detail, mosdepth and samtools depth may be better.
With a sorted and indexed BAM files, mosdepth can be used to calculate the average sequencing depth and other depth statistics across 500 bp windows for all samples like this:
Here's an example for calculating coverage of a specific region using samtools coverage:
Hope this is helpful.
Maze
Thank you! I will give them my calculations for the theoretical depth. I think my misunderstanding is if the forward and reverse read counts are additive.
I also have empirical coverage calculated by samtools & qualimap BamQC which gives me the mean coverage across the genome. I will communicate with the collaborators that the empirical coverage data is much more accurate and probably what they will want to report, because it calculates depth of coverage of only reads which align to the reference.
Forward and reverse reads represent the same library fragment. So you can't double count.
Just for clarity: assuming uniform coverage of the genome (i.e., reads evenly distributed across all loci) and thus uniform depth in this theoretical scenario, here's how one can calculate theoretical sequencing depth for paired-end sequencing samples where each read is 150bp and there are 1.33e06 fragments total:
Where
If each paired-end sample file contains 1.33e06 reads in its forward file (R1) and 1.33e06 reads in its reverse file (R2), there would be 2.66e06 total reads sequenced, which is then multiplied by 150bp to get the total number of bases sequenced (3.99e08), followed by division by the size (in bp) of the reference genome, calculating to 93X coverage.
Maze
That is theoretical
X
fold coverage for a genome of size 4.3 Mb considering the total number of bases sequenced.It is possible that some bases in your genome may not be covered by a read. So there could be areas of the genome that have no or much lower than 93 reads. On the flip side there will be areas that would be covered > 93x.