Hello,
I have read posts regarding how coverage/depth is calculated and tools you can use however I still have questions remaining. I am trying to calculate the mapped average read depth for nanopore direct RNAseq reads and PacBio cDNA reads. I am aligning to a reference transcriptome and genome to see how output may differ.
1) What's the difference between depth and coverage? I have heard the two used interchangeably.
2) How do you calculate average read depth when using a reference transcriptome? Is it calculated with zero coverage regions? or do you need at least 1 mapped transcript since many transcripts won't be expressed? I have long reads and isoforms of different lengths. I would think it's sum(transcript length * count) / (total bp in transcriptome) when including zero coverage regions.
3) Same question as #2 except with a reference genome. Should coverage even be calculated with a reference genome when using RNAseq data?
4) What tool is recommended to calculate both? I have heard samtools depth
or samtools coverage
but again i'm not sure the difference between the two (question #1), the exact calculation being performed (questions #2 & #3), and how this would be different when providing a BAM file with different reference types (transcriptome/genome).
My understanding (anyone correct me if I'm wrong) is that depth refers to the number of reads sequenced, and coverage is often used in genomics to refer how many times the genomic sequence is sequenced in average. The tools you mention do two different things, but
samtools coverage
is more apt for genomics as it gives you general metrics for coverage. I think you would be better off by normalising your reads to something like FPKM (super cool thread on this topic here), but you shouldn't use them to perform differential gene expression analysis. Using FPKM might give you a metric to compare transcriptomes in terms of coverage, taking only into account the fraction of the genome that matters.