I am trying to count gene expression for differential expression analysis. I used salmon to quantify genes against a reference transcriptome to get their TPM, but also --writeMappings to check their max read depth. I got:
Name EffectiveLength TPM NumReads MaxDepth
NR_146118.1 4,925 177,186 31,949,345 6
NR_132964.1 24 124,223 109,585 111,527
MaxDepth for NR_132964.1 was amended from 8,038 to 111,527 after comments pointing out that samtools without argument -d 0
limits MaxDepths to about 8000.
MaxDepth is the maximum read depth from samtools view on the output of --writeMappings. The rest of the columns are from salmon's quant.sf output file.
I understand why there is not a linear relationship between NumReads and TPM. But why are their MaxDepths so different?
Some more data. For MaxDepth == NA, samtools depth found no depth there---which shouldn't be the case for these top few expressed genes.
Name Length EffectiveLength TPM NumReads MaxDepth
1 U13369.1 42999 42847. 28.0 43870. 8034
2 NM_001190452.1 1555 1403. 12043. 618711. 8028
3 NM_001190470.1 1036 884. 13554. 438843. 8035
4 NM_001190476.1 1231 1079. 1531. 60512. 8005
5 NM_001190487.2 1395 1243. 1262. 57434. 8022
6 NM_001190702.1 1290 1138. 20012. 833999. 8036
7 NM_001190706.1 527 376. 13917. 191698. 4476
8 NM_001190708.1 1121 969. 3952. 140262. 8024
9 NR_003285.3 157 39.1 151275. 216553. 57
10 NR_003286.4 1869 1717. 777. 48851. 8000
11 NR_132964.1 128 24.1 124223. 109585 8038
12 NR_145819.1 13351 13199. 45.5 21968. 8001
13 NR_145822.1 5066 4914. 825. 148378. NA
14 NR_146117.1 13373 13221. 222. 107270. 8016
15 NR_146118.1 5077 4925. 177186. 31949345. 6
16 NR_146119.1 1869 1717. 439340. 27622373. 122
17 NR_146120.1 156 38.5 23238. 32776. NA
18 NR_146144.1 13315 13163. 1264. 608993. 8009
19 NR_146148.1 5054 4902. 649. 116444. NA
20 NR_146154.1 5055 4903. 2997. 538036. NA
It has been pointed out in the comments that many of the MaxDepths around the 8000s is due to the default depth limit of samtools view. Nonetheless, I'd still wonder why some of these transcripts with high TPM and NumReads don't show any depth.
probably because gene have different expression level?
Sure, but wouldn't their maximum read depths be roughly proportional to their TPM?
TPM and Depth are calculated using different algorithm for instance TPM as well as RPKM are counts based on the transcript length and read numbers for that gene. The lower TPM and the slightly higher depth might be due to a longer transcript. That's the explanation that I have, but I'm not the best bioinformatician around ;)
I understand your explanation, but I have some doubts as to the likelihood that this difference would cause such a counterintuitive and drastic difference in read depths. Thank nonetheless, though.
I cannot answer your question but for differential expression you do not need TPM. Also you would not need to look into the files. The recommended next step would be to use the
tximport
package to summarize the transcript level quantifications to the gene level followed by differential analysis with e.g.DESeq2/edgeR/limma
.Still I will tag Rob (the developer of
salmon
). Please add the command you used to get this MaxDepth column.Thank you. I obtained the MaxDepth column by using
samtools depth <ALIGNMENT> -r <REGION>
, where <alignment> is the indexed and sorted file output bysalmon ... --writeMappings=<ALIGNMENT>
. [1] Columns are combined with a simple R script [2].In my next step, I use
edgeR
to import gene counts inquant.sf
using the first and fourth columns (Name, TPM). That is, I use the R command[1]
[2]
tximport
?samtools depth
?tximport
. Obviously, I should be using that, but... I think the apparent discrepancy between maximum read depth and TPM still hold, even if we are looking at a transcript rather than gene level.samtools 1.9
,Using htslib 1.9
. No, I do not pass any option tosamtools depth
besides specifying the region. See [1] above, or the Makefile extract:You should set
-d
to0
when runningsamtools depth
otherwise it is using 8000 and that may be leading to odd numbers you are getting.You may also want to look at
mosdepth
as a fast alternative (https://github.com/brentp/mosdepth ).What I think you're observing is a result of the default ~8000 depth in
samtools depth
andsamtools mpileup
. In essence, this is the size of a buffer for holding reads and once the depth goes much above this you're going to start seeing some odd behavior due to that (I don't recall the details of this edge case, I'd have to check the source code to see how it's handled).