Example:
Gene     baseMean
Gene A   20000.00
Gene B   80.00
Gene C   0
Refresher:
- baseMean: 'The values above are the average of the normalized count values, dividing by size factors, taken over all samples, normalizing for sequencing depth. It does not take into account gene length. The base mean is used in DESeq2 only for estimating the dispersion of a gene (it is used to estimate the fitted dispersion). For this task, the range of counts for a gene is relevant but not the gene's length (or other technical factors influencing the count, like sequence content).'
- Gene length: 'Accounting for gene length is necessary for comparing expression between different genes within the same sample.'
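To make the refresher concrete, here is a minimal Python sketch of how a baseMean-like value can be computed: median-of-ratios size factors, then the mean of the size-factor-normalized counts over all samples. This is only an illustration under stated assumptions, not DESeq2's actual code. The raw counts and the function names `size_factors` and `base_means` are made up for this example.

```python
import math
from statistics import mean, median

# Hypothetical raw counts for three samples (columns); not real data.
counts = {
    "Gene A": [10000, 20000, 40000],
    "Gene B": [40, 80, 160],
    "Gene C": [0, 0, 0],
}

def size_factors(counts):
    """Median-of-ratios size factors, one per sample.

    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(next(iter(counts.values())))
    # Log of the geometric mean of each usable gene across samples.
    log_geo_means = {
        gene: mean([math.log(c) for c in row])
        for gene, row in counts.items()
        if all(c > 0 for c in row)
    }
    factors = []
    for j in range(n_samples):
        log_ratios = [
            math.log(counts[gene][j]) - lgm
            for gene, lgm in log_geo_means.items()
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors

def base_means(counts, factors):
    """Mean of normalized counts (count / size factor) over all samples."""
    return {
        gene: mean([c / s for c, s in zip(row, factors)])
        for gene, row in counts.items()
    }

sf = size_factors(counts)
for gene, value in base_means(counts, sf).items():
    print(gene, round(value, 2))
```

With these made-up counts the samples differ only in sequencing depth, so the normalized values should come out close to the example table above. Gene length never enters the calculation, which is why baseMean is not suited to ranking different genes against each other.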
My questions:
- Is the baseMean value the final dispersion estimate used before fitting the GLM and testing?
- Observation: 'Gene A' has the highest normalized count, 'Gene B' the lowest among the detected genes, and 'Gene C' was not detected in any sample. Is this reading correct without comparing a single gene between samples? To compare 'Gene A' between samples, the log2FoldChange value is used and padj estimates significance.
- Would a combination of baseMean and log2FoldChange be useful to determine whether a gene is highly present (expressed?) in all samples and differentially expressed between samples? Essentially, does baseMean represent the overall transcript (expression?) level?
Thank you in advance!
Very interesting. I have also used Salmon count matrices for this dataset; however, I did not consider gene length so thoroughly before posting. Thank you for this input!
Perfect, then you can also gather the TPM information after using the tximport function :) Glad if that helped.
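For anyone wondering what the TPM values add over baseMean, here is a small Python sketch of the usual TPM formula for a single sample: divide each count by the gene's length, then rescale so the values sum to one million. The counts, the lengths, and the function name `tpm` are invented for illustration. In practice Salmon reports TPM itself, so nothing needs to be recomputed by hand.

```python
# Hypothetical read counts and effective lengths (bp) for one sample.
sample_counts = {"Gene A": 20000, "Gene B": 80, "Gene C": 0}
gene_lengths = {"Gene A": 10000, "Gene B": 200, "Gene C": 1500}

def tpm(counts, lengths):
    """Transcripts per million: length-normalized rates scaled to sum to 1e6."""
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}

values = tpm(sample_counts, gene_lengths)
for gene, value in values.items():
    print(gene, round(value, 1))
```

In this made-up case 'Gene A' has 250 times the reads of 'Gene B' but only about 5 times the TPM, because 'Gene A' is assumed to be much longer. That is the gene-length effect mentioned above, and it is why TPM is the better choice for comparing different genes within one sample.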