Question

gene expression normaliztion

0

Entering edit mode

6 months ago

bioinfo1994 ▴ 20

I want to perform gene expression analysis on my RNA-seq data. In this experiment, I only have one sample with three replicates and no control or additional samples.

sample_rep1.bam
sample_rep2.bam
sample_rep3.bam

How should I normalize the data? I'm currently using edgeR's calcNormFactors, but it utilizes TMM, which normalizes between samples by adjusting library sizes without accounting for gene length, which is crucial in my case.

I need to obtain absolute gene expression values and compare gene expression within the same sample, so it's important to consider gene length. However, I'm uncertain about how to utilize the three replicates, as I believe I still need to normalize for library size across them. I would be grateful if anyone has a workflow for a situation like this and could kindly share it with me.

Thank you!

edgeR RNA-seq gene-expression • 953 views

ADD COMMENT • link updated 5 months ago by Gordon Smyth ★ 7.7k • written 6 months ago by bioinfo1994 ▴ 20

0

Entering edit mode

6 months ago

Buffo ★ 2.4k

You need to calculate TPM, RPKMs, or FPKMs (whatever fits better for your data), which are the abundances normalized by million reads and gene/transcript length.

ADD COMMENT • link 6 months ago by Buffo ★ 2.4k

0

Entering edit mode

Thank you. If I only had one replicate, I wouldn't have questioned the necessity of using these methods. However, considering I have three replicates with varying library sizes, should I merge them? Alternatively, should I proceed as follows:

group <- factor(c("sample_rep1", "sample_rep2", "sample_rep3"))
y <- DGEList(counts = count_matrix, group = group)
y = calcNormFactors(y) # do I need to use this?
final_df = as.data.frame(fpkm(y))

Then the top 200 rows are the highest expressed genes, and I don't need to do any further normalization or model fitting?

ADD REPLY • link 6 months ago by bioinfo1994 ▴ 20

0

Entering edit mode

No, you don't. The calculation of TPM or FPKM already includes the normalization (per million factor). StringTie should be the easiest and fastest way to do it.

ADD REPLY • link 6 months ago by Buffo ★ 2.4k

0

Entering edit mode

6 months ago

ATpoint 86k

The critical part is to obtain a robust length estimate per gene. Gene length (or transcript length) is not just start to end of the gene (or exon) bodies, but must respect which isoforms, and in which abundance, are actually expressed. Normalization itself can simply be what you already did, and then divide these values by the length estimates. For the quatification I recommend salmon which will give a length estimate per isoform. You can then use tximport (R/Bioc) to aggregate that to gene level. This will summarize the length estimates per transcript to a single average value per gene and sample. That outperforms alignment-based methods such as STAR+featureCounts which do not estimate a per-sample length per gene but rather take the same length information for all samples, not respective differential isoform abundances.

ADD COMMENT • link 6 months ago by ATpoint 86k

score 1 · Accepted Answer · 2024-06-23

1

Entering edit mode

6 months ago

Gordon Smyth ★ 7.7k

Your data consists just of three replicates, so there is only one group rather than three as in the code in your comment to Buffo. To get normalized FPKM in edgeR combined over the three replicates:

y <- DGEList(count_matrix)
y <- normLibSizes(y)
FPKM <- rpkmByGroup(y, gene.length=GeneLength)

This code will treat the three replicates as one group. It estimates the FPKM for the group by fitting a negative binomial GLM to the three samples with an intercept only. This automatically adjusts for everything, including varying library sizes between the replicates.

But you need to have a measure of gene length (GeneLength) for each row of the count matrix, which would usually be obtained from the same software that created the genewise read counts in the first place.

Note that normLibSizes is the new name for calcNormFactors. It is the same function, just with a new name to make its purpose clearer.

ADD COMMENT • link 6 months ago by Gordon Smyth ★ 7.7k

0

Entering edit mode

Thank you, your assistance was very helpful. If I have more samples, each with 3 replicates, and I want to analyze the expression of a few specific genes, how should I proceed? None of the samples are controls, and I believe differential expression analysis isn't necessary. I just need to see if some genes are highly expressed and others are low across all samples. If I'm correct, absolute gene counts should suffice for this task. Should I create a DGElist with the group factor, normalize, and then use RPKM by group?

    group <- factor(c("sample_1", "sample_1", "sample_1", "sample_2", "sample_2", "sample_2")) # each sample has three replicates
    y <- DGEList(counts = count_matrix, group = group) #count_matrix will specify that I have 3 rep for each sample
    y <- normLibSizes(y)
    RPKM <- rpkmByGroup(y, gene.length=GeneLength)

Additionally, if I want to calculate log2 TPM, does it make sense to proceed as follows?:

  tpm_from_rpkm <- function(x) {
  rpkm.sum <- colSums(x)
  return(t(t(x) / (1e-06 * rpkm.sum)))}

  TPM_counts <- tpm_from_rpkm(RPKM)
  TPM_counts_df = as.data.frame(TPM_counts)
  TPM_counts_log = log(TPM_counts_df + 1)

Thank you for your help!

ADD REPLY • link 5 months ago by bioinfo1994 ▴ 20

0

Entering edit mode

Yes, you can use rpkmByGroup().

I don't recommend scaling the RPKMs to add up to 1e6 (which is all your RPM calculation is doing). That will destroy the TMM normalization and adds no extra information.

ADD REPLY • link 5 months ago by Gordon Smyth ★ 7.7k