Question

Normalizing RNA read counts for each gene?

0

Entering edit mode

2.3 years ago

pearl2070 ▴ 10

While reading a paper, I encountered the line "Heatmap of normalized read counts for chlorophyll-protein complexes (LHCs) normalized by row to reflect relative expression of each gene (see scale bar)" and was wondering if someone would be able to clarify how the normalization by row might have been done?

I'm also curious how they binned by genus before doing DESeq, if anyone might have any input on that!

Full paper link: https://www.nature.com/articles/s41396-023-01416-x

From the methods section of the paper: "Relative abundance of taxonomic groups To assign taxonomic identification to each contig, a custom local protein database was used, incorporating: the re-annotated [91] MMETSP [92]; genes from Tara Ocean contigs with taxonomic identifiers identified by MetaEuk [93]; MaTOU [45]; and Uniref100 (updated May 2020) databases. Proteins were aligned with DIAMOND (v. 2.0.11, ultra-sensitive mode) [94]. A bit-score cutoff of 50 was used. The phylogeny assigned to the top bit- score of each alignment in this custom database was kept for further analysis. One hit for each phylum was kept if a protein sequence mapped to multiple algal phyla, given contigs could have been assembled from multiple phylogenetic phyla. The top twenty representative genera were determined based on summed transcripts per million (TPM) per sample. TPM was calculated by dividing the read counts of each transcript by the putative gene length in kilobases (reads per kilobase). Every read per kilobase was summed for each sample and divided by 1,000,000 to obtain a scaling factor. Reads per kilobase were divided by each sample's scaling factor to obtain TPM corresponding to each read and sample. Differential expression analysis Transcripts were grouped at the genus level before passing raw counts to DESeq2 [95] (version 1.28.1) in order to correct for the effect of one taxonomic group increasing in biomass or expressing higher amounts of mRNA per cell between sampling dates; this is has been shown to skew the amount of up and downregulated genes [96]. DESeq2 was used to determine if a gene was significantly changed (False discovery rate/FDR adjusted p value <0.001 and log2 fold change > |1|). Three groups of comparisons were used to search for significant changes in gene expression within the deeply mixed layer; within the shallow mixed layer; and between respective depths for deep and shallow mixed layers (Fig. 1F). Normalized read values from DESeq2 were used to make heatmaps from subsets of significantly altered genes (Figs. 2, 4, 5, 6 and 8)."

normalization RNA-seq metatranscriptomes • 1.4k views

ADD COMMENT • link updated 2.3 years ago by Trivas ★ 1.9k • written 2.3 years ago by pearl2070 ▴ 10

0

Entering edit mode

Figure the sentence of interest is from: enter image description here

ADD REPLY • link 2.3 years ago by pearl2070 ▴ 10

score 2 · Accepted Answer · 2023-05-10

I'm also curious how they binned by genus before doing DESeq, if anyone might have any input on that!

They aligned the transcripts (or protein products obtained from TransDecoder) against a protein database (they used DIAMOND for the alignment) and then assigned the phylogeny based on the top bit-score of each alignment. Based on the taxonomy, now you can bin the transcripts based on the genus to which they belong.

score 2 · Accepted Answer · 2023-05-10

2

Entering edit mode

2.3 years ago

chaco001 ▴ 40

It looks like they started with normalized reads made by the variance stabilizing transform in DESeq2. So probably starting with this:

library(DESeq2)
vsd <- varianceStabilizingTransformation(dds)
newcounts <- assay(vsd)

Then, from eye-balling that figure, it looks like the per-row (per-gene) data are scaled to be between 0-1. So for each row, they probably did (reads - min(reads)) / (max(reads) - min(reads)). This could be put together with some apply commands.

ADD COMMENT • link 2.3 years ago by chaco001 ▴ 40

0

Entering edit mode

Is the reads in the formula, ((#reads of one gene across all samples) - min(#reads of that gene)) / (max(#reads of that gene) - min(#reads of that gene)) ? What would be the benefit of doing this-- making gene-to-gene comparisons less biased by accounting for the differences in how frequently each gene is expressed in the whole dataset?

ADD REPLY • link 2.3 years ago by pearl2070 ▴ 10

1

Entering edit mode

Heatmaps have a single color scale. If you have one gene that has counts between 300-500 but most of your other genes are 10-50, you won't be able to visually compare the difference between the 10-50 genes because your scale is skewed by the 300-500 count gene. Doing the per-row normalization removes that so you can easily visualize the differences across samples.

ADD REPLY • link 2.3 years ago by Trivas ★ 1.9k