While reading a paper, I came across this figure caption: "Heatmap of normalized read counts for chlorophyll-protein complexes (LHCs) normalized by row to reflect relative expression of each gene (see scale bar)". Could someone clarify how the normalization by row might have been done?
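For reference, here are the two row-wise scalings I most often see behind heatmaps: a per-gene z-score and dividing each gene by its own maximum. This is only a sketch of those two common options. The caption does not say which one (if either) the authors used, and the function names and example numbers are made up for illustration.

```python
import numpy as np

def row_zscore(counts):
    """Center each row (gene) at 0 and scale it to unit standard deviation."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=1, keepdims=True)
    sd = counts.std(axis=1, ddof=1, keepdims=True)
    sd[sd == 0] = 1.0  # constant rows stay at 0 instead of dividing by zero
    return (counts - mean) / sd

def row_max_scale(counts):
    """Divide each row (gene) by its own maximum so values fall in [0, 1]."""
    counts = np.asarray(counts, dtype=float)
    row_max = counts.max(axis=1, keepdims=True)
    row_max[row_max == 0] = 1.0  # all-zero rows stay at 0
    return counts / row_max

# Hypothetical normalized counts: 2 genes (rows) x 3 samples (columns).
normalized = np.array([[10.0, 20.0, 30.0],
                       [1000.0, 1000.0, 4000.0]])

z = row_zscore(normalized)          # each gene compared only with itself
scaled = row_max_scale(normalized)  # each gene's peak sample becomes 1.0
```

Either way, a lowly expressed gene and a highly expressed gene end up on the same color scale, which would match the caption's "relative expression of each gene".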
I'm also curious how they binned transcripts by genus before running DESeq2, if anyone has any input on that!
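My current guess at what "binned by genus" means is that the count table is split by genus and each subset is normalized and tested separately, so that a bloom of one genus does not shift the size factors for everything else. That is an assumption on my part; another reading would be summing counts per genus. The sketch below illustrates the first reading with a plain NumPy version of median-of-ratios size factors (the approach DESeq2 uses by default). The helper names and counts are hypothetical, and this is not DESeq2 itself.

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios size factors for a transcripts x samples matrix."""
    counts = np.asarray(counts, dtype=float)
    # Use only transcripts with non-zero counts in every sample.
    keep = (counts > 0).all(axis=1)
    log_counts = np.log(counts[keep])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

def split_by_genus(counts, genera):
    """Return {genus: sub-matrix of the transcripts assigned to that genus}."""
    counts = np.asarray(counts, dtype=float)
    genera = np.asarray(genera)
    return {str(g): counts[genera == g] for g in np.unique(genera)}

# Hypothetical raw counts: 5 transcripts x 2 sampling dates.
# Genus A doubles in abundance on date 2; genus B does not change.
counts = np.array([[10, 20], [30, 60], [50, 100], [40, 40], [8, 8]])
genera = ["A", "A", "A", "B", "B"]

pooled = size_factors(counts)  # dominated by genus A's bloom
per_genus = {g: size_factors(sub)
             for g, sub in split_by_genus(counts, genera).items()}
```

With these made-up numbers, the pooled size factors follow genus A, so genus B's unchanged transcripts would look downregulated after pooled normalization. Per-genus size factors for B are both 1, which leaves B's counts unchanged.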
Full paper link: https://www.nature.com/articles/s41396-023-01416-x
From the methods section of the paper:

"Relative abundance of taxonomic groups

To assign taxonomic identification to each contig, a custom local protein database was used, incorporating: the re-annotated [91] MMETSP [92]; genes from Tara Ocean contigs with taxonomic identifiers identified by MetaEuk [93]; MaTOU [45]; and Uniref100 (updated May 2020) databases. Proteins were aligned with DIAMOND (v. 2.0.11, ultra-sensitive mode) [94]. A bit-score cutoff of 50 was used. The phylogeny assigned to the top bit-score of each alignment in this custom database was kept for further analysis. One hit for each phylum was kept if a protein sequence mapped to multiple algal phyla, given contigs could have been assembled from multiple phylogenetic phyla. The top twenty representative genera were determined based on summed transcripts per million (TPM) per sample. TPM was calculated by dividing the read counts of each transcript by the putative gene length in kilobases (reads per kilobase). Every read per kilobase was summed for each sample and divided by 1,000,000 to obtain a scaling factor. Reads per kilobase were divided by each sample's scaling factor to obtain TPM corresponding to each read and sample.

Differential expression analysis

Transcripts were grouped at the genus level before passing raw counts to DESeq2 [95] (version 1.28.1) in order to correct for the effect of one taxonomic group increasing in biomass or expressing higher amounts of mRNA per cell between sampling dates; this has been shown to skew the amount of up and downregulated genes [96]. DESeq2 was used to determine if a gene was significantly changed (False discovery rate/FDR adjusted p value <0.001 and log2 fold change > |1|). Three groups of comparisons were used to search for significant changes in gene expression within the deeply mixed layer; within the shallow mixed layer; and between respective depths for deep and shallow mixed layers (Fig. 1F).

Normalized read values from DESeq2 were used to make heatmaps from subsets of significantly altered genes (Figs. 2, 4, 5, 6 and 8)."
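To check my understanding of the TPM step quoted above, here is a minimal sketch of that arithmetic: reads per kilobase, a per-sample scaling factor, then TPM. The function name, counts, and gene lengths are hypothetical; only the three steps come from the methods text.

```python
import numpy as np

def tpm(counts, lengths_bp):
    """TPM for a transcripts x samples count matrix and gene lengths in bp."""
    counts = np.asarray(counts, dtype=float)
    length_kb = np.asarray(lengths_bp, dtype=float)[:, None] / 1000.0
    rpk = counts / length_kb                               # reads per kilobase
    scaling = rpk.sum(axis=0, keepdims=True) / 1_000_000   # one factor per sample
    return rpk / scaling

# Hypothetical data: 2 transcripts x 2 samples.
counts = np.array([[100, 200],
                   [300, 200]])
lengths_bp = np.array([1000, 3000])

values = tpm(counts, lengths_bp)
```

Each sample's column sums to one million by construction. That makes TPM useful for ranking genera by summed TPM, but note that the methods pass raw counts (not TPM) to DESeq2.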
Figure the sentence of interest is from: