Question

Normalizacion RNASeq Doubts

0

Entering edit mode

12 months ago

Julián • 0

Hello, I have a question when you are doing an RNA-Seq analysis of a bacteria, and I get the data after aligning it with bwa and counting the raw counts with featureCounts. I have normalized it with DESEQ2, which I understand is a normalization by library size, should I do another normalization by gene size or is it not necessary?

Also I used a code like this to get the foldchange at 1.5, 1.8 and 2 with a p-value > 0.05 it is ok doing it this way?

# Cargar las librerías necesarias
library(DESeq2)
library(openxlsx)
library(ggplot2)

#  DESeq.ds already created and normalized
# Definir el diseño experimental
design(DESeq.ds) <- ~ condition

# Realizar el análisis de expresión diferencial
DESeq.ds <- DESeq(DESeq.ds)

# Extraer los resultados para la comparación Wt06 vs RpoS06
results_Wt06_vs_RpoS06 <- results(DESeq.ds, contrast = c("condition", "Wt06", "RpoS06"))

# Extraer los resultados para la comparación Wt25 vs RpoS25
results_Wt25_vs_RpoS25 <- results(DESeq.ds, contrast = c("condition", "Wt25", "RpoS25"))

# Convertir los resultados a data frame y añadir nombres de genes
results_df_Wt06_vs_RpoS06 <- as.data.frame(results_Wt06_vs_RpoS06)
results_df_Wt06_vs_RpoS06$Gene <- rownames(results_df_Wt06_vs_RpoS06)
results_df_Wt06_vs_RpoS06 <- results_df_Wt06_vs_RpoS06[, c("Gene", colnames(results_df_Wt06_vs_RpoS06)[-ncol(results_df_Wt06_vs_RpoS06)])]

results_df_Wt25_vs_RpoS25 <- as.data.frame(results_Wt25_vs_RpoS25)
results_df_Wt25_vs_RpoS25$Gene <- rownames(results_df_Wt25_vs_RpoS25)
results_df_Wt25_vs_RpoS25 <- results_df_Wt25_vs_RpoS25[, c("Gene", colnames(results_df_Wt25_vs_RpoS25)[-ncol(results_df_Wt25_vs_RpoS25)])]

# Guardar los resultados sin filtrar en un archivo Excel
write.xlsx(results_df_Wt06_vs_RpoS06, file = "Wt06_vs_RpoS06_unfiltered.xlsx", rownames = FALSE)
write.xlsx(results_df_Wt25_vs_RpoS25, file = "Wt25_vs_RpoS25_unfiltered.xlsx", rownames = FALSE)

# Función para filtrar, contar y devolver los resultados basados en el valor p
filter_and_count <- function(results_df, fold_change_threshold) {
  filtered_results <- results_df[which(results_df$pvalue < 0.05 & abs(results_df$log2FoldChange) > fold_change_threshold), ]
  num_overexpressed <- sum(filtered_results$log2FoldChange > 0)
  num_underexpressed <- sum(filtered_results$log2FoldChange < 0)

  # Mostrar el número de genes sobreexpresados e infraexpresados
  cat("Para el umbral de fold change >", fold_change_threshold, ":\n")
  cat("Número de genes sobreexpresados:", num_overexpressed, "\n")
  cat("Número de genes infraexpresados:", num_underexpressed, "\n")

  return(list(filtered_results = filtered_results, num_overexpressed = num_overexpressed, num_underexpressed = num_underexpressed))
}

# Aplicar los filtros y contar los genes
fc_thresholds <- c(1.5, 1.8, 2)
comparisons <- c("Wt06_vs_RpoS06", "Wt25_vs_RpoS25")

results_list <- list()
for (comparison in comparisons) {
  results_df <- get(paste0("results_df_", comparison))
  for (fc in fc_thresholds) {
    result <- filter_and_count(results_df, fc)
    results_list[[paste(comparison, "FC >", fc)]] <- result
    output_file <- paste0(comparison, "_filtered_pvalue_0.05_fc_", fc, ".xlsx")
    write.xlsx(result$filtered_results, file = output_file, rownames = FALSE)
  }
}

normalization bacteria featureCounts RNA-seq • 932 views

ADD COMMENT • link updated 12 months ago by Ram 45k • written 12 months ago by Julián • 0

0

Entering edit mode

See my answer to a similar quesiton here: Why employ normalization methods, and how can they be utilized in DEG analysis?

ADD REPLY • link 12 months ago by i.sudbery 21k

score 0 · Answer 1 · 2024-07-17

0

Entering edit mode

12 months ago

BioinfGuru ★ 2.1k

Read: The best breakdown of normalisation in my opinion

Just a few points about the DESEq2 steps:

DESeqDataSetFromMatrix() or DESeqDataSet() # already done
design(). Are there any other metadata columns that should be included in the design? What does head(colData(DESeq.ds)) print?
Filter low count reads (not done, but not strictly necessary)
deseq() looks good
results() looks good

deseq2 normalisation with vst/rlog normalizes for sequencing depth, average transcript length (see ATpoint comment below), and models metadata batches included in design (see ATpoint comment below). No need to normalise further. Organism makes no difference, 'bacteria' means nothing to deseq2.

Problem:

The filter_and_count() function wrongly uses pvalue. Use padj instead. So results_df$padj < 0.05

ADD COMMENT • link 12 months ago by BioinfGuru ★ 2.1k

1

Entering edit mode

deseq2 normalisation with vst/rlog normalizes for sequencing depth, transcript length, and library size, and batch effect (if included in design).

Not transcript length. Average transcript length per gene when doing gene level analysis and data were imported via tximport from a preprocessing pipeline that actually returns such a transcript length estimate, for example salmon. There is no actual gene length correction going on. There is also no batch correction. The design merely influences the amount of shrinkage.

By the way, library size and sequencing depth is the same. The important thing is that beyond depth the composition is corrected. See for a long explanation what that is:

ADD REPLY • link 12 months ago by ATpoint 88k

0

Entering edit mode

I just want to check understanding of something ATpoint if you don't mind. To answer OPs question: is normalisation by gene length necessary?

I don't think it is. The full length of a gene never occurs, just the transcripts. So really it is transcript length that is the important factor not gene length. I understand the detection bias toward longer transcripts. So whether it is actual transcript length or average... at least there must be some accounting for transcript length.

Is this a serious misunderstanding?

ADD REPLY • link 12 months ago by BioinfGuru ★ 2.1k

1

Entering edit mode

Situation A: Gene A is longer than gene B. Both have same actual expression level but gene A has higher counts due to being longer. Does DESeq2 account for this? No!

Situation B: A gene is expressed via different isoforms. Some samples express longer isoforms, some shorter ones. Expressing longer isoforms generates higher counts leading (if uncorrected) to the result that the gene could be overexpressed in the samples that express the longer isoforms. Does DESeq2 correct for that? Yes, if as I said above, the preprocessing estimates these "effective transcript lengths" (salmon does) and the aggregation from transcript to gene counts is able to import this information (tximport is). DESeq2 can then use a length offset matrix from tximport to account for differences in effective gene length per sample.

That in B is not a gene length correction. It is a correction for average transcript length expressed per gene. A != B.

Is gene length correction necessary? For DE analysis no, and not even recommended because the relationship between the measured expression level (counts) and the gene dispersions is important during DE analysis. Is it helpful for plots? Yes, when comparing expression between genes and one want no influence of counts simply due to length. Simply dividing gene length (of effective gene length as estimated by e.g. salmon+tximport) can help here.

ADD REPLY • link 12 months ago by ATpoint 88k