Question

When I change samples grouping for normalization by DESeq2 I get different results, Why?

0

Entering edit mode

3.0 years ago

Zahra ▴ 110

Hi all

I have downloaded the raw counts of RNA-Seq from TCGA for DE analysis. the number of primary tumor samples was 223 and the normal adjacent tissue was 41. I have performed the normalization by DESeq2 in two ways:

first, I normalized all of the primary tumor samples(223) and normal samples(41), then I visualized them by volcano plot that you can see here: enter image description here

Second, I separated the matched samples (41 tumor and 41 adjacent normal samples) and then normalized them. as you can see below the plot has changed significantly :

enter image description here My question is that which one is correct and why? when I studied how DESeq2 normalizes data here I can't understand the effect of grouping in the normalization. I mean what is the effect of group size on normalization by DESeq2?

Thanks for any help.

DESeq2 TCGA volcano RNA-Seq • 1.8k views

ADD COMMENT • link updated 3.0 years ago by Friederike 9.0k • written 3.0 years ago by Zahra ▴ 110

1

Entering edit mode

Can you post the code for both workflows so that it's easier to understand what was done?

ADD REPLY • link 3.0 years ago by Friederike 9.0k

0

Entering edit mode

Dear Friederike,

thanks for your reply. When I wanted to post my code here, I found a bad mistake in my code that change all my results, so I've edited the second figure. However, my previous question still remains: which normalization is correct? here is my code:

First strategy

query <- GDCquery(project = "TCGA-COAD" ,
                      data.category = "Transcriptome Profiling" ,
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - Counts" ,
                      sample.type = c("Primary Tumor", "Solid Tissue Normal"), 
                      experimental.strategy = "RNA-Seq")

    GDCdownload(query, method = "api")

    query.counts.colon <- GDCprepare(query)

    Colon.Matrix <- as.data.frame(SummarizedExperiment::assay(query.counts.colon))

    #sample grouping 
    dataSmTP <- TCGAquery_SampleTypes(getResults(query,cols="cases"),"TP")
    dataSmNT  <- TCGAquery_SampleTypes(getResults(query,cols="cases"),"NT") 

    #building an ordered matrix based on grouping 
    NormalGroup <- Colon.Matrix[,dataSmNT]
    TumorGroup <- Colon.Matrix[,dataSmTP]
    Sorted.Matrix <- cbind(NormalGroup,TumorGroup)

    #Normalization by DESeq2
    gr <- factor(c(rep("Normal", length(dataSmNT)) , rep("Tumor", length(dataSmTP))))
    colData <- data.frame(group=gr, type="paired-end")
    cds <- DESeqDataSetFromMatrix(Sorted.Matrix, colData, design= ~group)
    cds <- DESeq(cds)
    cnt <- log2(1+counts(cds, normalized=TRUE))

    dif <- data.frame(results(cds, c("group", "Normal" , "Tumor")))
    dif$padj <- p.adjust(dif$pvalue, method = "BH")
    dif <- dif[order(dif$padj),]


    #volcano plot
    ggplot(dif, aes(log2FoldChange, -log10(pvalue) , color=log2FoldChange))+geom_point()+theme_bw()

Second:

#Here is my code for matched samples
query <- GDCquery(project = "TCGA-COAD" ,
                  data.category = "Transcriptome Profiling" ,
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts" ,
                  sample.type = c("Primary Tumor", "Solid Tissue Normal"), 
                  experimental.strategy = "RNA-Seq")

GDCdownload(query, method = "api")

query.counts.colon <- GDCprepare(query)

Colon.Matrix <- as.data.frame(SummarizedExperiment::assay(query.counts.colon ))

#sample grouping 
dataSmTP <- TCGAquery_SampleTypes(getResults(query,cols="cases"),"TP")
dataSmNT  <- TCGAquery_SampleTypes(getResults(query,cols="cases"),"NT") 


#comparing barcodes for normal and tumor samples
#separating the matched samples based on TCGA barcode
mached_tumor <- c()
for (i in 1:length(dataSmNT)){
  split_normal <- unlist(strsplit(dataSmNT [i],"-"))
  for(j in 1:length(dataSmTP)){
    split_tumor <- unlist(strsplit(dataSmTP [j],"-"))
    if(split_normal [2] == split_tumor[2] & split_normal [3] == split_tumor[3] &
       split_normal [6] == split_tumor[6] & split_normal [7] == split_tumor[7]){
      mached_tumor <- c(mached_tumor, dataSmTP[j])
    }
  }
}

#building an ordered matrix based on grouping 
NormalGroup <- Colon.Matrix[,dataSmNT]
TumorGroup <- Colon.Matrix[,mached_tumor]
Sorted.Matrix <- cbind(NormalGroup,TumorGroup)


#Normalization by DESeq2
gr <- factor(c(rep("Normal", length(dataSmNT)) , rep("Tumor", length(mached_tumor))))
colData <- data.frame(group=gr, type="paired-end")
cds <- DESeqDataSetFromMatrix(Sorted.Matrix, colData, design= ~group)
cds <- DESeq(cds)
cnt <- log2(1+counts(cds, normalized=TRUE))

dif <- data.frame(results(cds, c("group", "Normal" , "Tumor")))
dif$padj <- p.adjust(dif$pvalue, method = "BH")
dif <- dif[order(dif$padj),]

#volcano plot
ggplot(dif, aes(log2FoldChange, -log10(pvalue) , color=log2FoldChange))+geom_point()+theme_bw()

ADD REPLY • link 3.0 years ago by Zahra ▴ 110

0

Entering edit mode

You included different samples, why would you expect the results to look the same?

ADD REPLY • link 3.0 years ago by swbarnes2 14k

1

Entering edit mode

You are right, they shouldn't be the same, but indeed I don't know which one is correct.

ADD REPLY • link 3.0 years ago by Zahra ▴ 110

1

Entering edit mode

So, basically you're asking whether you should compare 223 primary tumor samples vs. 41 normal samples or whether you should focus on the matched samples (41 vs 41) only?

While neither one is THE correct solution, I'd say that comparing matched samples is preferable, but you should make use of the fact that they're matched and add the patient as a co-variable into your design, e.g. ~ patient + group

ADD REPLY • link 3.0 years ago by Friederike 9.0k