Question

What is an optimal normalization approach for comparing GTEx RNASeq normal vs TCGA tumor samples?

9

Entering edit mode

9.2 years ago

Samir ▴ 210

I am trying to compare RNA-seq read counts from normal tissue samples from GTEx to tissue-level matched TCGA tumor samples. Since TCGA has also a few normals, I am trying to merge those adjacent tissue normals (for solid tumors) with matched tissue from GTEx. While merging normals from GTEx with TCGA normals, I want to remove batch effects (? library preparation, protocol and sequencing platform between GTEx and TCGA, assuming intra-group variation is minimal)and make GTEx samples serve as normals for matching TCGA tumor types.

For breast cancer set, I have taken read count level data for total of 72 normals (6 from TCGA BRCA, remaining GTEx) and 100 BRCA tumor samples, ran DESeq2 as per http://www.bioconductor.org/help/workflows/rnaseqGene/#construct In brief,

merge count level data from GTEx and TCGA, keep only matching genes (gencode v19) in both sets.
sample info has group factor with two levels: gtex (66) and pcawg (106) and sample_type factor with two levels: normal (72) and tumor (100).
My DESeqDataset is like this.


dds <- DESeqDataSetFromMatrix(countData = pcawg_data,
               colData = pcawg_id,
               design = ~ group + sample_type)

dds
nrow(dds)

dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)

rld <- rlog(dds, blind = FALSE, fast = TRUE)
head(assay(rld), 3)

dds <- DESeq(dds, parallel = T)

PCA plot following rlog transform were not able fix batch effect and TCGA normals are far distant from GTEx normals.

I also tried correcting batch effect using surrogate variable analysis package, SVA as per http://www.bioconductor.org/help/workflows/rnaseqGene/#batch but no luck.


mod <- model.matrix(~ sample_type, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv=2)

This is true for a few other tumor types too.

I have read few of recently published best practices for RNA-seq normalization and will run RUVSeq for fpkm level data. However, this seems challenging to me given normal samples originating from two different studies and no common samples in between. Good to get comments from experts here and get optimal normalization approach, if any for these datasets.

Thanks,

Samir

RNA-Seq normalization DESeq2 SVA • 7.9k views

ADD COMMENT • link updated 7.4 years ago by mforde84 ★ 1.4k • written 9.2 years ago by Samir ▴ 210

score 3 · Answer 1 · 2017-07-05

3

Entering edit mode

7.4 years ago

syrttgump ▴ 50

I found this: https://github.com/mskcc/RNAseqDB Which is a datasets of normalized TCGA and GTEx RNA-Seq data.

ADD COMMENT • link 7.4 years ago by syrttgump ▴ 50

score 0 · Answer 2 · 2017-02-21

0

Entering edit mode

7.8 years ago

max.amicone • 0

Hello Samir! I'm facing the same problem you posted here. Did you find any optimal solution to yours? Thank you in advance, Massimo.

ADD COMMENT • link 7.8 years ago by max.amicone • 0

score 0 · Answer 3 · 2017-07-05

I recently did some normalization work on the harmonized COAD dataset. Granted it's just TCGA data, but the principles should the same when adding samples from other studies.

normalize_counts <- function(raw_counts, batch){
    require(edgeR)
    require(limma)
    y <- DGEList(counts = raw_counts)
    y <- y[!rowSums(y$counts == 0) == ncol(raw_counts),] 
    A <- aveLogCPM(y)
    y2 <- y[A>1,]
    y3 <- calcNormFactors(y2, method = "TMM")
    dge <- voomWithQualityWeights(y3, normalization="quantile", plot=FALSE)
    rbe <- removeBatchEffect(dge, batch)
    return(rbe)
}

Maybe this might work.