I am trying to compare RNA-seq read counts from normal tissue samples from GTEx to tissue-level matched TCGA tumor samples. Since TCGA has also a few normals, I am trying to merge those adjacent tissue normals (for solid tumors) with matched tissue from GTEx. While merging normals from GTEx with TCGA normals, I want to remove batch effects (? library preparation, protocol and sequencing platform between GTEx and TCGA, assuming intra-group variation is minimal)and make GTEx samples serve as normals for matching TCGA tumor types.
For breast cancer set, I have taken read count level data for total of 72 normals (6 from TCGA BRCA, remaining GTEx) and 100 BRCA tumor samples, ran DESeq2 as per http://www.bioconductor.org/help/workflows/rnaseqGene/#construct In brief,
-
merge count level data from GTEx and TCGA, keep only matching genes (gencode v19) in both sets.
-
sample info has
group
factor with two levels: gtex (66) and pcawg (106) andsample_type
factor with two levels: normal (72) and tumor (100). -
My DESeqDataset is like this.
dds <- DESeqDataSetFromMatrix(countData = pcawg_data,
colData = pcawg_id,
design = ~ group + sample_type)
dds
nrow(dds)
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)
rld <- rlog(dds, blind = FALSE, fast = TRUE)
head(assay(rld), 3)
dds <- DESeq(dds, parallel = T)
- PCA plot following rlog transform were not able fix batch effect and TCGA normals are far distant from GTEx normals.
- I also tried correcting batch effect using surrogate variable analysis package, SVA as per http://www.bioconductor.org/help/workflows/rnaseqGene/#batch but no luck.
mod <- model.matrix(~ sample_type, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv=2)
This is true for a few other tumor types too.
I have read few of recently published best practices for RNA-seq normalization and will run RUVSeq for fpkm level data. However, this seems challenging to me given normal samples originating from two different studies and no common samples in between. Good to get comments from experts here and get optimal normalization approach, if any for these datasets.
Thanks,
Samir