Hi Everyone,
I am trying to analyze TCGA-BRCA (harmonized) data using TCGAbiolinks and I am stuck at the DEA stage where I get the following error:
Error in dataFilt.brca[, dataFiltNT] : subscript out of bounds
I can't seem to pinpoint where and what is giving me this error. I have attached my code which I put together trying to teach myself using tutorials. What I want to do is: compare expression of my gene of interest (SOX10) between different PAM50 subtypes. The thing is, when I do DE on normal versus tumor, it shows that SOX10 is downregulated. However, when I try to visualize SOX10 expression on UCSC (even cBioportal) I see that it is upregulated in basal tumors compared to normal (or normal-like, I'm so confused?!!). I specifically want to show that even though SOX10 is downregulated in pooled subtypes when compared to normal, it is upregulated if only basal subtype is considered. Also, I tried doing multiple constrasts using the contrast.matrix function but it gave me a whole lot of errors so I gave up.
Is there a simpler way to do this? I really need to show stats on a nice boxplot of Log2FC versus PAM50 subtype! I also want to further stratify each Pam50 into Her2+ and Her2-
Please help, and thank you!!
TCGA BRCA - SOX10
library(TCGAbiolinks)
library(ggplot2)
library(dplyr)
library(DESeq2)
Cancer <- "TCGA-BRCA"
query_BRCA <- GDCquery(project = Cancer,
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts")
samples_BRCA <- getResults(query_BRCA,cols=c("cases"))
queryDown_brca <- GDCquery(project = Cancer,
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
barcode = samples_BRCA)
GDCdownload(query = queryDown_brca, method = "api", files.per.chunk = 50)
dataPrep1.brca <- GDCprepare(query = queryDown_brca,
save = TRUE,
save.filename = "TCGA_BRCA_HTSeq_Counts.rda")
dataPrep.brca <- TCGAanalyze_Preprocessing(object = dataPrep1.brca,
cor.cut = 0.6,
datatype = "HTSeq - Counts")
dataNorm.brca <- TCGAanalyze_Normalization(tabDF = dataPrep.brca,
geneInfo = geneInfoHT,
method = "gcContent")
dataNorm.brca <- TCGAanalyze_Normalization(tabDF = dataNorm.brca,
geneInfo = geneInfoHT,
method = "geneLength")
dataFilt.brca <- TCGAanalyze_Filtering(tabDF = dataNorm.brca,
method = "quantile",
qnt.cut = 0.25)
dim(dataFilt.brca)
### Prepping for DEA ###
sampleTP <- TCGAquery_SampleTypes(barcode = colnames(dataFilt.brca), typesample = "TP")
sampleNT <- TCGAquery_SampleTypes(barcode = colnames(dataFilt.brca), typesample = "NT")
dataFiltTP <- dataFilt.brca[,sampleTP]
dataFiltNT <- dataFilt.brca[,sampleNT]
dataSubt_BRCA <-TCGAquery_subtype(tumor = "BRCA")
dataSubt_BRCA$BRCA_Subtype_PAM50
samplePam50_BRCA.LumA <- dataSubt_BRCA[dataSubt_BRCA$BRCA_Subtype_PAM50 %in% "LumA",]
samplePam50_BRCA.LumA$BRCA_Subtype_PAM50
dataFiltTP.LumA <- dataFiltTP[,substr(colnames(dataFiltTP),1,12) %in% samplePam50_BRCA.LumA$patient]
dim(dataFiltNT)
dim(dataFiltTP.LumA)
DEG_LumA_NT <- TCGAanalyze_DEA(mat1 = dataFilt.brca[,dataFiltNT],
mat2 = dataFilt.brca[,dataFiltTP.LumA],
pipeline="limma",
Cond1type = "Normal",
Cond2type = "Tumor",
fdr.cut = 0.05 ,
logFC.cut = 1,
voom=TRUE,
method = "glmLRT", ClinicalDF = data.frame())
Error in dataFilt.brca[, dataFiltNT] : subscript out of bounds
There are issues, generally, with the functionality of
TCGAanalyze_DEA()
. It was first elaborated here: Matched paired DEA analysis of BRCA samples using TCGAbiolinks, works for paired = FALSE but not paired = TRUEI contacted the main developer, who then forwarded it to another developer responsible for that part of the package, and nobody ever got back to me. I then raised a ticket (issue) on GitHub about it but none of the developers responded.
If I may suggest that you obtain the data directly from the GDC and process it from there.
Hi Kevin, is there any update on this issue? i am also facing the same issue [Error in dataFilt[, samples] : subscript out of bounds)]. Thanks in advance.
Nope, there was never a response: https://github.com/BioinformaticsFMRP/TCGAbiolinks/issues/276
As I'm now a developer myself, I know that it can genuinely be difficult to find time to follow up to these issues. However, in the case of the 'paired' analysis issue, TCGAbiolinks is actually providing misleading results.
If you can, I would get the data from Xena Browser and then re-process it. If you're just interested in a single gene, you could also use cBioPortal (GUI)
Did you try the code in my other linked thread?
No, I did not try, yet. But, I'll try to give it a go. I was working on manually curated datasets (patient IDs divided based on clinical features). I tried codes on several different sets, sometime it works, sometimes it doesn't.
I have tried the TCGAutils R package it worked for me. They have provided functions for multi-way ID/UUID/barcode conversion. Have a look once. http://bioconductor.org/packages/release/bioc/vignettes/TCGAutils/inst/doc/TCGAutils.html#tcga-barcode-to-uuid