Hi all, I would like to download the bulk RNA-seq data for all patients in the TCGA-LUAD cohort using TCGAbiolinks. Does this exist as a single matrix?
I have read the package vignette and can download individual cases. However, does TCGAbiolinks support downloading a single matrix covering all of the patients?
I ask because the equivalent data from the Xena browser can be downloaded as a single 585-column matrix.
I tried this with TCGAbiolinks:
test <- GDCquery(project = "TCGA-LUAD", data.category = "Gene expression", data.type = "Gene expression quantification", platform = "Illumina HiSeq", file.type = "results", legacy = TRUE)
dim(getResults(test))
This results in 600 files.
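I used the code below to check whether one file was much bigger than the others (i.e. a combined matrix), but none is, so the 600 files appear to be individual files rather than one combined matrix:
library(dplyr)  # provides %>%, arrange(), filter() and select()
getResults(test) %>% arrange(desc(file_size)) %>% head(10)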
Finally, I looked into the duplicated cases. Some cases have a file for both cancer and normal tissue (this is OK), but other patients have 2 or 3 files, all for cancer tissue. Which file should I choose?
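res <- getResults(test)  # fetch the results table once instead of on every iteration
# unique() avoids printing the same case twice when it has three or more files
dups <- unique(res[duplicated(res[, "cases.submitter_id"]), "cases.submitter_id"])
for (i in seq_along(dups)) {  # seq_along() is safe when dups is empty
  print(i)
  print(res %>% filter(cases.submitter_id == dups[i]) %>% select(sample_type))
}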
Any help appreciated, thanks in advance
Thanks, I managed to download the whole matrix using this. There are still duplicated entries (e.g. more than two tumour samples for the same patient) with no obvious rationale for which to delete, but at least I have the whole matrix now. Thanks.
(Apologies, this should be a reply to the answer above, but I can't seem to get the reply function to work.)
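For the duplicated tumour samples mentioned above, one possible workaround is sketched below. It is only an illustration, not an official TCGAbiolinks rule: it assumes the usual TCGA barcode layout (characters 1-12 identify the patient, characters 14-15 the sample type, with "01" for primary tumour and "11" for solid tissue normal) and then applies an arbitrary but reproducible tie-break, keeping the barcode that sorts last within each patient and sample type. The barcodes are made up, and the tie-break itself is an assumption that should be checked against your own analysis needs.

```r
# Toy aliquot barcodes, invented for illustration (not real TCGA-LUAD cases).
barcodes <- c(
  "TCGA-AA-0001-01A-11R-A000-07",
  "TCGA-AA-0001-01A-12R-A111-07",  # second tumour aliquot, same patient
  "TCGA-AA-0001-11A-11R-A000-07",  # matched normal, a different sample type
  "TCGA-BB-0002-01A-11R-A000-07",
  "TCGA-BB-0002-01B-11R-A000-07"   # second tumour vial, same patient
)

patient     <- substr(barcodes, 1, 12)   # e.g. "TCGA-AA-0001"
sample_code <- substr(barcodes, 14, 15)  # "01" primary tumour, "11" solid tissue normal

# Assumed tie-break: within each patient + sample type, keep the barcode
# that sorts last. Sorting in decreasing order puts that barcode first in
# its group, so !duplicated() retains it and drops the rest.
ord  <- order(patient, sample_code, barcodes, decreasing = TRUE, method = "radix")
key  <- paste(patient[ord], sample_code[ord])
keep <- sort(barcodes[ord][!duplicated(key)], method = "radix")

print(keep)
```

If the columns of the expression matrix are named by these barcodes, something like expr[, colnames(expr) %in% keep] (expr being a hypothetical name for that matrix) would then drop the extra aliquots while keeping tumour and normal samples for the same patient.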
Are you not able to use the ADD COMMENT button?
Hi, yes, it seems to be working now - thanks