cBioportal (http://www.cbioportal.org/index.do) revolutionized the way we look at genomics data. it offers a great interface to visualize TCGA data, integrate them, correlate alterations with survival etc etc. numerous investigators now utilize TCGA data to develop hypotheses or validate their results and the easiest interface to use is, indeed, cBioportal.
for the non-super-lay user, data are accessible also at TCGA firehose (https://gdac.broadinstitute.org/) but this tutorial is not for those data.
I have found online numerous people asking how to download full datasets from cBioportal and no clear answer is available. although cBioportal allows to download data from their website, there still has a limitation on the number of genes you can get.
therefore, the purpose of this short tutorial is to give you a quick n' dirty tweak to get full RNASeq data. the process is valid for other datasets too, this tutorial will show RNASeq data and leaves to the user the fun of expanding this to other data.
required stuff:
- R (https://www.r-project.org/)
- cgdsr R package (http://www.cbioportal.org/cgds_r.jsp)
- good music (www.pandora.com)
the first few steps are described in the link above, I add them here too just for completing the guide. so, load the library and create the cgdsr object
library(cgdsr)
mycgds <- CGDS("http://www.cbioportal.org/public-portal/")
then we need to chose which cancer study we want. I'll go with prostate cancer. the getCancerStudies() function will return a matrix with a list of the studies. the prostate cancer I want is in line 117, therefore:
prad_study <- getCancerStudies(mycgds)[117,1]
and I want to keep all patients:
pca_case_list <- getCaseLists(mycgds,prad_study)[1,1]
now, I need the list of all genes in the genome. one (of the million) ways of doing this is to get the RSEM_genes_normalized__data from the prostate cancer dataset at Firehose (see above) and extract the gene names from the first column. here I changed the dataset name to PRAD.txt (note, the following command is for UNIX/Mac terminal):
awk '{if(NR>1) print $1}' PRAD.txt | cut -d '|' -f1 | sort | uniq | grep -v "?" > gene_names.txt
now you can save this in a variable
rows_pca <- read.table("gene_names.txt")$V1
length(rows_pca)
[1] 20502
now the little trick... the cgdsr package allows you to download up to ~500 genes...so what we do it to download 500 genes at the time :)
i <- 1
while(i <= length(rows_pca)){
if(i == 1){
z_pca <- matrix()
}
k <- i+499
temp <- getProfileData(mycgds, rows_pca[i:k], "prad_tcga_rna_seq_v2_mrna_median_Zscores",pca_case_list)
message(paste("done for genes -->",i," to ",k))
if(dim(temp)[1] != 0){
z_pca <- cbind(z_pca, temp)
}
i <- k+1
}
now, data here have genes as columns and samples as rows, we transpose it and since there will be a number of rows with NAs, we can remove them.
z_pca <- t(z_pca)
rem_pca <- as.numeric(which(apply(z_pca,1,function(x) sum(is.na(x))) == ncol(z_pca)))
z_pca <- z_pca[-rem_pca,]
dim(z_pca)
[1] 18015 491
now, you don't need to do this last step, but if you want format the columns using the 12 digit code, you can by doing this:
colnames(z_pca) <- substr(colnames(z_pca),1,12)
and voila', now you have all the z-scores used in cBioportal :)
comments and feedback are always welcome!
Very helpful example, thanks.
Hi, I have one question about how to obtain case and control data from TCGA database.
you can't get normals form cBioportal, you need to use TCGA Firehose
But,please, dont forget these will be all processed datasets. cBioPortal also uses processed data.
they are not the same "processed". cBiportal gives you expression data in tumor compared to normal. TCGA firehose gives you level 3 TCGA RNASeq data (aligned, mapped, normalized RPKM/FPKM gene/exon/etc level), which cBioportal does not.
I take this error after execution of your script. Could you please help me about this?Thanks a lot! Error in file(file, "rt") : cannot open the connection In addition: Warning message: In file(file, "rt") : cannot open URL 'http://www.cbioportal.org/public-portal/webservice.do?cmd=getProfileData&gene_list=A1BG,A1CF,A2BP1,A2LD1,A2ML1,A2M,A4GALT,A4GNT,AAA1,AAAS,AACSL,AACS,AADACL2,AADACL3,AADACL4,AADAC,AADAT,AAGAB,AAK1,AAMP,AANAT,AARS2,AARSD1,AARS,AASDHPPT,AASDH,AASS,AATF,AATK,ABAT,ABCA10,ABCA11P,ABCA12,ABCA13,ABCA17P,ABCA1,ABCA2,ABCA3,ABCA4,ABCA5,ABCA6,ABCA7,ABCA8,ABCA9,ABCB10,ABCB11,ABCB1,ABCB4,ABCB5,ABCB6,ABCB7,ABCB8,ABCB9,ABCC10,ABCC11,ABCC12,ABCC13,ABCC1,ABCC2,ABCC3,ABCC4,ABCC5,ABCC6P1,ABCC6P2,ABCC6,ABCC8,ABCC9,ABCD1,ABCD2,ABCD3,ABCD4,ABCE1,ABCF1,ABCF2,ABCF3,ABCG1,ABCG2,ABCG4,ABCG5,ABCG8,ABHD10,ABHD11,ABHD12B,ABHD12,ABHD13,ABHD14A,ABHD14B,ABHD15,ABHD1,ABHD2,ABHD3,ABHD4,ABHD5,ABHD6,ABHD8,ABI1,ABI2,ABI3BP,ABI3,ABL1,ABL2,ABLIM1,ABLIM2,ABLIM3,ABO,ABP1,ABRA,ABR,ABT1,ABTB1,ABTB2,ACAA1,ACAA2,ACACA,ACACB,ACAD10,ACAD11,ACAD8,ACAD9,ACADL,ACADM,ACADSB,ACADS,ACADVL,ACAN,ACAP1,ACAP2,ACAP3,ACAT1,ACAT2,ACBD3,ACBD4,ACBD5,ACBD6,ACBD7,ACCN1,ACCN2,ACCN3,ACCN4,ACCN5,ACCSL,ACCS,ACD,ACE2,ACER1,ACER2,ACER3,A... <truncated>
I am new to the TCGA and in the comments above, you mention that this line is for UNIX/MAC operating systems:
This is such a great tutorial and I am wondering if there is an equivalent for the Windows environment. Also, are there R statements to download the RSEM_genes_normalized__data directly from the Broad (a line before writing them out to a file) and write out the file that would allow Windows users to finish the tutorial?
Hi There,
Thank you for putting up this tutorial. It is very helpful. However, I have errors that I need support with.
1.) The part with downloading all patients. I used the
pca_case_list <- getCaseLists(mycgds,prad_study)[1,1]
, here I get a result valueThen applying this command
I get the following error
This is certainly the fact that there is not getGeneticProfiles value in the getProfileData function. The matrix of the getGeneticProfiles is a matrix with many ids. I would appreciate if someone could guide me here.
2.) The second point is that the remove method in the following code
gives me the following error:
Kindly could someone provide me help as to what to do here in order to get the code working?
Thanks,
akshayb04
I was able to rectify one part of the error
Error in sumis.na(x) : could not find function "sumis.na"
Here the actual syntax for the code is
rem_pca <- as.numeric(which(apply(z_pca,1,function(x) sumis.na(x))) == ncol(z_pca)))
Also, I had forgotten to load/install package dplyr.
Thank you, my friend! You solved three weeks of headache.