Nearly all TCGA data/results are available through the Broad Institute's Firehose pipelines. Get the raw data/results from the Broad GDAC site (gdac.broadinstitute.org), or browse the web-based UIs at MSKCC's cbioportal.org or Broad's firebrowse.org.
NOTE: This is just a temporary solution, while I figure out how to use the GDC via CLI. :)
There is a convenient command-line script, firehose_get, for downloading raw data/results. Download it as follows:
mkdir scripts
curl -o scripts/firehose_get_latest.zip http://gdac.broadinstitute.org/runs/code/firehose_get_latest.zip
unzip -d scripts scripts/firehose_get_latest.zip
Here is how to use that tool to download the normalized per-gene expression estimates from RNA-seq data:
./scripts/firehose_get -b -only Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data data latest
It creates a folder structure with gzipped tarballs in separate tumor-type subfolders. Unpack all the tarballs:
mkdir rna_seq
for file in stddata__*/*/*/*RSEM*.Level_3*.tar.gz; do tar -zxf "$file" -C rna_seq; done
Rename the resulting subfolders to just the tumor type codes, using some in-line Perl and bash:
ls -d rna_seq/gdac* | perl -ne 'chomp; ($t)=m/gdac.broadinstitute.org_(\w+)/; print "mv $_ rna_seq/$t\n"' | bash
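If Perl is not available, the same rename can be sketched in plain shell. This is only a rough equivalent of the one-liner above: rename_cohort_dirs is a hypothetical helper name (not part of firehose_get), and it assumes the extracted folders are named gdac.broadinstitute.org_<COHORT>.<rest>, which is what the Perl pattern relies on as well. Unlike the one-liner, it skips a folder when the target name already exists.

```shell
# Sketch: rename extracted Firehose folders to bare cohort codes without Perl.
# rename_cohort_dirs is a hypothetical helper, not part of firehose_get.
# The body runs in a subshell ( ... ) so its variables do not leak out.
rename_cohort_dirs() (
    root=${1:-rna_seq}
    for dir in "$root"/gdac.broadinstitute.org_*/; do
        [ -d "$dir" ] || continue                  # glob matched nothing
        dir=${dir%/}                               # drop the trailing slash
        name=${dir##*/}                            # folder name without the path
        cohort=${name#gdac.broadinstitute.org_}    # strip the fixed prefix
        cohort=${cohort%%.*}                       # keep text before the first dot
        if [ -e "$root/$cohort" ]; then
            echo "skipping $name: $root/$cohort already exists" >&2
            continue
        fi
        mv -- "$dir" "$root/$cohort"
    done
)
```

Call it as rename_cohort_dirs rna_seq (the argument defaults to rna_seq if omitted).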
Delete the separate colon/rectal cohorts, leaving behind only the combined cohort COADREAD:
rm -rf rna_seq/{COAD,READ}
There are also KIPAN (KICH+KIRC+KIRP) and GBMLGG (GBM+LGG), but keep them, they're interesting.
The per-gene RNA-expression estimates are now in these files:
rna_seq/*/*.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.data.txt
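As a quick sanity check on the result, something like the following could count the sample columns in each cohort's table. It is a sketch under assumptions: count_samples is a made-up helper name, and it assumes each file is tab-separated with a first header line holding one label column followed by one column per sample. Check the top of one of your own files before trusting the numbers.

```shell
# Sketch: print "<cohort><TAB><number of sample columns>" for each table.
# count_samples is a hypothetical helper; the header layout is an assumption.
# The body runs in a subshell ( ... ) so its variables do not leak out.
count_samples() (
    root=${1:-rna_seq}
    for f in "$root"/*/*RSEM_genes_normalized__data.data.txt; do
        [ -f "$f" ] || continue       # glob matched nothing
        cohort=${f#"$root"/}          # strip the leading root folder
        cohort=${cohort%%/*}          # keep only the cohort folder name
        # Fields on the first line, minus the leading label column.
        n=$(awk -F '\t' 'NR == 1 { print NF - 1; exit }' "$f")
        printf '%s\t%s\n' "$cohort" "$n"
    done
)
```

Running count_samples rna_seq prints one line per cohort folder, which also makes it easy to spot a cohort whose tarball failed to unpack.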
I agree that the transition is very confusing, not least because of the way the gdc-portal displays data files for downloading.
Anyway, have you tried Firehose? On the landing page, in the row for COAD, under the Data column, click Browse. The pop-up window that opens should give you what you are looking for.
The file naming is a bit different now, but you should be able to work it out. I haven't used the R library, but the Firehose site has its own download client (similar to wget). There is an R package described there as well.
Thanks for the suggestion. I will give Firehose a try.
Hello Amit, do we get access to protected data in Firebrowse?
Nope. I think you would need the GDC for that.