From TCGA to GDC (Genomic Data Commons)
8.4 years ago

Hello,

I was using TCGA data for colon adenocarcinoma (COAD), specifically from the "IlluminaGA_RNASeqV2" and "IlluminaHiSeq_RNASeqV2" platforms.

Level 3 data were available for COAD on those platforms, and I was using the raw counts from the rsem.genes.results files.

Now that TCGA has moved under the Genomic Data Commons (GDC), I'm struggling to retrieve the same information. I would like to understand how to download from https://gdc-portal.nci.nih.gov/ the same data that were available from TCGA.

I was using TCGAbiolinks, but it no longer seems to work. Any suggestions for an R library to import GDC data?

Thanks

R TCGA gdc • 9.0k views

I agree that the transition is very confusing, not least because of the way the gdc-portal displays data files for downloading.

Anyway, have you been to Firehose? On the landing page, in the row for COAD, click Browse under the Data column. The pop-up window that opens should give you what you are looking for.

The file naming is a bit different now, but you should be able to make it out. I haven't used the R library, but the Firehose site has its own client (like a wget). There is an R package described as well.


Thanks for the suggestion. I will give Firehose a try.


Hello Amit, do we get access to protected data in FireBrowse?


Nope. I think that would be GDC.

8.4 years ago

Nearly all TCGA data/results can be found at Broad Institute's Firehose pipelines. Get raw data/results here or browse the web-based UIs at MSKCC's cbioportal.org or Broad's firebrowse.org.

NOTE: This is just a temporary solution, while I figure out how to use the GDC via CLI. :)

There is a convenient command-line script, firehose_get, to download raw data/results. Download it as follows:

mkdir scripts
curl -o scripts/firehose_get_latest.zip http://gdac.broadinstitute.org/runs/code/firehose_get_latest.zip
unzip -d scripts scripts/firehose_get_latest.zip

Here is how to use that tool to download the normalized per-gene expression estimates from RNA-seq data:

./scripts/firehose_get -b -only Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data data latest

It creates a folder structure with gzipped tarballs in separate tumor-type subfolders. Unpack all the tarballs:

mkdir rna_seq
for file in stddata__*/*/*/*RSEM*.Level_3*.tar.gz; do tar -zxf "$file" -C rna_seq; done

Rename the resulting subfolders to just the tumor type codes, using some in-line Perl and bash:

ls -d rna_seq/gdac* | perl -ne 'chomp; ($t)=m/gdac.broadinstitute.org_(\w+)/; print "mv $_ rna_seq/$t\n"' | bash
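
The same rename can be sketched in plain bash, without generating `mv` commands and piping them into a second shell. This is only an illustration: the function name `rename_cohort_dirs` is made up here, and it assumes the unpacked folders are named `gdac.broadinstitute.org_<COHORT>.<rest>`, which is the pattern the Perl one-liner matches. Unlike the one-liner, it skips a folder when the target name already exists.

```shell
#!/usr/bin/env bash
# Sketch: rename unpacked Firehose folders to their cohort code.
# Assumption: folder names look like gdac.broadinstitute.org_<COHORT>.<rest>
rename_cohort_dirs() {
    local root="$1" dir base cohort
    for dir in "$root"/gdac.broadinstitute.org_*/; do
        [ -d "$dir" ] || continue            # the glob matched nothing
        dir="${dir%/}"                       # drop the trailing slash
        base="${dir##*/}"                    # folder name without its path
        cohort="${base#gdac.broadinstitute.org_}"
        cohort="${cohort%%.*}"               # keep the text before the first dot
        if [ -e "$root/$cohort" ]; then
            echo "skipping $dir: $root/$cohort already exists" >&2
            continue
        fi
        mv -- "$dir" "$root/$cohort"
    done
}

# Usage (hypothetical): rename_cohort_dirs rna_seq
```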

Delete the separate colon/rectal cohorts, leaving behind only the combined cohort COADREAD:

rm -rf rna_seq/{COAD,READ}

There are also KIPAN (KICH+KIRC+KIRP) and GBMLGG (GBM+LGG), but keep them, they're interesting. The per-gene RNA-expression estimates are now in these files:

rna_seq/*/*.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.data.txt
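
To check what one of these files contains, a small awk helper can pull out a single gene. This is a sketch under assumptions that should be verified against a real file: the matrix is taken to be tab-delimited, with sample barcodes on the first line, a sub-header on the second line, and row identifiers of the form SYMBOL|EntrezID. The function name `gene_values` is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: print "barcode<TAB>value" for one gene from a merged expression matrix.
# Assumptions: line 1 = sample barcodes, line 2 = sub-header,
# first column of data rows = SYMBOL|EntrezID.
gene_values() {
    local symbol="$1" matrix="$2"
    awk -F'\t' -v sym="$symbol" '
        NR == 1 { for (i = 2; i <= NF; i++) sample[i] = $i; next }  # remember barcodes
        NR == 2 { next }                                            # skip the sub-header
        { split($1, id, "[|]") }                                    # id[1] is the gene symbol
        id[1] == sym { for (i = 2; i <= NF; i++) print sample[i] "\t" $i }
    ' "$matrix"
}

# Usage (hypothetical): gene_values KRAS rna_seq/COADREAD/*RSEM_genes_normalized__data.data.txt
```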

Hi Cyriac,

Do you know the difference between Illumina RNASeqV2, Illumina RNASeq and the current RNA-seq data on GDC?

I see several types in the link, but on the GDC portal there is only one kind of RNA-seq data.

Thanks


v2 reports RSEM, the other reports RPKM. GDC only reports RSEM.


Thanks.

Then what is the relationship between RSEM and HTSeq?

I saw that GDC has HTSeq-Counts, HTSeq-FPKM and HTSeq-FPKM-UQ.

I thought RSEM generates estimated expression values, and HTSeq gives the raw counts.

I am very confused about what data I am downloading...

It also looks like if I use the HTSeq files from the new GDC portal, I have to combine them myself, since each file is downloaded into its own separate folder. Has anyone seen a merged HTSeq file?
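
One way to combine per-sample count files into a single matrix can be sketched in bash and awk. The assumptions here are not confirmed by anything in this thread: each download folder is taken to hold one gzipped file ending in `.htseq.counts.gz`, with two tab-separated columns (gene ID, count). HTSeq's summary rows, which start with `__`, are dropped. The function name `merge_htseq_counts` is made up, and the column headers are just the folder names, so mapping them to TCGA barcodes would still need the GDC metadata.

```shell
#!/usr/bin/env bash
# Sketch: merge per-sample HTSeq count files into one gene-by-sample table.
# Assumptions: <root>/<folder>/<name>.htseq.counts.gz, two tab-separated columns.
merge_htseq_counts() {
    local root="$1" f
    for f in "$root"/*/*.htseq.counts.gz; do
        [ -f "$f" ] || continue
        # Emit "folder<TAB>gene<TAB>count", skipping HTSeq summary rows (__no_feature etc.)
        gzip -dc "$f" | awk -F'\t' -v s="$(basename "$(dirname "$f")")" \
            '$1 !~ /^__/ { print s "\t" $1 "\t" $2 }'
    done | awk -F'\t' '
        !($1 in seen_s) { seen_s[$1] = 1; samples[++ns] = $1 }   # sample order as read
        !($2 in seen_g) { seen_g[$2] = 1; genes[++ng] = $2 }     # gene order as first seen
        { count[$2, $1] = $3 }
        END {
            if (ns == 0) exit
            printf "gene_id"
            for (j = 1; j <= ns; j++) printf "\t%s", samples[j]
            printf "\n"
            for (i = 1; i <= ng; i++) {
                printf "%s", genes[i]
                for (j = 1; j <= ns; j++) {
                    key = genes[i] SUBSEP samples[j]
                    printf "\t%s", ((key in count) ? count[key] : "NA")
                }
                printf "\n"
            }
        }'
}

# Usage (hypothetical): merge_htseq_counts gdc_download > htseq_counts_matrix.tsv
```

The whole table is held in memory, which is fine for a sketch but may be slow for a full cohort; a gene missing from a sample is written as NA rather than 0.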


Sorry, I was wrong. GDC runs its own RNA-seq pipeline, defined here, which appears to report FPKM.


AFAIK, the old TCGA RNA-seq V1 analysis used BWA, and the new V2 analysis uses MapSplice. All V1 data was reprocessed as V2, so there should be only one kind of RNA-seq data (submitted by UNC).

8.3 years ago
tiagochst ▴ 70

TCGAbiolinks has been fixed to search, download and prepare data from the GDC data portal.

The new vignette is already on Bioconductor: https://www.bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/tcgaBiolinks.html

8.3 years ago
Mike ★ 1.9k

TCGAbiolinks has now been updated: the "TCGAquery" function has been replaced by "GDCquery".

The functions TCGAquery, TCGAdownload, TCGAprepare, TCGAquery_maf and TCGAquery_clinic were replaced by GDCquery, GDCdownload, GDCprepare, GDCquery_Maf and GDCquery_clinic.

It can access both the GDC and the GDC Legacy Archive.

https://www.bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/tcgaBiolinks.html#gdcquery-searching-tcga-open-access-data
