Question

Is it possible to download cancer, normal, and matched tissue data from GDC? From same patient samples? How?

8.1 years ago

fr ▴ 210

I'm trying to download gene expression data from GDC. Ideally, I'd like to get normal, cancer, and matched data for the same patients.

How can I tell whether a given gene expression file comes from a tumor sample or a matched normal sample? (I am aware that some TCGA barcodes used to encode this information.)

Is this easy to do in Matlab or Perl, or is the GDC Data Transfer Tool the best way?

Thanks in advance

score 5 · Accepted Answer · 2016-10-21

Hello,

The best way to match tumor and normal samples to a specific patient is through the GDC API. The API can be used to output a tab-delimited (TSV) file that associates file name, file id (UUID), sample type (tumor / normal), and patient ID (TCGA barcode).

The general query structure will look like this:

curl "https://gdc-api.nci.nih.gov/files?size=100000&format=tsv&filters=XXXX&fields=YYYY"

filters= :

Replace XXXX with a URL-encoded JSON query that filters the list of files. You can write the JSON query yourself and then URL-encode it, or you can apply the filters on the GDC Data Portal (https://gdc-portal.nci.nih.gov/search/s?facetTab=files) and copy the string that appears in place of the asterisks in the resulting URL:

https://gdc-portal.nci.nih.gov/search/f?filters=**********&facetTab=cases

(Note: sometimes the URL-encoded query goes all the way to the end of the URL, so copy up until the & or the end)
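
If you build the filter by hand, the URL-encoding step could look like the minimal Python sketch below. It uses the nested op/content form of GDC filters; the project ID (TCGA-BRCA) and the data type value are assumptions chosen for illustration and are not part of the original answer.

```python
import json
from urllib.parse import quote, unquote

# Illustrative filter: the project and data type are assumptions; substitute your own.
filters = {
    "op": "and",
    "content": [
        {"op": "in", "content": {"field": "cases.project.project_id",
                                 "value": ["TCGA-BRCA"]}},
        {"op": "in", "content": {"field": "files.data_type",
                                 "value": ["Gene Expression Quantification"]}},
    ],
}

# Serialize to compact JSON, then percent-encode every reserved character
# (safe="" means "/" is encoded as well).
encoded_filters = quote(json.dumps(filters, separators=(",", ":")), safe="")

# Decoding recovers the original JSON, which is also a convenient way to
# inspect a string copied from a portal URL.
decoded = json.loads(unquote(encoded_filters))
print(encoded_filters)
```

The value of encoded_filters is what would take the place of XXXX in the query.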

fields= :

The YYYY will be replaced with a comma-separated list of fields that will be generated as columns in your TSV. To get the aforementioned fields you'll need:

file_name : the name of the file

file_id : the UUID of the file

cases.samples.sample_type : the tumor/normal status of the sample

cases.submitter_id : the patient TCGA barcode

So it will be organized like this: file_name,file_id,cases.samples.sample_type,cases.submitter_id
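
As a rough sketch, the complete query URL with these four fields could be assembled in Python as follows. The endpoint is the one quoted in this answer; the helper name build_files_url and the single-project filter are hypothetical and only serve the example.

```python
import json
from urllib.parse import urlencode

# Endpoint as given in this answer; adjust it if the API host has changed.
FILES_ENDPOINT = "https://gdc-api.nci.nih.gov/files"

def build_files_url(filters, fields, size=100000):
    """Return a /files query URL that requests a TSV with the given columns."""
    params = {
        "size": size,
        "format": "tsv",
        "filters": json.dumps(filters),  # urlencode percent-encodes the JSON
        "fields": ",".join(fields),
    }
    return f"{FILES_ENDPOINT}?{urlencode(params)}"

fields = ["file_name", "file_id", "cases.samples.sample_type", "cases.submitter_id"]

# Placeholder filter (an assumption for illustration): a single TCGA project.
example_filters = {
    "op": "in",
    "content": {"field": "cases.project.project_id", "value": ["TCGA-BRCA"]},
}

url = build_files_url(example_filters, fields)
print(url)
```

The printed URL can be passed to curl in quotes, as in the general query structure shown earlier, and the output redirected to a .tsv file.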

More fields can be added to this and are listed here: https://gdc-docs.nci.nih.gov/API/Users_Guide/Appendix_A_Available_Fields/#file-fields

For more information about the API, see the API Users Guide here: https://gdc-docs.nci.nih.gov/API/Users_Guide/Getting_Started/

Data Transfer Tool

Once you get the TSV file containing the files that you are looking for and their metadata, you can download them using the GDC Data Transfer Tool (DTT). The UUID column in the TSV can be used as a manifest by extracting it into a plain text file with the header "id" at the top. Then run the DTT with the command "gdc-client download -m <manifest_file_name>".
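
One possible way to go from the TSV to a manifest restricted to patients with both tumor and normal files is sketched below in Python. Several things here are assumptions: the rows in example_tsv are made up, the column headers in a real download may be flattened differently (so the code matches on the field-name suffix), and tumor/normal status is detected by looking for the words "Tumor" and "Normal" in the sample type, which would miss types such as "Metastatic".

```python
import csv
import io
from collections import defaultdict

def matched_file_ids(tsv_text):
    """Return file UUIDs for patients that have both tumor and normal samples."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")

    def column(suffix):
        # Assumption: real headers may carry prefixes or indices, so match
        # on the end of the column name rather than the full name.
        return next(name for name in reader.fieldnames if name.endswith(suffix))

    id_col = column("file_id")
    type_col = column("sample_type")
    case_col = column("submitter_id")

    # Group (sample_type, file_id) pairs by patient barcode.
    by_patient = defaultdict(list)
    for row in reader:
        by_patient[row[case_col]].append((row[type_col], row[id_col]))

    selected = []
    for samples in by_patient.values():
        types = {sample_type for sample_type, _ in samples}
        has_tumor = any("Tumor" in t for t in types)
        has_normal = any("Normal" in t for t in types)
        if has_tumor and has_normal:
            selected.extend(file_id for _, file_id in samples)
    return selected

def write_manifest(file_ids, path):
    """Write a one-column manifest with the header "id"."""
    with open(path, "w") as handle:
        handle.write("id\n" + "\n".join(file_ids) + "\n")

# Hypothetical rows for illustration only; these are not real GDC records.
example_tsv = (
    "file_name\tfile_id\tcases.samples.sample_type\tcases.submitter_id\n"
    "a.txt\tuuid-1\tPrimary Tumor\tTCGA-XX-0001\n"
    "b.txt\tuuid-2\tSolid Tissue Normal\tTCGA-XX-0001\n"
    "c.txt\tuuid-3\tPrimary Tumor\tTCGA-XX-0002\n"
)
ids = matched_file_ids(example_tsv)
```

Calling write_manifest(ids, "manifest.txt") would then produce a file that can be passed to gdc-client with the -m option.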

The DTT can be downloaded here: https://gdc.cancer.gov/access-data/gdc-data-transfer-tool

and you can read about its functionalities here: https://gdc-docs.nci.nih.gov/Data_Transfer_Tool/Users_Guide/Getting_Started/

I hope this helps. Please follow up on this thread if you (or anyone else) have any questions about using the GDC API or the DTT.

Best,

-Bill - GDC User Services Team