Hello,
The best way to match tumor and normal samples to a specific patient is through the GDC API. The API can be used to output a tab-delimited (TSV) file that associates file name, file id (UUID), sample type (tumor / normal), and patient ID (TCGA barcode).
The general query structure will look like this:
curl "https://gdc-api.nci.nih.gov/files?size=100000&format=tsv&filters=XXXX&fields=YYYY"
filters= :
Replace XXXX with a URL-encoded JSON query that filters the list of files. You can write the JSON query yourself and then URL-encode it, or you can apply the filters on the GDC Data Portal (https://gdc-portal.nci.nih.gov/search/s?facetTab=files) and copy the string that appears where the asterisks are shown in the following URL:
https://gdc-portal.nci.nih.gov/search/f?filters=**********&facetTab=cases
(Note: sometimes the URL-encoded query runs all the way to the end of the URL, so copy up to the & or, if there is none, to the end.)
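If you would rather write the JSON query yourself, the encoding step can be sketched in Python roughly as follows. The nested "op"/"content" structure follows the GDC API filter syntax; the project ID and data type values are illustrative assumptions that do not come from this thread, so check field names and values against the field list before relying on them.

```python
import json
from urllib.parse import quote, unquote

# Illustrative filter (assumed values): files from one project AND one data type.
filters = {
    "op": "and",
    "content": [
        {"op": "in", "content": {"field": "cases.project.project_id",
                                 "value": ["TCGA-BRCA"]}},
        {"op": "in", "content": {"field": "data_type",
                                 "value": ["Gene Expression Quantification"]}},
    ],
}

# Serialize to compact JSON, then percent-encode every reserved character
# so the string can be dropped in place of XXXX.
encoded_filters = quote(json.dumps(filters, separators=(",", ":")), safe="")
print(encoded_filters)

# Decoding gives back the original query, which is a quick sanity check.
assert json.loads(unquote(encoded_filters)) == filters
```

To query several cancer types one by one, one option is to loop over a list of project IDs and rebuild the filter with a different "value" each time.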
fields= :
Replace YYYY with a comma-separated list of fields; each field becomes a column in your TSV. To get the fields mentioned above, you'll need:
file_name : the name of the file
file_id : the UUID of the file
cases.samples.sample_type : the tumor/normal status of the sample
cases.submitter_id : the patient TCGA barcode
So it will be organized like this: file_name,file_id,cases.samples.sample_type,cases.submitter_id
More fields can be added to this and are listed here: https://gdc-docs.nci.nih.gov/API/Users_Guide/Appendix_A_Available_Fields/#file-fields
For more information about the API, see the API Users Guide here: https://gdc-docs.nci.nih.gov/API/Users_Guide/Getting_Started/
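Putting the filters= and fields= pieces together, the full query URL could be assembled along these lines. This is only a sketch: the endpoint and the four fields are the ones quoted in this post, while the function name build_files_query and the single-project filter are hypothetical and used purely for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint as quoted in this post.
API_FILES_ENDPOINT = "https://gdc-api.nci.nih.gov/files"

fields = ["file_name", "file_id", "cases.samples.sample_type", "cases.submitter_id"]

def build_files_query(filters_json, fields, size=100000):
    """Return a /files query URL; urlencode percent-encodes each value."""
    params = {
        "size": size,
        "format": "tsv",
        "filters": filters_json,       # raw JSON; encoded by urlencode below
        "fields": ",".join(fields),    # comma-separated list of TSV columns
    }
    return API_FILES_ENDPOINT + "?" + urlencode(params)

# Hypothetical filter: files belonging to a single project.
filters_json = ('{"op":"in","content":{"field":"cases.project.project_id",'
                '"value":["TCGA-BRCA"]}}')
url = build_files_query(filters_json, fields)
print(url)
```

The printed URL can be passed to curl inside double quotes and the output redirected to a .tsv file.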
Data Transfer Tool
Once you have the TSV file listing the files you are looking for and their metadata, you can download the files with the GDC Data Transfer Tool (DTT). To use the UUID column (file_id) as a manifest, extract it into a plain text file with the header "id" at the top, then run the DTT with the command "gdc-client download -m <manifest_file_name>".
The DTT can be downloaded here: https://gdc.cancer.gov/access-data/gdc-data-transfer-tool
and you can read about its functionalities here: https://gdc-docs.nci.nih.gov/Data_Transfer_Tool/Users_Guide/Getting_Started/
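The manifest step described above (pull out the UUID column and put "id" at the top) might look like this in Python. It is a sketch under assumptions: the UUID column is assumed to be named file_id in your TSV header, the function name manifest_from_tsv is hypothetical, and the sample rows are made up, so adjust the column name to whatever your file actually contains.

```python
import csv
import io

def manifest_from_tsv(tsv_text, uuid_column="file_id"):
    """Return manifest text: an 'id' header followed by one UUID per line."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    uuids = [row[uuid_column].strip() for row in reader]
    # Drop blanks and duplicates while keeping the original order.
    unique = list(dict.fromkeys(u for u in uuids if u))
    return "\n".join(["id"] + unique) + "\n"

# Made-up rows standing in for the TSV returned by the API.
sample_tsv = (
    "file_name\tfile_id\tsample_type\tsubmitter_id\n"
    "a.txt\t00000000-0000-0000-0000-000000000001\tPrimary Tumor\tTCGA-XX-0001\n"
    "b.txt\t00000000-0000-0000-0000-000000000002\tSolid Tissue Normal\tTCGA-XX-0001\n"
)
manifest = manifest_from_tsv(sample_tsv)
print(manifest, end="")
```

Writing the returned text to a file gives you something to pass to "gdc-client download -m <manifest_file_name>".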
I hope this helps. Please follow up on this thread if you (or anyone else) have any questions about using the GDC API or the DTT.
Best,
-Bill - GDC User Services Team
Thank you so much. I was already able to generate the sort of metadata file that I needed and to download many files of interest. I would like to ask one more thing, though: could you provide some more input on how to generate the filters as JSON? For instance, if I want to retrieve gene expression data (FPKMs) for different cancer types, one by one, how can I do it?
I am aware that one way to do it is to select all these cancer types from the GDC Data Portal, and then batch download them the same way as described.
Thanks again for your previous input and thanks in advance!