Now that TCGA has moved under the Genomic Data Commons (GDC), almost all previous users are struggling to retrieve the same information. This tutorial shows how to download TCGA data from the GDC.
Step 1. Obtaining a Manifest File for Data Download (the manifest is used to specify which data to download)
https://gdc-portal.nci.nih.gov/legacy-archive/search/f
Step 2. Install the download software: GDC Data Transfer Tool (Linux, Windows, macOS)
https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
Step 3.1 Downloading Data Using a Manifest File (gdc_manifest.lungCancer.txt)
gdc-client download -m gdc_manifest.lungCancer.txt
Step 3.2 Downloading a Single File Using a UUID (the UUID can be found in the manifest file)
gdc-client download 22a29915-6712-4f7a-8dba-985ae9a1f005
Step 3.3 Downloading Controlled Data (user authentication token is required)
gdc-client download -m gdc_manifest_controled.txt -t gdc-user-passwdcode.txt
FAQ:
1, ./gdc-client: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /tmp/_MEI5oSpPi/libz.so.1)
Answer: glibc 2.12 is the latest version available for CentOS 6. That means a CentOS 6 machine cannot be used to download the data with this binary (UCSD, TSCC).
2, How to download controlled data from GDC
3, Eventually, I asked the TSCC manager to help me install fastq-dump on TSCC
4, Downloads sometimes fail because of internet problems, but don't worry, just try again
Thanks for sharing. Could you please give some more detail about:
I downloaded expression data for TCGA-ESCA. There are 164 cases for this cancer, but there are 519 files with the extensions *.FPKM.txt.gz, *.FPKM-UQ.txt.gz and *.htseq.counts.gz. How do I map them to the Tumor ID / Aliquot ID and make an expression matrix (all tumor samples * all genes)?
I had a similar issue with mapping all file_ids to one case_id.
The GDC API Getting_Started page indicated that I can "expand" a section for the "cases" endpoint, and voilà, I got the case_id <===> file_id mapping:
Example (find all files available for case_id = 31bd8589-378c-40e5-8b7f-3b4c81f304be) :
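The example that originally followed here is not preserved. As a stand-in, here is a minimal Python sketch of one way to ask the GDC API for all files of a case. It takes a different route from the "expand" approach described above: it filters the files endpoint on cases.case_id. The function names, the chosen fields and the size value are my own assumptions for illustration, not something from the original post.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public GDC API files endpoint.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_case_files_query(case_id, size=1000):
    """Build URL query parameters asking for every file that belongs to one case."""
    filters = {
        "op": "in",
        "content": {"field": "cases.case_id", "value": [case_id]},
    }
    return {
        "filters": json.dumps(filters),
        # Assumed selection of fields; adjust to whatever you need.
        "fields": "file_id,file_name,data_type",
        "format": "JSON",
        "size": str(size),
    }


def fetch_case_files(case_id):
    """Send the query and return the list of hits (requires network access)."""
    url = GDC_FILES_ENDPOINT + "?" + urlencode(build_case_files_query(case_id))
    with urlopen(url) as response:
        return json.load(response)["data"]["hits"]
```

Calling fetch_case_files("31bd8589-378c-40e5-8b7f-3b4c81f304be") should then return one dictionary per file, each carrying the requested fields.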
1, The manifest is built from what you choose to download. It reflects whatever you selected in the first stage (anything added to the cart on the GDC website counts as selected), so it is not the same for all data types.
2, No, you cannot download the data matrix. You need to download all the files and then merge them with Perl, R, Python or C.
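For the merge step, here is a minimal Python sketch using only the standard library. It assumes each downloaded file is a gzip-compressed, tab-separated table with the gene ID in the first column and the value in the second, and that rows starting with "__" are HTSeq summary rows to be skipped. The function names and the "NA" placeholder are my own choices.

```python
import gzip


def read_counts(path):
    """Read one gzip-compressed two-column file into {gene_id: value}."""
    values = {}
    with gzip.open(path, "rt") as handle:
        for line in handle:
            gene_id, value = line.rstrip("\n").split("\t")[:2]
            if gene_id.startswith("__"):
                # Assumed HTSeq summary rows such as __no_feature.
                continue
            values[gene_id] = value
    return values


def merge_counts(sample_to_path):
    """Return (header, rows) for a gene-by-sample matrix; gaps become 'NA'."""
    samples = sorted(sample_to_path)
    per_sample = {s: read_counts(sample_to_path[s]) for s in samples}
    genes = sorted(set().union(*per_sample.values()))
    header = ["gene_id"] + samples
    rows = [[g] + [per_sample[s].get(g, "NA") for s in samples] for g in genes]
    return header, rows


def write_matrix(header, rows, out_path):
    """Write the merged matrix as a tab-separated text file."""
    with open(out_path, "w") as out:
        for row in [header] + rows:
            out.write("\t".join(row) + "\n")
```

The keys of sample_to_path become the column names, so that is where a sample or aliquot ID goes once each file has been mapped to one. Merge one file type at a time (for example only the *.htseq.counts.gz files) so the columns stay comparable.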
Thanks @Shicheng Guo;
Why are there three different types of files (*.FPKM.txt.gz, *.FPKM-UQ.txt.gz and *.htseq.counts.gz)? I have 519 directories for 164 cases, so how do I merge them? There should be 164 * 3 = 492. And how do I match a UUID to a TCGA sample ID?
These are three different files:
Fragment Count (HT-Seq) -> Gene Count -> Count Normalization -> FPKM -> Upper Quartile Normalization -> FPKM-UQ
https://gdc.nci.nih.gov/about-data/data-harmonization-and-generation/genomic-data-harmonization/high-level-data-generation/rna-seq-quantification
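To make the difference between the three files concrete, here is a small Python sketch of the two normalizations as I understand them from the GDC page linked above: FPKM divides a gene's read count by the gene length and the total protein-coding read count, while FPKM-UQ replaces that total with the sample's 75th percentile read count. The function and parameter names are my own, and the exact formula may differ between GDC pipeline versions, so treat this as an illustration only.

```python
def fpkm(gene_reads, gene_length_bp, protein_coding_reads):
    """FPKM: reads for the gene, scaled by gene length and total protein-coding reads."""
    return gene_reads * 1e9 / (protein_coding_reads * gene_length_bp)


def fpkm_uq(gene_reads, gene_length_bp, upper_quartile_reads):
    """FPKM-UQ: same shape, but scaled by the sample's 75th percentile read count."""
    return gene_reads * 1e9 / (upper_quartile_reads * gene_length_bp)
```

The *.htseq.counts.gz file holds the raw gene_reads values, and the other two files hold the results of these two scalings.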
But how do I map UUIDs to TCGA patient barcode IDs?
Hi! I had the same UUID to TCGA ID problem. I solved it by using R to write a JSON query that is then used on the command line.
I wrote a post about it here.
Hope it is useful!
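The linked post is not reproduced in this thread, so here is a minimal Python sketch of the same idea rather than the R code it describes: build a JSON payload for a POST request to the GDC files endpoint, then read the sample barcodes out of the response. The function names and the choice of fields (cases.submitter_id, cases.samples.submitter_id) are my assumptions, and the barcodes in the usage note are made up.

```python
def build_barcode_payload(file_ids):
    """Build the JSON body that asks the files endpoint for barcodes of the given UUIDs."""
    file_ids = list(file_ids)
    return {
        "filters": {
            "op": "in",
            "content": {"field": "files.file_id", "value": file_ids},
        },
        "format": "JSON",
        # Assumed fields: case-level and sample-level submitter IDs (TCGA barcodes).
        "fields": "file_id,cases.submitter_id,cases.samples.submitter_id",
        "size": str(len(file_ids)),
    }


def barcodes_from_response(response):
    """Turn a decoded JSON response into {file_id: [sample barcodes]}."""
    mapping = {}
    for hit in response["data"]["hits"]:
        mapping[hit["file_id"]] = [
            sample["submitter_id"]
            for case in hit.get("cases", [])
            for sample in case.get("samples", [])
        ]
    return mapping
```

The payload would be sent as JSON to https://api.gdc.cancer.gov/files with the header Content-Type: application/json. With a response containing one hit for file "f1" and sample "TCGA-XX-0001-01A", barcodes_from_response returns {"f1": ["TCGA-XX-0001-01A"]}.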
I really think something needs to be done about the gdc-client tool. I cannot install it on Mac OS X Sierra... I downloaded the tool more than five times, unpacked it, double-clicked it, and all I get is the same error as given below:
Musalulas-MacBook-Pro:~ sinkala$ /Users/sinkala/Downloads/gdc-client ; exit;
usage: gdc-client [-h] [--version] {download,upload,interactive} ...
gdc-client: error: too few arguments
logout
Saving session...
...copying shared history...
...saving history...truncating history files...
...completed.
[Process completed]
I have also tried the alternative ways of installing the thing, but I have not been successful either. I have tried to download the data directly from the data portal; even that does not work for a file smaller than 400 MB: the server does not respond or something like that. :( :(
The gdc-client is a command-line tool. You cannot just double-click on it. See https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/. If you have other problems, feel free to contact the gdc support staff: support@nci-gdc.datacommons.io.
Hi, I am new to BioStars so apologies for any syntax errors.
So after trying to start the gdc-client.exe application it presents with the following error (then disappears):
How to solve:
You need to run the program from the command line: the hideous interface summoned by typing 'cmd' into the Start menu. You first need to set a Path to the folder which contains the unzipped "gdc-client.exe" file.
Here is a guide: https://www.wikihow.com/Run-a-Program-on-Command-Prompt
After doing this you can start using commands for the gdc-client program. For example, type in the "gdc-client download" command followed by "-m" for manifest, then the file location:
If you don't know how to make a manifest, go here: http://www.andrewjanowczyk.com/download-tcga-digital-pathology-images-ffpe/
Finally
If you start getting error messages about there being 'no such file or directory', try dragging the manifest file into the same folder your gdc-client.exe application is in, then simply type in the command followed by the manifest file name:
Now that the file is in the folder your path is set to, it doesn't need the location specified. It will then start downloading (hopefully)!
Enjoy
Install download software: GDC Data Transfer Tool (Linux)
Please, is someone there? Could you help me? I'm having trouble installing gdc-client on Ubuntu 14. I downloaded the gdc zip and extracted it. After that, in the shell, I wrote ./ gdc-client but nothing happened.
You can always contact support@nci-gdc.datacommons.io for support.
Try using chmod to make the file executable (for example, chmod +x gdc-client) before running it.
I'm using gdc-client v1.2.0. I deliberately sorted my manifest file by patient ID so that I could download tumor-normal pair BAMs one after another. But in reality, the BAMs were downloaded in some random order, not from the top to the bottom of my sorted manifest file. Do other people have the same problem? Is there a way to fix it?
Hi, I have downloaded the GDC client tool to download files from the GDC. As I understand it, each download folder should contain the data (or zipped data) and a logs folder. My files were downloaded successfully. However, I see only a few logs folders. For example, the manifest file for Bladder Urothelial Carcinoma (BLCA) includes 433 UUIDs, but only 53 logs folders were found. Could you let me know whether the downloaded files are accurate?
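I cannot say why the tool reorders the downloads, but one possible workaround is to skip the -m option and call the single-UUID form from Step 3.2 once per manifest row, in manifest order. The Python sketch below assumes the manifest is a tab-separated file with an "id" column and that gdc-client is on the PATH. The function names are my own.

```python
import csv
import subprocess


def manifest_ids(manifest_path):
    """Return the UUIDs from the manifest's 'id' column, in file order."""
    with open(manifest_path, newline="") as handle:
        return [row["id"] for row in csv.DictReader(handle, delimiter="\t")]


def download_in_order(manifest_path, run=subprocess.run):
    """Download one UUID at a time so files arrive in manifest order."""
    for file_id in manifest_ids(manifest_path):
        # check=True stops at the first failed download instead of skipping it.
        run(["gdc-client", "download", file_id], check=True)
```

This is slower than a single manifest download because each file waits for the previous one, and controlled data would also need the -t token option added to the command list.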
Thank you.
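One way to check the downloads without relying on the logs folders is to compare every manifest row against the files on disk. The Python sketch below assumes the manifest is tab-separated with "id", "filename" and "md5" columns, and that gdc-client saved each file as <download folder>/<id>/<filename>. The function names and status strings are my own.

```python
import csv
import hashlib
import os


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(manifest_path, download_dir):
    """Return {file_id: 'ok' | 'missing' | 'md5 mismatch'} for every manifest row."""
    status = {}
    with open(manifest_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            path = os.path.join(download_dir, row["id"], row["filename"])
            if not os.path.isfile(path):
                status[row["id"]] = "missing"
            elif md5sum(path) != row["md5"]:
                status[row["id"]] = "md5 mismatch"
            else:
                status[row["id"]] = "ok"
    return status
```

If all 433 entries come back as "ok", the data files match the checksums in the manifest regardless of how many logs folders exist.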
Does someone know how to fix this error?
Internet problem? If you tried several times and it still failed, I guess it is an internet problem. Otherwise, check the quota of the hard disk.