Tutorial: Protocol To Download TCGA Data From GDC
8.3 years ago
Shicheng Guo ★ 9.5k

Now that TCGA has moved under the Genomic Data Commons (GDC), many previous users are struggling to retrieve the same information. This tutorial shows how to download TCGA data from GDC.

Step 1. Obtaining a Manifest File for Data Download (the manifest specifies which data files to download)

https://gdc-portal.nci.nih.gov/legacy-archive/search/f
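
A manifest from Step 1 is a plain tab-separated table, so it can be inspected before any download starts. The following is a minimal sketch, assuming the manifest has a header row with `id`, `filename`, `md5`, `size` and `state` columns; the function names are made up for illustration and are not part of any GDC tool.

```python
import csv
import io


def parse_manifest(text):
    """Parse manifest text (tab-separated, header row assumed) into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    # Skip blank or malformed rows that carry no UUID in the assumed 'id' column.
    return [row for row in reader if row.get("id")]


def read_manifest(path):
    """Read a manifest file from disk and parse it."""
    with open(path, newline="") as handle:
        return parse_manifest(handle.read())
```

Calling `read_manifest("gdc_manifest.lungCancer.txt")` would then give one dictionary per file, which makes it easy to count files or list UUIDs before running gdc-client.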

Step 2. Install the download software: GDC Data Transfer Tool (Linux, Windows, Mac OS X)

https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

Step 3.1 Downloading Data Using a Manifest File (gdc_manifest.lungCancer.txt)

gdc-client download -m gdc_manifest.lungCancer.txt

Step 3.2 Downloading a Single File Using a UUID (the UUID can be found in the manifest file)

gdc-client download 22a29915-6712-4f7a-8dba-985ae9a1f005

Step 3.3 Downloading Controlled Data (user authentication token is required)

gdc-client download -m gdc_manifest_controled.txt -t gdc-user-passwdcode.txt
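
The three command forms in Steps 3.1 to 3.3 can also be assembled from a script. This is only a sketch: it uses the `-m` and `-t` options exactly as shown above and nothing else, and `build_download_command` and `run_download` are hypothetical helper names.

```python
import subprocess


def build_download_command(manifest=None, uuids=(), token_file=None, client="gdc-client"):
    """Build the argument list for a 'gdc-client download' call."""
    command = [client, "download"]
    if manifest:
        command += ["-m", manifest]    # Step 3.1: download everything in a manifest
    if token_file:
        command += ["-t", token_file]  # Step 3.3: token file for controlled data
    command += list(uuids)             # Step 3.2: one or more UUIDs
    return command


def run_download(command):
    """Run the command and return its exit code (0 is assumed to mean success)."""
    return subprocess.run(command, check=False).returncode
```

For example, `build_download_command(manifest="gdc_manifest_controled.txt", token_file="gdc-user-passwdcode.txt")` reproduces the Step 3.3 command as a list that `subprocess` can run without shell quoting issues.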

FAQ:

1, ./gdc-client: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /tmp/_MEI5oSpPi/libz.so.1)

Answer: glibc 2.12 is the latest version available for CentOS 6, which means the prebuilt gdc-client binary cannot be used to download the data on CentOS 6 (this was the case on TSCC at UCSD).

2, How to download controlled data from GDC: see Step 3.3 above; a user authentication token is required.

3, Eventually, I asked the TSCC manager to help me install fastq-dump on TSCC.

4, Downloads sometimes fail because of network problems. Don't worry, just try again.
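
FAQ item 1 comes down to a version comparison: the binary asks for GLIBC_2.14 while the system provides 2.12. A small sketch of that check follows, using the standard library's `platform.libc_ver()`; the helper names are illustrative, and the 2.14 threshold is taken from the error message above, so other builds of the client may need a different value.

```python
import platform


def parse_version(text):
    """Turn a dotted version such as '2.12' into a tuple of ints, e.g. (2, 12)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_is_sufficient(found, required="2.14"):
    """Return True if the found glibc version is at least the required one."""
    return parse_version(found) >= parse_version(required)


def local_glibc():
    """Return the local glibc version string, or None if it cannot be determined."""
    name, version = platform.libc_ver()
    return version if name == "glibc" and version else None
```

On a CentOS 6 machine `local_glibc()` would be expected to report 2.12, for which `glibc_is_sufficient` returns False.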

GDC methylation TCGA • 42k views

Thanks for sharing. Could you please give some more detail about:

  1. How to extract different data types (expression, methylation, clinical, etc.). Is the manifest file the same for all data types?
  2. Is it possible to download an expression matrix (all samples in a single file) with TCGA tumor IDs instead of UUIDs?

I downloaded expression data for TCGA-ESCA. There are 164 cases for this cancer, but there are 519 files with the extensions *.FPKM.txt.gz, *.FPKM-UQ.txt.gz and *.htseq.counts.gz. How do I map them to tumor IDs / aliquot IDs and build an expression matrix (all tumor samples × all genes)?

I had a similar issue with mapping all file_ids to one case_id.

The GDC API Getting_Started page indicated that I can "expand" a section for the "cases" endpoint, and voilà, I got the case_id <===> file_id mapping:

Example (find all files available for case_id = 31bd8589-378c-40e5-8b7f-3b4c81f304be):

curl -s 'https://gdc-api.nci.nih.gov/cases/31bd8589-378c-40e5-8b7f-3b4c81f304be?pretty=true&expand=files' | grep -E 'file_id|file_name' | paste -d " "  - -

        "file_name": "323800b5-c319-4fd8-ac96-87193afb93e4.FPKM.txt.gz",          "file_id": "e400f345-b273-4cfc-9a1e-d1fff79f5eee",
        "file_name": "3b600545-75cb-42df-ad6d-3b5c977ff7d5.vep.reheader.vcf.gz",          "file_id": "3b600545-75cb-42df-ad6d-3b5c977ff7d5",
        "file_name": "e5b0c8fa-2b7e-4140-87d9-a5046490a08b.snp.Somatic.hc.vcf.gz",          "file_id": "e5b0c8fa-2b7e-4140-87d9-a5046490a08b",
        "file_name": "60c334bb-d579-4cf3-9fd0-e450c3e652d8.vep.reheader.vcf.gz",          "file_id": "60c334bb-d579-4cf3-9fd0-e450c3e652d8",
        "file_name": "c6b1fb77-8102-42bb-bdc0-a48270b7be9f.vcf.gz",          "file_id": "c6b1fb77-8102-42bb-bdc0-a48270b7be9f",
        "file_name": "mirnas.quantification.txt",          "file_id": "440b3abb-63e1-4a67-9708-31ee19081ec7",
        "file_name": "TCGA.READ.mutect.c49c62e7-dec8-4b77-9ba5-88d196c8ae94.protected.maf.gz",          "file_id": "c49c62e7-dec8-4b77-9ba5-88d196c8ae94",
        "file_name": "nationwidechildrens.org_clinical.TCGA-AG-A026.xml",          "file_id": "1fda4f40-ad4e-4b91-9379-c61b611769ee",

1, The manifest is built from whatever you select (adding files to the cart on the GDC website selects them), so it is not the same for all data types.

2, No, you cannot download the data matrix directly. You need to download all the files and then merge them with Perl, R, Python or C.
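
To illustrate the merge step in point 2, here is a minimal Python sketch. It assumes each `*.htseq.counts.gz` file is a gzipped two-column table (gene ID, tab, integer count) without a header, and that summary rows start with `__`, as in HTSeq output. The sample labels are whatever you choose to pass in, and filling missing genes with 0 is an assumption of this sketch.

```python
import gzip


def read_counts(path):
    """Read one gzipped two-column counts file into a {gene_id: count} dict."""
    counts = {}
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Skip blank lines and HTSeq summary rows such as '__no_feature'.
            if len(fields) < 2 or fields[0].startswith("__"):
                continue
            counts[fields[0]] = int(fields[1])
    return counts


def merge_counts(paths_by_sample):
    """Merge per-sample files into (header, rows) with one row per gene."""
    per_sample = {sample: read_counts(path) for sample, path in paths_by_sample.items()}
    genes = sorted(set().union(*per_sample.values()))
    samples = list(per_sample)
    header = ["gene_id"] + samples
    # Assumption: a gene missing from one file is reported as 0 for that sample.
    rows = [[gene] + [per_sample[s].get(gene, 0) for s in samples] for gene in genes]
    return header, rows
```

The keys of `paths_by_sample` become the column names, so passing TCGA barcodes instead of UUIDs there gives a matrix labelled by sample.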

Thanks @Shicheng Guo;

Why are there three different types of files: *.FPKM.txt.gz, *.FPKM-UQ.txt.gz and *.htseq.counts.gz? I have 519 directories for 164 cases, so how do I merge them? I expected 164 * 3 = 492. And how do I match UUIDs to TCGA sample IDs?

These are three different files, produced at successive steps of the same pipeline:

Fragment Count (HT-Seq) -> Gene Count -> Count Normalization -> FPKM -> Upper Quartile Normalization -> FPKM-UQ

https://gdc.nci.nih.gov/about-data/data-harmonization-and-generation/genomic-data-harmonization/high-level-data-generation/rna-seq-quantification

But how do I map UUIDs to TCGA patient barcodes?
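
To make the difference between the FPKM and FPKM-UQ files concrete, here is a small sketch of the two normalizations as I understand them from the GDC page linked above. Treat the formulas as an assumption rather than a quote from the documentation: FPKM divides by the total reads mapped to protein-coding genes, and FPKM-UQ divides by the 75th-percentile gene count of the sample instead.

```python
def fpkm(read_count, gene_length_bp, protein_coding_reads):
    """FPKM for one gene (assumed formula): count * 1e9 / (total reads * length)."""
    return read_count * 1e9 / (protein_coding_reads * gene_length_bp)


def fpkm_uq(read_count, gene_length_bp, upper_quartile_count):
    """FPKM-UQ for one gene (assumed formula): the sample's upper-quartile
    gene count replaces the total read count in the denominator."""
    return read_count * 1e9 / (upper_quartile_count * gene_length_bp)
```

For a gene with 100 reads and a length of 1000 bp in a sample with one million protein-coding reads, `fpkm(100, 1000, 1_000_000)` gives 100.0.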

Hi! I had the same UUID-to-TCGA-ID problem. I solved it by using R to write a JSON query that is then used on the command line.

Here I wrote a post about it.

Hope it is useful!
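
The linked post is not reproduced here, but the general idea of a JSON query for this mapping can be sketched in Python. Everything below is an assumption to check against the GDC API documentation: the `https://api.gdc.cancer.gov/files` host, the field names, and the response layout under `data.hits`. The function names are made up.

```python
import json
import urllib.request

FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"  # assumed current API host


def build_barcode_query(file_ids):
    """Build a JSON payload asking for the sample barcodes behind some file UUIDs."""
    ids = list(file_ids)
    return {
        "filters": {"op": "in", "content": {"field": "files.file_id", "value": ids}},
        "fields": "file_id,cases.submitter_id,cases.samples.submitter_id,"
                  "cases.samples.sample_type",
        "format": "JSON",
        "size": len(ids),
    }


def barcodes_from_response(response):
    """Map file_id -> list of (sample barcode, sample type) from a parsed response."""
    mapping = {}
    for hit in response["data"]["hits"]:
        mapping[hit["file_id"]] = [
            (sample.get("submitter_id"), sample.get("sample_type"))
            for case in hit.get("cases", [])
            for sample in case.get("samples", [])
        ]
    return mapping


def post_query(payload, url=FILES_ENDPOINT):
    """Send the payload as a JSON POST request and return the decoded reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

With the UUIDs taken from a manifest, `barcodes_from_response(post_query(build_barcode_query(uuids)))` would give the UUID-to-barcode table in one call, provided the assumptions above hold.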

I really think something needs to be done about the gdc-client tool. I cannot install it on Mac OS X Sierra. I downloaded the tool more than five times, unpacked it, double-clicked it, and all I get is the same error, given below:

Musalulas-MacBook-Pro:~ sinkala$ /Users/sinkala/Downloads/gdc-client ; exit;
usage: gdc-client [-h] [--version] {download,upload,interactive} ...
gdc-client: error: too few arguments
logout
Saving session...
...copying shared history...
...saving history...truncating history files...
...completed.

[Process completed]

I have also tried the alternative ways of installing it, but I have not been successful either. I have tried to download the data directly from the data portal; even that does not work for a file smaller than 400 MB: the server does not respond or something like that. :( :(

The gdc-client is a command-line tool. You cannot just double-click on it. See https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/. If you have other problems, feel free to contact the gdc support staff: support@nci-gdc.datacommons.io.

Hi, I am new to BioStars so apologies for any syntax errors.

When I try to start the gdc-client.exe application, it shows the following error and then the window disappears:

usage: gdc-client [-h] [--version] {download,upload,interactive} 
gdc-client: error: too few arguments

How to solve:

You need to run the program from the command line, the hideous interface summoned by typing 'cmd' into the Start menu. You first need to add the folder that contains the unzipped "gdc-client.exe" file to your Path.

Here is a guide: https://www.wikihow.com/Run-a-Program-on-Command-Prompt

After doing this you can start using gdc-client commands. For example, type the "gdc-client download" command followed by "-m" for the manifest, then the file location:

gdc-client download -m  /Users/JohnDoe/Downloads/gdc_manifest_6746fe840d924cf623b4634b5ec6c630bd4c06b5.txt

If you don't know how to make a manifest go here: http://www.andrewjanowczyk.com/download-tcga-digital-pathology-images-ffpe/

Finally

If you start getting error messages about there being 'no such file or directory', try moving the manifest file into the same folder as your gdc-client.exe application, then simply type the command followed by the manifest file name:

gdc-client download -m gdc_manifest_20181207_182951.txt

Now that the manifest sits in that folder, it doesn't need the location specified, provided the command prompt's current directory is that same folder. It will then start downloading (hopefully)!

Enjoy
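
Both problems in this reply (the program not being found, and 'no such file or directory' for the manifest) can be checked before launching a download. A small sketch, with `preflight` as a made-up helper name:

```python
import os
import shutil


def preflight(tool="gdc-client", manifest=None):
    """Return a list of human-readable problems; an empty list means all is well."""
    problems = []
    # shutil.which searches the directories listed in the PATH environment variable.
    if shutil.which(tool) is None:
        problems.append(f"{tool} is not on PATH")
    # A relative manifest name is resolved against the current working directory.
    if manifest is not None and not os.path.isfile(manifest):
        problems.append(f"manifest not found: {os.path.abspath(manifest)}")
    return problems
```

Printing the absolute path in the second message shows where the manifest was actually looked for, which is usually enough to spot a wrong working directory.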

Install download software: GDC Data Transfer Tool (Linux)

Could someone please help me? I'm having trouble installing gdc-client on Ubuntu 14. I downloaded the gdc zip file and extracted it. After that, in the shell I typed ./ gdc-client, but nothing happened.

You can always contact support@nci-gdc.datacommons.io for support.

Try using chmod to make the file executable (for example, chmod +x gdc-client) before executing it.
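
If the download is being scripted, the same permission change can be made from Python. This is a minimal sketch that sets the execute bits for user, group and others, roughly what `chmod a+x` does; `make_executable` is a hypothetical helper name.

```python
import os
import stat


def make_executable(path):
    """Add execute permission for user, group and others; return the new mode."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return os.stat(path).st_mode
```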

I'm using gdc-client v1.2.0. I deliberately sorted my manifest file by patient ID so that I could download tumor-normal pair BAMs one after another. In practice, the BAMs were downloaded in a seemingly random order rather than from top to bottom of my sorted manifest. Do other people have the same problem? Is there a way to fix it?
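
One possible workaround is to call the client once per UUID, in the order of the sorted manifest, so that the order is controlled by the calling script. Why the client reorders a manifest is not established in this thread, so this sketch sidesteps the question; it relies only on the single-UUID form of the command from Step 3.2, and the helper names are made up.

```python
import subprocess


def ordered_commands(manifest_rows, client="gdc-client"):
    """One 'download <uuid>' command per manifest row, in the order given."""
    return [[client, "download", row["id"]] for row in manifest_rows]


def run_in_order(commands, runner=subprocess.run):
    """Run the commands one after another; stop at the first failure."""
    for command in commands:
        # check=True makes subprocess.run raise if the client exits with an error.
        runner(command, check=True)
```

The trade-off is that each file is fetched in its own client invocation, which may be slower than letting the client handle the whole manifest.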

Hi, I have downloaded the GDC client tool to download files from GDC. As I understand it, each download folder should contain the data (or zipped data) and a logs folder. My files downloaded successfully. However, I see only a few logs folders. For example, the Bladder Urothelial Carcinoma (BLCA) manifest includes 433 UUIDs, but only 53 logs folders were found. Could you let me know whether the downloaded files are accurate?

Thank you.
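
Rather than counting logs folders, one way to check a download is to compare each file's MD5 checksum with the value in the manifest. The sketch below assumes the manifest rows carry `id`, `filename` and `md5` columns and that each file was saved as `<download_dir>/<id>/<filename>`; both are assumptions to verify against your own files, and the function names are illustrative.

```python
import hashlib
import os


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(rows, download_dir):
    """Sort manifest UUIDs into 'ok', 'missing' and 'mismatch' lists."""
    report = {"ok": [], "missing": [], "mismatch": []}
    for row in rows:
        # Assumed layout: one sub-directory per UUID holding the named file.
        path = os.path.join(download_dir, row["id"], row["filename"])
        if not os.path.isfile(path):
            report["missing"].append(row["id"])
        elif md5sum(path) != row["md5"]:
            report["mismatch"].append(row["id"])
        else:
            report["ok"].append(row["id"])
    return report
```

For the BLCA example, an empty `missing` list and an empty `mismatch` list for all 433 UUIDs would be a stronger sign of a complete download than the number of logs folders.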

Does someone know how to fix this error?

 92% [##########################ERROR: Max retries exceeded.:02:27  16.23 MB/s 
ERROR: Max retries exceeded.
ERROR: Max retries exceeded.
ERROR: Max retries exceeded.

A network problem? If it retried several times and then failed, I guess it is a network problem. Otherwise, check the quota on your hard disk.
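
If the failures are transient network problems, the "just try again" advice can be automated. Below is a generic retry helper with exponential backoff, given as a sketch; `retry` is not part of gdc-client, and the action passed in is whatever function starts your download.

```python
import time


def retry(action, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call action() until it succeeds, waiting base_delay * 2**n between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return action()
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

The action has to raise an exception on failure, for example by running the client through `subprocess.run(..., check=True)`. A full disk will not be fixed by retrying, so the quota check above still applies.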

7.0 years ago

If you are looking for a flexible programmatic approach, you might take a look at the GenomicDataCommons Bioconductor package: https://bioconductor.org/packages/GenomicDataCommons

Find data

The following code builds a manifest that can be used to guide the download of raw data. Here, filtering finds gene expression files quantified as raw counts using HTSeq from ovarian cancer patients.

library(GenomicDataCommons)
library(magrittr)
ge_manifest = files() %>% 
    filter( ~ cases.project.project_id == 'TCGA-OV' &
                type == 'gene_expression' &
                analysis.workflow_type == 'HTSeq - Counts') %>%
    manifest()

Download data

The next code block downloads the 379 gene expression files specified in the query above. Using multiple processes to do the download very significantly speeds up the transfer in many cases. On a standard 1Gb connection, the following completes in about 30 seconds.

destdir = tempdir()
fnames = lapply(ge_manifest$id,gdcdata,
                destination_dir=destdir,overwrite=TRUE,
                progress=FALSE)

If the download had included controlled-access data, the download above would have needed to include a token.

Sean, for recent requests of access to the data, it seems that users are forwarded here: https://dcc.icgc.org/

From there, approved users can obtain an access token, but it does not seem to cover all data. Most importantly, it doesn't cover the mirror where TCGA data is hosted (GDC Chicago). How does one actually obtain a GDC access token? A lot of the programs and services appear to have been shut down.

The ICGC is not the right place to get access to TCGA controlled-access data, as you point out. To gain access to controlled-access TCGA data, one needs to apply through dbGaP. The process is documented here:

https://gdc.cancer.gov/access-data/obtaining-access-controlled-data

After approval for controlled-access data, you can log in to the GDC data portal to get your access token (the download link will be under your username after logging in).
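
Once the token file has been downloaded from the portal, it can be attached to a request made directly against the GDC API. This is a sketch under two assumptions that should be checked against the GDC API documentation: that files are served from `https://api.gdc.cancer.gov/data/<UUID>` and that the token is passed in an `X-Auth-Token` header. The function names are made up.

```python
import urllib.request

DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"  # assumed API location


def read_token(path):
    """Read the token text from the file downloaded from the portal."""
    with open(path) as handle:
        return handle.read().strip()


def build_data_request(file_id, token=None, endpoint=DATA_ENDPOINT):
    """Build (but do not send) a download request for one file UUID."""
    request = urllib.request.Request(f"{endpoint}/{file_id}")
    if token:
        # Assumed header name for controlled-access downloads.
        request.add_header("X-Auth-Token", token.strip())
    return request
```

The request can then be passed to `urllib.request.urlopen`. Keep the token file private, since it grants access to controlled data under your account.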

Thanks Sean - that's what I expected. We are currently awaiting dbGaP approval.

7.9 years ago
Chun-Jie Liu ▴ 280

For CentOS, you need to download the gdc-client source code and compile it yourself.

The gdc-client GitHub issue tracker covers this problem: glibc 2.12 is the latest version available for CentOS 6.

If your system is CentOS release 6.6, I think you should download the gdc-client source code and compile it yourself. gdc-client is based on Python 2.

  1. git clone https://github.com/NCI-GDC/gdc-client
  2. python setup.py install

You may run into this error:

The 'lxml==3.5.0b1' distribution was not found and is required by gdc-client

or

ImportError: /usr/lib64/libxml2.so.2: version `LIBXML2_2.9.0' not found (required by lxml/etree.so)

You need to install libxslt and libxml2 in your home path, and add xml2-config and xslt-config to your PATH:

export PATH="/prog_path/libxslt-1.1.29/bin:/prog_path/libxml2-2.9.4/bin:$PATH"

Then

  1. pip uninstall lxml
  2. pip install lxml==3.5.0b1 --install-option="--auto-rpath"

Finally, compile the gdc-client source code.

  1. python setup.py install

It worked.

Hi Shicheng

Thanks for the detailed overview. One more clarification: when I try to download the WGS (whole genome sequencing) data for, say, breast cancer (TCGA-BRCA) from the GDC Legacy Archive, the second column of the manifest file contains some IDs which are not sample IDs. What are those? For example:

01aa8d222c93eac50081544889046aeb.bam 01e2ea9ed2554ea6df56ed963414b511.bam

etc. If these are also samples, how do I retrieve their corresponding TCGA IDs?

Thanks in advance.

@aanchalsharma833
GDC provides an API, and you can get this information by querying it. I wrote a simple script, available on my GitHub, that maps file_id to TCGA barcode (submitter_id in GDC). The TCGA barcode encodes sample information, so the script extracts both the sample type and the TCGA barcode.

The input is the manifest file you downloaded from GDC. The output is a mapping file whose header is generated automatically from the GDC API response. Hope it is useful.
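
The script itself is not shown in the thread. As an illustration of how a barcode carries sample information, here is a small independent sketch that splits a TCGA barcode and reads the two-digit sample code in the fourth field (01 to 09 tumor, 10 to 19 normal, 20 to 29 control). `parse_barcode` is a made-up helper and is not taken from the script mentioned above.

```python
def parse_barcode(barcode):
    """Split a TCGA barcode into its leading fields and classify the sample."""
    parts = barcode.split("-")
    if len(parts) < 3:
        raise ValueError(f"not a TCGA barcode: {barcode!r}")
    info = {
        "project": parts[0],
        "tss": parts[1],          # tissue source site
        "participant": parts[2],
        "patient": "-".join(parts[:3]),
    }
    # The sample field only exists on sample-level (or longer) barcodes.
    if len(parts) > 3 and parts[3][:2].isdigit():
        code = int(parts[3][:2])
        info["sample_code"] = parts[3][:2]
        if 1 <= code <= 9:
            info["sample_class"] = "tumor"
        elif 10 <= code <= 19:
            info["sample_class"] = "normal"
        elif 20 <= code <= 29:
            info["sample_class"] = "control"
        else:
            info["sample_class"] = "unknown"
    return info
```

Grouping files by the `patient` value is one way to line up tumor and normal files that belong to the same case.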
