Sample names for TCGA data from GDC-legacy archive
6.7 years ago
Vasu ▴ 790

Hi,

I needed raw RNA-seq sequencing data, so I downloaded the RNA-seq manifest file from the GDC legacy archive and then used my token to download the raw data.

The manifest looks like this:

id  filename    md5 size    state
d1017f74-3a39-4427-af57-273e34247b49    UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz   ed7f23aa9540ef0242cb6ddde30d1aca    5830465428  live
5e2d5c52-596f-49bc-967c-42129abbacbf    UNCID_2208720.71b58051-3bf8-4dfb-a431-c8aceab7c799.110608_UNC13-SN749_0073_BD0CV8ABXX_2.tar.gz  b1f03852b2ac3c3cd50cb4a87f2a116a    7587398372  live
2ef74f93-5da2-454c-aca2-d86c289eacb8    UNCID_2206802.25be50e7-7705-492d-a44a-0e40180d10c8.110901_UNC12-SN629_0127_BC025UABXX_1_CTTGTA.tar.gz   a965da78ada814a35702fd65209b500a    7867889236  live
e01ca3e0-beb0-46b7-bb7c-f5b16f966918    UNCID_2521679.d817dcee-1322-4949-a6e9-138447e6fc56.140417_UNC13-SN749_0343_BC41HBACXX_5_CTTGTA.tar.gz   6e6a26fcce8e84d209b1475249a922de    5187498148  live
992a7083-28ce-4857-898e-9d4b4fbf2fa1    UNCID_2319278.bf92b8cc-9a5c-4e96-917c-c264fe588f8d.131118_UNC12-SN629_0336_AC31D0ACXX_5_ACTTGA.tar.gz   bb9e19a5f286ff37bf95cb0c307930ea    6717741168  live
230082b7-39ec-4fe1-b3c6-daf35458f396    UNCID_2206889.526da11e-9125-4fcd-98d7-02994c9783d1.110810_UNC10-SN254_0263_AB09WEABXX_3_CAGATC.tar.gz   d93777efebc921e2539aa2b7081da6d4    4766879929  live
9bbada51-d827-4eea-af45-47d7b5ba137e    UNCID_2206522.147d6ebb-7359-449a-9e6a-6c8443ebaa2e.110919_UNC13-SN749_0113_AB00WUABXX_3_CGATGT.tar.gz   7bcfe71256ba172fa605bc4ddc04f9c7    7606309309  live
db1b68b0-dc0a-48a5-8acb-4cd45ea186e2    UNCID_2664315.22fe5cac-0623-4d0a-a158-f15fb5477d8f.120710_UNC12-SN629_0215_BC0WRNACXX_3_CTTGTA.tar.gz   281083ac338f67145ede3d8ef3f4300f    7504345358  live

After the download I have one folder per entry in the "id" column. Inside each folder there is a tar.gz file.

For example:

d1017f74-3a39-4427-af57-273e34247b49
                       |___ UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz

When I extracted the tar.gz files I got the fastq files like below:

110801_UNC12-SN629_0115_BD0DVEABXX.3_1.fastq
110801_UNC12-SN629_0115_BD0DVEABXX.3_2.fastq

What is the sample name here? It looks very confusing.

RNA-Seq tcga gdc • 18k views
ADD COMMENT
2
Entering edit mode

As this is controlled data, you could log in at the main GDC ( https://portal.gdc.cancer.gov/ ) and use the search box to search for the file-names - they should be there. You would then obviously look for the UUID or TCGA barcode.

To do this programmatically, there are APIs but, the last time that I tried them, they were offline. A lot of TCGA data has been moved around relatively recently. One way that I did it was to download the JSON manifest for my data (from the Legacy Archive) and then use a loop in R to pull out the case ID (which is the case UUID), which I then used to identify the patients. Here's the loop that I used (it is slow; the sample filenames are in the filenames object):

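library(rjson)
manifest <- fromJSON(file = "RNAseqManifest.json")

# Look up each filename's case UUID from the manifest
fileUUIDs <- character(length(filenames))
for (i in seq_along(filenames))
{
    # grep() coerces each manifest record to a string and
    # returns the index of the record that contains the filename
    record <- manifest[[grep(filenames[i], manifest, fixed = TRUE)]]

    # Sanity check: the matched record should carry exactly this filename
    if (filenames[i] != record$file_name)
    {
        warning("Filename mismatch for ", filenames[i])
    }

    fileUUIDs[i] <- record$cases[[1]]$case_id
}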
ADD REPLY
0
Entering edit mode

Sorry, I didn't understand what the filenames object is. What are the sample filenames?

ADD REPLY
0
Entering edit mode

Just a vector of your filenames, such as:

filenames <- c("UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz",
  ...,
  "UNCID_2664315.22fe5cac-0623-4d0a-a158-f15fb5477d8f.120710_UNC12-SN629_0215_BC0WRNACXX_3_CTTGTA.tar.gz")
ADD REPLY
0
Entering edit mode

As shown in the manifest above, I already have the UUIDs. What I need is the TCGA sample name. Do I need to log in to the GDC and search for this?

ADD REPLY
0
Entering edit mode

I see. For UUID-to-TCGA barcode mapping, I was able to just use one of the clinical data files in BioTab format (also available at Legacy Archive).

For example, here is the file for breast cancer: https://portal.gdc.cancer.gov/legacy-archive/files/735bc5...

The first 2 columns of that file are:

  • bcr_patient_uuid
  • bcr_patient_barcode
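
A minimal sketch of that lookup, assuming the BioTab file has already been read into a data frame. The rows below are made-up placeholders, not real TCGA records, and case_uuids is a hypothetical vector holding the case UUIDs you want to translate. Real BioTab files may also carry extra descriptive header rows that need to be skipped when reading.

```r
# Stand-in for the clinical BioTab table; in practice it would come from
# something like read.delim() on the downloaded file. Values are placeholders.
clinical <- data.frame(
    bcr_patient_uuid    = c("uuid-aaa", "uuid-bbb", "uuid-ccc"),
    bcr_patient_barcode = c("TCGA-XX-0001", "TCGA-XX-0002", "TCGA-XX-0003"),
    stringsAsFactors = FALSE
)

# Hypothetical case UUIDs to translate (the last one is deliberately
# absent from the table)
case_uuids <- c("UUID-CCC", "uuid-aaa", "uuid-zzz")

# Letter case may differ between sources, so compare in lower case
idx <- match(tolower(case_uuids), tolower(clinical$bcr_patient_uuid))

mapping <- data.frame(
    case_uuid = case_uuids,
    barcode   = clinical$bcr_patient_barcode[idx],  # NA where nothing matched
    stringsAsFactors = FALSE
)
print(mapping)
```

Any NA in the barcode column flags a UUID that is missing from the clinical file and is worth checking by hand.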
ADD REPLY
0
Entering edit mode

Somehow the code below no longer works. However, the "UUIDtoBarcode" function from the TCGAutils R package solves the problem.
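
For reference, a rough sketch of how that call might look. It assumes TCGAutils is installed from Bioconductor and that the GDC API is reachable; translate_file_uuids is just a hypothetical wrapper name, and whether legacy-archive file UUIDs still resolve depends on the current state of the GDC.

```r
# Hypothetical wrapper around TCGAutils::UUIDtoBarcode() for file UUIDs.
# Requires the TCGAutils package and network access to the GDC.
translate_file_uuids <- function(file_uuids) {
    if (!requireNamespace("TCGAutils", quietly = TRUE)) {
        stop("TCGAutils is not installed; try BiocManager::install('TCGAutils')")
    }
    # from_type tells the function which kind of UUID is being passed in
    TCGAutils::UUIDtoBarcode(file_uuids, from_type = "file_id")
}

# Example call (not run here because it needs network access):
# translate_file_uuids("d1017f74-3a39-4427-af57-273e34247b49")
```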

ADD REPLY
0
Entering edit mode

That is expected. The function that I wrote above is inefficient and served a specific purpose at that time. Did you try Vasu's (Sean Davis's) code below?

ADD REPLY
11
Entering edit mode
6.7 years ago
Vasu ▴ 790

Best way to do it.

library(GenomicDataCommons)
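# header = TRUE is needed so that the "id" column can be accessed by name
manifest <- read.table("gdc_manifest_rnaseq_fastq.txt", header = TRUE, stringsAsFactors = FALSE)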

manifest:

id  filename    md5 size    state
d1017f74-3a39-4427-af57-273e34247b49    UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz   ed7f23aa9540ef0242cb6ddde30d1aca    5830465428  live
5e2d5c52-596f-49bc-967c-42129abbacbf    UNCID_2208720.71b58051-3bf8-4dfb-a431-c8aceab7c799.110608_UNC13-SN749_0073_BD0CV8ABXX_2.tar.gz  b1f03852b2ac3c3cd50cb4a87f2a116a    7587398372  live
2ef74f93-5da2-454c-aca2-d86c289eacb8    UNCID_2206802.25be50e7-7705-492d-a44a-0e40180d10c8.110901_UNC12-SN629_0127_BC025UABXX_1_CTTGTA.tar.gz   a965da78ada814a35702fd65209b500a    7867889236  live
e01ca3e0-beb0-46b7-bb7c-f5b16f966918    UNCID_2521679.d817dcee-1322-4949-a6e9-138447e6fc56.140417_UNC13-SN749_0343_BC41HBACXX_5_CTTGTA.tar.gz   6e6a26fcce8e84d209b1475249a922de    5187498148  live
992a7083-28ce-4857-898e-9d4b4fbf2fa1    UNCID_2319278.bf92b8cc-9a5c-4e96-917c-c264fe588f8d.131118_UNC12-SN629_0336_AC31D0ACXX_5_ACTTGA.tar.gz   bb9e19a5f286ff37bf95cb0c307930ea    6717741168  live
230082b7-39ec-4fe1-b3c6-daf35458f396    UNCID_2206889.526da11e-9125-4fcd-98d7-02994c9783d1.110810_UNC10-SN254_0263_AB09WEABXX_3_CAGATC.tar.gz   d93777efebc921e2539aa2b7081da6d4    4766879929  live
9bbada51-d827-4eea-af45-47d7b5ba137e    UNCID_2206522.147d6ebb-7359-449a-9e6a-6c8443ebaa2e.110919_UNC13-SN749_0113_AB00WUABXX_3_CGATGT.tar.gz   7bcfe71256ba172fa605bc4ddc04f9c7    7606309309  live
db1b68b0-dc0a-48a5-8acb-4cd45ea186e2    UNCID_2664315.22fe5cac-0623-4d0a-a158-f15fb5477d8f.120710_UNC12-SN629_0215_BC0WRNACXX_3_CTTGTA.tar.gz   281083ac338f67145ede3d8ef3f4300f    7504345358  live

file_uuids <- manifest$id
file_uuids

d1017f74-3a39-4427-af57-273e34247b49
5e2d5c52-596f-49bc-967c-42129abbacbf
2ef74f93-5da2-454c-aca2-d86c289eacb8
e01ca3e0-beb0-46b7-bb7c-f5b16f966918
992a7083-28ce-4857-898e-9d4b4fbf2fa1
230082b7-39ec-4fe1-b3c6-daf35458f396
9bbada51-d827-4eea-af45-47d7b5ba137e
db1b68b0-dc0a-48a5-8acb-4cd45ea186e2

library(GenomicDataCommons)
library(magrittr)

TCGAtranslateID = function(file_ids, legacy = TRUE) {
    # filter() and select() are namespace-qualified so that the dplyr
    # versions are not picked up by mistake if dplyr is attached as well
    info = files(legacy = legacy) %>%
        GenomicDataCommons::filter( ~ file_id %in% file_ids) %>%
        GenomicDataCommons::select('cases.samples.submitter_id') %>%
        results_all()
    # The mess of code below is to extract TCGA barcodes
    # id_list will contain a list (one item for each file_id)
    # of TCGA barcodes of the form 'TCGA-XX-YYYY-ZZZ'
    id_list = lapply(info$cases,function(a) {
        a[[1]][[1]][[1]]})
    # so we can later expand to a data.frame of the right size
    barcodes_per_file = sapply(id_list,length)
    # And build the data.frame
    return(data.frame(file_id = rep(ids(info),barcodes_per_file),
                      submitter_id = unlist(id_list)))
    }

res = TCGAtranslateID(file_uuids)
head(res)

file_id                                 submitter_id
d1017f74-3a39-4427-af57-273e34247b49    TCGA-E9-A1NA-11A
5e2d5c52-596f-49bc-967c-42129abbacbf    TCGA-AO-A12H-01A
2ef74f93-5da2-454c-aca2-d86c289eacb8    TCGA-AC-A23E-01A

I found the solution on Sean Davis's blog: https://seandavi.github.io/post/2017/12/genomicdatacommons-example-uuid-to-tcga-and-target-barcode-translation/
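
Once you have barcodes such as TCGA-E9-A1NA-11A, it can help to split them into their fields. The sketch below is only an illustration (parse_tcga_barcode is a hypothetical helper, not part of any package). It relies on the usual TCGA barcode layout, in which the fourth field starts with a two-digit sample type code: 01 to 09 for tumor, 10 to 19 for normal and 20 to 29 for control samples.

```r
# Split TCGA sample barcodes (e.g. "TCGA-E9-A1NA-11A") into their fields
parse_tcga_barcode <- function(barcodes) {
    parts <- strsplit(barcodes, "-", fixed = TRUE)
    field <- function(i) vapply(parts, function(p) p[i], character(1))

    sample_field <- field(4)                              # e.g. "11A"
    sample_code  <- as.integer(substr(sample_field, 1, 2))

    data.frame(
        barcode     = barcodes,
        tss         = field(2),                           # tissue source site
        participant = field(3),
        patient     = vapply(parts, function(p) paste(p[1:3], collapse = "-"),
                             character(1)),
        sample_code = sample_code,
        vial        = substr(sample_field, 3, 3),
        tissue      = ifelse(sample_code < 10, "tumor",
                      ifelse(sample_code < 20, "normal", "control")),
        stringsAsFactors = FALSE
    )
}

parsed <- parse_tcga_barcode(c("TCGA-E9-A1NA-11A",
                               "TCGA-AO-A12H-01A",
                               "TCGA-AC-A23E-01A"))
print(parsed)
```

For the three barcodes shown above, the first one (sample code 11) is a normal sample and the other two (sample code 01) are tumor samples.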

ADD COMMENT
3
Entering edit mode

For the record, Bioinfo's answer is Sean Davis's blog post which can be seen here: https://seandavi.github.io/post/2017/12/genomicdatacommons-example-uuid-to-tcga-and-target-barcode-translation/

If you are using content that is not yours, please cite it.

Also, we've added this functionality with Sean's permission to the TCGAutils package on Bioconductor.

Best regards, Marcel

ADD REPLY
0
Entering edit mode

Thank you for catching that, Marcel.

ADD REPLY
0
Entering edit mode

I just posted how I solved the problem. Yes, I should have posted the link. I completely forgot about it while posting the solution. Sorry for that.

ADD REPLY
1
Entering edit mode

Yes, you most definitely should have posted the link especially if the content was directly taken from the blog. You may not have intended to omit the attribution, but that omission makes you look bad. Gotta be extra careful, unfortunately.

ADD REPLY
0
Entering edit mode

Sure, I will be more careful next time. Thank you.

ADD REPLY
1
Entering edit mode

That's pretty cool - I've moved this to an answer. Please feel free to Accept it, as this will help others.

ADD REPLY
0
Entering edit mode

However, I received the following error in RStudio:

Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "c('gdc_files', 'GDCQuery', 'list')"

ADD REPLY
0
Entering edit mode

It should be "filter", not "filter_"

ADD REPLY
0
Entering edit mode

Bioinfo, this exact function was working 2 weeks ago. Something has changed, either at the GDC server or in the version of the GenomicDataCommons package (or elsewhere)

I tried it yesterday and it did not work either, unfortunately (tried from 2 different places).

ADD REPLY
0
Entering edit mode

I just tried the code above and it worked for me.

ADD REPLY
0
Entering edit mode

I tried it just now and it now gives this:

Error in .gdc_post(entity_name(x), body = body, legacy = x$legacy, token = NULL,  :
  Not Found (HTTP 404).
In addition: Warning message:
In strptime(x, fmt, tz = "GMT") :
  unknown timezone 'zone/tz/2018c.1.0/zoneinfo/Europe/London'
ADD REPLY
0
Entering edit mode

It could be due to the versions. Not sure. These are the versions I'm using.

GenomicDataCommons_1.5.3
BiocInstaller_1.31.1
R version 3.5.0
ADD REPLY
2
Entering edit mode

I modified the function to accept filenames, too: C: problem in matching the names between file names and patients Id in TCGA

Thanks.

ADD REPLY
2
Entering edit mode

Also got the function working again by updating directly from GitHub:

require(devtools)
install_github("Bioconductor/GenomicDataCommons")
ADD REPLY
1
Entering edit mode

Hi,

I tried your code but got this error: Error in curl::curl_fetch_memory(url, handle = handle) : Could not resolve host: gdc-api.nci.nih.gov

Any suggestion or help is much appreciated.

Thanks

ADD REPLY
1
Entering edit mode

Try to install the development version of GenomicDataCommons:

require(devtools)
install_github("Bioconductor/GenomicDataCommons")
ADD REPLY
1
Entering edit mode
5.7 years ago
david.peeney ▴ 30

Copying my answer from Tutorial: TCGA UUIDS to TCGA barcode (SampleID) in R

Using the GenomicDataCommons method only gives you short barcodes (for identifying patients), which is not particularly useful when dealing with duplicate samples. A good method I found that gives me harmonized UUIDs and full aliquot barcodes is:

Firstly, you need to download the JSON manifest files from your selected study and file types from the GDC legacy archive (NOT GDC portal).

Then, use the following R script:

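library(dplyr)
library(jsonlite)

# Metadata JSON downloaded from the cart in the GDC legacy archive
legacy = fromJSON(txt = "~/Downloads/metadata.cart.2019-03-07 (1).json")

# One data frame of associated entities (aliquot barcodes) per file.
# Naming the list by file UUID lets bind_rows() label every row correctly,
# even when a file has more than one associated entity.
entities = legacy[["associated_entities"]]
names(entities) = legacy[["file_id"]]
IDconversion = bind_rows(entities, .id = "file_id")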
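
For anyone who wants to see what this flattening does without dplyr or a real metadata file, here is a small base R sketch on toy data. All IDs are placeholders, and the column name entity_submitter_id is my assumption about how the aliquot barcode field is named in the metadata JSON.

```r
# Toy stand-in for the parsed JSON: one entry per file, each holding a
# data frame of associated entities. All IDs below are placeholders.
file_ids <- c("file-1", "file-2")
entities <- list(
    data.frame(entity_submitter_id = "TCGA-XX-0001-01A-11R-0000-07",
               stringsAsFactors = FALSE),
    data.frame(entity_submitter_id = c("TCGA-XX-0002-01A-11R-0000-07",
                                       "TCGA-XX-0002-11A-11R-0000-07"),
               stringsAsFactors = FALSE)
)

# Repeat each file ID once per entity so that the rows stay aligned
n_per_file <- vapply(entities, nrow, integer(1))
flat <- data.frame(
    file_id = rep(file_ids, times = n_per_file),
    do.call(rbind, entities),
    stringsAsFactors = FALSE
)
print(flat)
```

The point of repeating the file ID per entity is that a file linked to several aliquots ends up with one row per aliquot rather than a length mismatch.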
ADD COMMENT
0
Entering edit mode

@david.peeney, How and where did you download the metadata.cart* from GDC?

ADD REPLY
0
Entering edit mode

You obtain it from the GDC Data Portal by selecting the samples that you need and then downloading the JSON file.

ADD REPLY
0
Entering edit mode

@kelvin, thanks for the reply. I have 350 samples, and I would like a more efficient method for all of them than manually downloading the metadata file. Also, is there a way to get the metadata file in TSV format, given the manifest data with the UUIDs?

ADD REPLY
0
Entering edit mode

Hey, yes, you can obtain a TSV file, too. Can you clarify what you are trying to convert, and to what you want to convert it?

ADD REPLY
0
Entering edit mode

OK, I have 320 AML aligned exon files (BAM or *_gdc_realn.bam files). In addition, I have their manifest data with UUID, filename, md5, size, and state. I would like to have their metadata as a TSV file. Currently, all I have is these BAM files and their manifest data. All I need is the metadata file with the following information:

study   center  tcga_id analysis_id accession   participant_id  sample_id   refassembly mark_duplicates exome_bed
ADD REPLY
0
Entering edit mode

For your study, there should be clinical data files that also provide a lot of information - these are available from the GDC, too. Filter for the BCR Biotab files.

There is likely a programmatic way, too, but I cannot think of one for now.

ADD REPLY
0
Entering edit mode

Thank you. I have downloaded the TSV files of clinical information directly from the portal. However, they contain no information for some of the columns, for example mark_duplicates and refassembly.

ADD REPLY
0
Entering edit mode

I see. Information on those can likely be found in the SAM headers within each file. In addition, based on the overview of the DNA-seq analysis pipeline (HERE), it seems that PCR/optical duplicates are marked and that the ref assembly is GRCh38.d1.vd1
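
As a rough sketch of how the header could be inspected from R: the helper below (sam_header_tags, a hypothetical name) parses header lines that you would obtain with something like samtools view -H. The header used here is made up for illustration, and whether your files actually carry an AS tag on their @SQ lines or a duplicate-marking @PG entry is something to verify on the real data.

```r
# Pull TAG:value pairs out of SAM header lines of a given record type
# (e.g. "@SQ" or "@PG"). Header fields are tab-separated.
sam_header_tags <- function(header_lines, record_type) {
    keep <- header_lines[startsWith(header_lines, paste0(record_type, "\t"))]
    lapply(strsplit(keep, "\t", fixed = TRUE), function(fields) {
        fields <- fields[-1]                 # drop the record type itself
        tags   <- substr(fields, 1, 2)       # two-letter tag name
        values <- substring(fields, 4)       # everything after "TAG:"
        setNames(values, tags)
    })
}

# Made-up header for illustration; real lines could come from
# system2("samtools", c("view", "-H", "sample.bam"), stdout = TRUE)
header <- c(
    "@HD\tVN:1.5\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422\tAS:GRCh38.d1.vd1",
    "@PG\tID:bwa\tPN:bwa\tVN:0.7.15",
    "@PG\tID:MarkDuplicates\tPN:MarkDuplicates"
)

# Reference assembly as recorded on the @SQ lines (NA if the tag is absent)
assembly <- unique(vapply(sam_header_tags(header, "@SQ"),
                          function(x) unname(x["AS"]), character(1)))
# Programs that were run on the file, in header order
programs <- vapply(sam_header_tags(header, "@PG"),
                   function(x) unname(x["ID"]), character(1))
print(assembly)
print(programs)
```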

ADD REPLY
1
Entering edit mode

OK, I see, so that information has to be extracted directly from the BAM files. I thought I would have it separately as metadata. Thanks for the clarification.

ADD REPLY
1
Entering edit mode

Well, just always be meticulous with the TCGA data, i.e., introduce a lot of QC checking to ensure that you have the correct data... a lot of the TCGA was produced and has been duplicated and re-processed many times.
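
One cheap check of that kind is to compare the md5 of every downloaded file against the manifest. The sketch below assumes the layout described in the question (one folder per file UUID, with the file inside) and a manifest with id, filename and md5 columns; check_md5 is a hypothetical helper and the demo files are throwaway placeholders.

```r
# Compare the md5 of each downloaded file with the value in the manifest.
# `manifest` needs columns id, filename and md5; `download_dir` is the
# folder that holds one sub-folder per file UUID.
check_md5 <- function(manifest, download_dir) {
    paths    <- file.path(download_dir, manifest$id, manifest$filename)
    observed <- unname(tools::md5sum(paths))   # NA for files that are missing
    data.frame(
        id     = manifest$id,
        exists = file.exists(paths),
        md5_ok = !is.na(observed) & observed == manifest$md5,
        stringsAsFactors = FALSE
    )
}

# Self-contained demo with a temporary file standing in for a tarball
demo_dir <- tempfile("gdc_demo_")
dir.create(file.path(demo_dir, "uuid-1"), recursive = TRUE)
demo_file <- file.path(demo_dir, "uuid-1", "reads.tar.gz")
writeLines("placeholder", demo_file)

demo_manifest <- data.frame(
    id       = c("uuid-1", "uuid-2"),
    filename = c("reads.tar.gz", "missing.tar.gz"),
    md5      = c(unname(tools::md5sum(demo_file)), "0000"),
    stringsAsFactors = FALSE
)
qc <- check_md5(demo_manifest, demo_dir)
print(qc)
```

With a real download you would pass the manifest read with header = TRUE and the directory that the download client wrote into, then look at any rows where md5_ok is FALSE.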

ADD REPLY
