1

Question

Tutorial: TCGA UUIDs to TCGA barcode (SampleID) in R

28


8.2 years ago

martinguerrerog89 ▴ 310

Update, 12th May 2018

Since this post was made, a faster way to query the GDC data and convert UUIDs to TCGA barcodes has been found using the R programming language. See the thread here: Sample names for TCGA data from GDC-legacy archive

Kevin,
Moderator.

For those not familiar with the command line or with the JSON query language, here is a fairly simple way to map UUIDs to TCGA barcode IDs using R and a canned command in the terminal.

The first part is in R

1) Extract the file IDs from your manifest file (the one you get from the GDC after you download your data)

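setwd("C:/Here/your/manifest/directory")

manifest = "gdc_manifest_20160921_171519.txt"  # manifest file name
# GDC manifests are tab-separated with a header row (id, filename, md5, size, state)
x = read.table(manifest, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
manifest_length = nrow(x)
# Wrap each UUID in double quotes and join them with commas for the JSON payload
id = toString(sprintf('"%s"', x$id))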

2) Create Payload.txt with the commands needed

These commands are taken from the GDC API documentation: https://gdc-docs.nci.nih.gov/API/Users_Guide/Search_and_Retrieval/

Part1 = '{"filters":{"op":"in","content":{"field":"files.file_id","value":[ '
Part2 = '] }},"format":"TSV","fields":"file_id,file_name,cases.submitter_id,cases.case_id,data_category,data_type,cases.samples.tumor_descriptor,cases.samples.tissue_type,cases.samples.sample_type,cases.samples.submitter_id,cases.samples.sample_id,cases.samples.portions.analytes.aliquots.aliquot_id,cases.samples.portions.analytes.aliquots.submitter_id","size":'
# Double-quote the size; shQuote() can emit single quotes, which are not valid JSON
Part3 = paste0('"', manifest_length, '"}')
Sentence = paste(Part1, id, Part2, Part3, collapse = " ")
write.table(Sentence, "Payload.txt", quote = FALSE, col.names = FALSE, row.names = FALSE)
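As a rough sketch, the same payload can be assembled by a small function so the JSON pieces are easier to check. `build_payload()` is a hypothetical helper name, and the UUIDs and the short field list below are placeholders for illustration only; in practice the IDs would come from `x$id` in the manifest.

```r
# Sketch: build the GDC "files" query payload from a vector of file UUIDs.
# build_payload() is a hypothetical helper, not part of any GDC package.
build_payload <- function(file_ids, fields) {
  # Quote each UUID and join with commas to form the JSON array body
  id_list <- paste(sprintf('"%s"', file_ids), collapse = ",")
  paste0(
    '{"filters":{"op":"in","content":{"field":"files.file_id","value":[',
    id_list,
    ']}},"format":"TSV","fields":"', paste(fields, collapse = ","),
    '","size":"', length(file_ids), '"}'
  )
}

# Placeholder UUIDs (not real GDC files) and a shortened field list
example_ids <- c("00000000-0000-0000-0000-000000000001",
                 "00000000-0000-0000-0000-000000000002")
example_fields <- c("file_id", "cases.samples.submitter_id")

payload <- build_payload(example_ids, example_fields)
cat(payload, "\n")
# writeLines(payload, "Payload.txt") would then write it for curl to send
```

The size is taken from the number of IDs, so the API is asked for as many rows as there are files in the request.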

The second part is in the command line (CMD or terminal)

cd C:/Here/your/manifest/directory
curl --request POST --header "Content-Type: application/json" --data @Payload.txt "https://api.gdc.cancer.gov/files" > File_metadata.txt

You should now have a file called File_metadata.txt in your working folder with all the data you need.
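Once File_metadata.txt exists, something like the following could turn it into a UUID-to-barcode table. This is only a sketch: `read_gdc_metadata()` is a hypothetical helper, and the column-name pattern is an assumption about how the API flattens nested fields in its TSV output, so check the header of your own file. The demo row uses a placeholder UUID and the example barcode from the TCGA barcode documentation.

```r
# Sketch: read the TSV returned by the GDC API and pull out the barcodes.
# Assumes a "file_id" column and an aliquot-level "...submitter_id" column.
read_gdc_metadata <- function(path) {
  meta <- read.delim(path, stringsAsFactors = FALSE, check.names = FALSE)
  # Locate the aliquot submitter_id column, which holds the full TCGA barcode
  barcode_col <- grep("aliquots.*submitter_id$", names(meta), value = TRUE)[1]
  if (is.na(barcode_col)) stop("no aliquot submitter_id column found")
  data.frame(
    file_id = meta[["file_id"]],
    barcode = meta[[barcode_col]],
    # First 16 characters give sample plus vial, first 12 the participant
    sample_id = substr(meta[[barcode_col]], 1, 16),
    patient_barcode = substr(meta[[barcode_col]], 1, 12),
    stringsAsFactors = FALSE
  )
}

# Toy file standing in for File_metadata.txt (placeholder UUID)
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "file_id\tcases.0.samples.0.portions.0.analytes.0.aliquots.0.submitter_id",
  "00000000-0000-0000-0000-000000000001\tTCGA-02-0001-01C-01D-0182-01"
), tmp)

mapping <- read_gdc_metadata(tmp)
print(mapping)
```

With a real download you would call `read_gdc_metadata("File_metadata.txt")` instead of using the temporary file.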

If you get a message:

'curl' is not recognized as an operable program or batch file.

you need to install cURL on your computer (if you don't know how to do it, follow this link)

next-gen GDC R TCGA • 21k views

ADD COMMENT • link updated 8 months ago by aUser ▴ 70 • written 8.2 years ago by martinguerrerog89 ▴ 310

1


Hi, I tried to use this method to retrieve the sample IDs, but it failed. The error in File_metadata.txt is: { "message": "400 Bad Request: The browser (or proxy) sent a request that this server could not understand." }

How can I fix it? Thanks.

ADD REPLY • link 7.4 years ago by summer007 ▴ 10

0


Can you post your code?

ADD REPLY • link 5.7 years ago by jflopezfernandez ▴ 50

0


Another answer here: A: Sample names for TCGA data from GDC-legacy archive. Also check the blog of Seán Davis.

ADD REPLY • link 5.7 years ago by Kevin Blighe 88k

0


Thank you. It worked for my prostate cancer RNA-seq data.

ADD REPLY • link 7.6 years ago by morovatunc ▴ 560

0


Thanks for the post, it was very useful.

ADD REPLY • link 7.5 years ago by juanmafernandezm86 • 0

0


Thanks for this convenient solution!

I ran into two small issues in my implementation.

1

The current URL for this search should be "https://api.gdc.cancer.gov/files" rather than "https://gdc-api.nci.nih.gov/files". With the latter I get "curl: (6) Could not resolve host: gdc-api.nci.nih.gov; Unknown error".

2

In the R script

Part3= paste(shQuote(manifest_length),"}",sep="")

This wraps the size in single quotes, which caused an error in my search. After the following modification, it worked for me.

Part3= paste0("\"",manifest_length, "\"", "}") #just change single quote ' to double quote "
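A possible explanation, offered as a hedged sketch: `shQuote()` is meant for shell arguments, and to my understanding its default `type` is "sh" on Unix-alikes and "cmd" on Windows, so the same line may produce single quotes on one system and double quotes on another. The value 25 below is just a stand-in for `manifest_length`.

```r
# shQuote() quotes for a shell, so the quote character depends on "type"
size <- 25  # example value standing in for manifest_length

sh_style <- shQuote(as.character(size), type = "sh")    # wraps in single quotes
cmd_style <- shQuote(as.character(size), type = "cmd")  # wraps in double quotes

# Building the JSON fragment explicitly avoids the platform dependence
part3 <- paste0('"', size, '"}')

print(c(sh = sh_style, cmd = cmd_style, explicit = part3))
```

Only the double-quoted forms are valid inside a JSON document, which is why the explicit `paste0()` version is the safer choice.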

Thank you very much!

ADD REPLY • link 6.3 years ago by ginnyli056 ▴ 10

0


A better solution: C: Sample names for TCGA data from GDC-legacy archive

ADD REPLY • link 6.3 years ago by Kevin Blighe 88k

WouterDeCoster · Answer 1 · 2016-12-15

2


8.0 years ago

Chun-Jie Liu ▴ 280

The GDC provides an API that can be queried from the command line with curl or HTTPie to retrieve information by UUID.

I wrote a simple Python script for mapping UUIDs to TCGA barcodes (submitter IDs). Just pass it the manifest file downloaded from the GDC Data Portal. By default it queries the latest version of the API, and legacy archive UUIDs cannot be converted through the latest version. You can change the endpoint to match your version yourself.

files_endpt = "https://gdc-api.nci.nih.gov/<version>/legacy/<endpoint>"

ADD COMMENT • link 8.0 years ago by Chun-Jie Liu ▴ 280

0


Hi, I got some errors when I used it:

Traceback (most recent call last):
  File "m2s.py", line 76, in <module>
    main()
  File "m2s.py", line 73, in main
    run(args.manifest)
  File "m2s.py", line 69, in run
    gdcAPI(file_ids, manifest)
  File "m2s.py", line 62, in gdcAPI
    response = requests.post(files_endpt, json = params)
  File "//anaconda/lib/python2.7/site-packages/requests/api.py", line 88, in post
    return request('post', url, data=data, **kwargs)
  File "//anaconda/lib/python2.7/site-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'json'

Thanks,

ADD REPLY • link updated 7.4 years ago by WouterDeCoster 47k • written 7.4 years ago by summer007 ▴ 10

1


Please try Python 3 and install the requests module.

Or you can use the R version.

ADD REPLY • link 7.4 years ago by Chun-Jie Liu ▴ 280

0


I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in the toolbar when you compose or edit a post.

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0


Hi, I tried with python3 but I am getting the same error:

Traceback (most recent call last):
  File "m2s.py", line 76, in <module>
    main()
  File "m2s.py", line 73, in main
    run(args.manifest)
  File "m2s.py", line 69, in run
    gdcAPI(file_ids, manifest)
  File "m2s.py", line 62, in gdcAPI
    response = requests.post(files_endpt, json = params)
  File "//anaconda/lib/python2.7/site-packages/requests/api.py", line 88, in post
    return request('post', url, data=data, **kwargs)
  File "//anaconda/lib/python2.7/site-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'json'

Any help is much appreciated.

Thanks

ADD REPLY • link 6.2 years ago by archana.bioinfo87 ▴ 210

score 2 · Answer 2 · 2019-03-07

I had a hard time getting a number of these answers to work. Eventually I used the following sequence to convert legacy UUIDs to full barcodes and harmonized UUIDs:

First, download the JSON manifest files for your selected study and file types from the GDC legacy archive (NOT the GDC portal).

Then, use the following R script:

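library(dplyr)
library(jsonlite)
# JSON file downloaded from the GDC legacy archive
legacy = fromJSON(txt = "~/Downloads/metadata.cart.2019-03-07 (1).json")
legfnames = legacy[["file_id"]]             # legacy file UUIDs
entities = legacy[["associated_entities"]]  # one data frame of entities per file
# Stack the per-file entity tables; column_label records each table's position in the list
IDconversion = bind_rows(entities, .id = "column_label")
# Note: this assignment assumes exactly one associated entity per file
IDconversion['legacy file names'] = legfnames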
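If some files have more than one associated entity, the final assignment above would fail because the row counts no longer match. A base R sketch of one way around that is shown below. The two vectors are toy stand-ins for `legfnames` and `entities`, every ID in them is made up, and the `entity_submitter_id` column name is an assumption about the metadata JSON.

```r
# Toy stand-ins for legfnames and entities from the script above (made-up IDs)
file_ids <- c("file-uuid-1", "file-uuid-2")
entity_tables <- list(
  data.frame(entity_submitter_id = "TCGA-XX-0001-01A-01R-0001-07",
             stringsAsFactors = FALSE),
  data.frame(entity_submitter_id = c("TCGA-XX-0002-01A-01R-0001-07",
                                     "TCGA-XX-0002-11A-01R-0001-07"),
             stringsAsFactors = FALSE)
)

# Attach the file UUID to every entity row before stacking, so a file with
# several entities simply contributes several rows
per_file <- Map(
  function(id, ent) data.frame(legacy_file_id = id, ent, stringsAsFactors = FALSE),
  file_ids, entity_tables
)
id_conversion <- do.call(rbind, per_file)
rownames(id_conversion) <- NULL

print(id_conversion)
```

Here the second file appears twice in the result, once per associated entity.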

score 0 · Answer 3 · 2024-03-19

0


8 months ago

aUser ▴ 70

Great solutions on this page. The old API address has been replaced with a new one, so you need to use the new link for them to work:

Old link: https://gdc-api.nci.nih.gov/files

New link: https://api.gdc.cancer.gov/files

ADD COMMENT • link 8 months ago by aUser ▴ 70