Hi, this summer I downloaded a few transcriptome files from TCGA. The file list contained the legacy barcode-style sample IDs (e.g. TCGA-IN-7806-11A), for example:
File ID File Name Data Category Data Type Project ID Case ID Sample ID Sample Type
4b33d026-fa23-4163-96ec-6c9b7afc91eb b5e38453-763a-44cb-a328-13b322e0b6b4.htseq.counts.gz Transcriptome Profiling Gene Expression Quantification TCGA-STAD TCGA-IN-7806 TCGA-IN-7806-01A Primary Tumor
2f780fee-cb3b-4d70-9f8c-c3b083e57526 16afa649-9060-4b95-8072-f018bd4854cc.htseq.counts.gz Transcriptome Profiling Gene Expression Quantification TCGA-STAD TCGA-IN-7806 TCGA-IN-7806-11A Solid Tissue Normal
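For reference, here is a minimal sketch of how I turn such a file list into a lookup from file name to Sample ID. It assumes the list is tab-separated with the header shown above; `load_sample_sheet` is just a name I made up for illustration.

```python
import csv
import io


def load_sample_sheet(text):
    """Map 'File Name' -> 'Sample ID' from a tab-separated file list (assumed layout)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["File Name"]: row["Sample ID"] for row in reader}


# The two rows shown above, rebuilt with tabs (assumption: the real file is TSV).
HEADER = ["File ID", "File Name", "Data Category", "Data Type",
          "Project ID", "Case ID", "Sample ID", "Sample Type"]
ROWS = [
    ["4b33d026-fa23-4163-96ec-6c9b7afc91eb",
     "b5e38453-763a-44cb-a328-13b322e0b6b4.htseq.counts.gz",
     "Transcriptome Profiling", "Gene Expression Quantification",
     "TCGA-STAD", "TCGA-IN-7806", "TCGA-IN-7806-01A", "Primary Tumor"],
    ["2f780fee-cb3b-4d70-9f8c-c3b083e57526",
     "16afa649-9060-4b95-8072-f018bd4854cc.htseq.counts.gz",
     "Transcriptome Profiling", "Gene Expression Quantification",
     "TCGA-STAD", "TCGA-IN-7806", "TCGA-IN-7806-11A", "Solid Tissue Normal"],
]
SHEET = "\n".join("\t".join(row) for row in [HEADER] + ROWS)

lookup = load_sample_sheet(SHEET)
print(lookup["b5e38453-763a-44cb-a328-13b322e0b6b4.htseq.counts.gz"])
# prints: TCGA-IN-7806-01A
```

This lookup is what I want to be able to rebuild for the newer downloads.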
However, now that I am downloading more data, the Sample ID is no longer included. I need the Sample ID so that I can correlate these files with other data I have received, where the sample is also identified by the legacy-style Sample ID.
This legacy Sample ID is known as submitter_sample_id in https://docs.gdc.cancer.gov/API/Users_Guide/Appendix_A_Available_Fields/
I have tried to retrieve it with the following query, based on the examples given here: https://docs.gdc.cancer.gov/API/Users_Guide/Python_Examples/#complex-filters
import json

import requests

# Fields to return for each file hit.
fields = [
    "file_name",
    "cases.submitter_id",
    "cases.submitter_sample_ids",
    "cases.samples.sample_type",
    "cases.disease_type",
    "cases.project.project_id"
]
fields = ",".join(fields)
files_endpt = "https://api.gdc.cancer.gov/files"
# This set of filters is nested under an 'and' operator.
filters = {
    "op": "and",
    "content": [
        {
            "op": "in",
            "content": {
                "field": "cases.submitter_id",
                "value": ["TCGA-IN-7806"]
            }
        },
        {
            "op": "in",
            "content": {
                "field": "files.experimental_strategy",
                "value": ["RNA-Seq"]
            }
        }
    ]
}
# A POST is used, so the filter parameters can be passed directly as a Dict object.
params = {
    "filters": filters,
    "fields": fields,
    "format": "JSON",
    "size": "2"
}
# The parameters are passed to 'json' rather than 'params' in this case
response = requests.post(files_endpt, headers={"Content-Type": "application/json"}, json=params)
print(json.dumps(response.json(), indent=2))
Everything is returned except cases.submitter_sample_ids:
{
  "data": {
    "hits": [
      {
        "file_name": "1fe789b0-77e7-481e-a4df-fe4c9517fb2c_gdc_realn_rehead.bam",
        "cases": [
          {
            "project": {
              "project_id": "TCGA-STAD"
            },
            "disease_type": "Adenomas and Adenocarcinomas",
            "samples": [
              {
                "sample_type": "Primary Tumor"
              }
            ],
            "submitter_id": "TCGA-IN-7806"
          }
        ],
        "id": "01e937b4-9d9b-45d3-b7eb-5665ae0900c0"
      },
      {
        "file_name": "b5e38453-763a-44cb-a328-13b322e0b6b4.htseq.counts.gz",
        "cases": [
          {
            "project": {
              "project_id": "TCGA-STAD"
            },
            "disease_type": "Adenomas and Adenocarcinomas",
            "samples": [
              {
                "sample_type": "Primary Tumor"
              }
            ],
            "submitter_id": "TCGA-IN-7806"
          }
        ],
        "id": "4b33d026-fa23-4163-96ec-6c9b7afc91eb"
      }
    ],
    "pagination": {
      "count": 2,
      "sort": "",
      "from": 0,
      "page": 1,
      "total": 8,
      "pages": 4,
      "size": 2
    }
  },
  "warnings": {}
}
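One variation I have not verified against the live API: as far as I can tell from the field list, the files endpoint exposes the barcode per sample as `cases.samples.submitter_id` rather than as `cases.submitter_sample_ids`. The sketch below builds that request and flattens a response into file name -> barcodes. The helper names (`build_payload`, `barcodes_by_file`, `fetch`) are my own, and the `example` response at the bottom is a hand-written stand-in shaped like the output above, not real API output.

```python
FILES_ENDPT = "https://api.gdc.cancer.gov/files"


def build_payload(case_ids, size=100):
    """POST body asking for each sample's own submitter_id next to its type."""
    fields = [
        "file_name",
        "cases.submitter_id",
        "cases.samples.submitter_id",  # assumption: the per-sample barcode field
        "cases.samples.sample_type",
    ]
    filters = {
        "op": "in",
        "content": {"field": "cases.submitter_id", "value": list(case_ids)},
    }
    return {"filters": filters, "fields": ",".join(fields),
            "format": "JSON", "size": str(size)}


def barcodes_by_file(response_json):
    """Collect the sample barcodes listed under every file hit."""
    result = {}
    for hit in response_json["data"]["hits"]:
        result[hit["file_name"]] = [
            sample["submitter_id"]
            for case in hit.get("cases", [])
            for sample in case.get("samples", [])
            if "submitter_id" in sample
        ]
    return result


def fetch(case_ids):
    """Send the query; needs network access and the requests package."""
    import requests  # imported here so the helpers above also work offline
    response = requests.post(FILES_ENDPT, json=build_payload(case_ids))
    response.raise_for_status()
    return response.json()


# Hand-written stand-in for a response, NOT real API output.
example = {"data": {"hits": [{
    "file_name": "b5e38453-763a-44cb-a328-13b322e0b6b4.htseq.counts.gz",
    "cases": [{
        "submitter_id": "TCGA-IN-7806",
        "samples": [{"sample_type": "Primary Tumor",
                     "submitter_id": "TCGA-IN-7806-01A"}],
    }],
}]}}
print(barcodes_by_file(example))
```

If the field behaves as assumed, `barcodes_by_file(fetch(["TCGA-IN-7806"]))` would give the mapping directly, with no separate lookup by sample UUID.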
I have been able to find the new UUID-style sample_id associated with the file_id/file_name, but I need the legacy style. I thought I had solved it with this search:
cases_endpt = 'https://api.gdc.cancer.gov/cases'
params = {
    "fields": 'sample_ids,submitter_sample_ids',  # 'submitter_sample_id, samples_id, submitter_id',
    "filters": json.dumps({
        "op": "in",
        "content": {
            "field": "submitter_sample_ids",
            "value": ["TCGA-IN-7806-01A"]
        }
    }),
    "format": "JSON",
    "size": "1000000"
}
response = requests.get(cases_endpt, params=params)
print(json.dumps(response.json(), indent=2))
resulting in:
{
  "data": {
    "hits": [
      {
        "submitter_sample_ids": [
          "TCGA-IN-7806-01A",
          "TCGA-IN-7806-01Z",
          "TCGA-IN-7806-11A",
          "TCGA-IN-7806-10A"
        ],
        "sample_ids": [
          "6f44b130-0206-4f2f-b47b-c587a8c1898b",
          "9b864c79-bf5c-4430-86e8-d8479ed90d25",
          "291c5592-1d7b-46d6-a626-e5450c4851e8",
          "bc3a2a90-b94b-4770-899e-19fb5f2e65c5"
        ],
        "id": "87c217d4-66f4-46ca-8244-7856ce658fd3"
      }
    ],
    "pagination": {
      "count": 1,
      "sort": "",
      "from": 0,
      "page": 1,
      "total": 1,
      "pages": 1,
      "size": 1000000
    }
  },
  "warnings": {}
}
But it turns out I can't rely on the order of submitter_sample_ids matching the order of sample_ids.
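To avoid depending on list order, one idea is to request the nested per-sample fields instead, so that each UUID arrives in the same object as its barcode. This is only a sketch: it assumes the cases endpoint accepts `samples.sample_id` and `samples.submitter_id` as fields, `build_params` and `uuid_to_barcode` are hypothetical helper names, and the `mock` response uses made-up placeholder IDs rather than real data.

```python
import json

CASES_ENDPT = "https://api.gdc.cancer.gov/cases"


def build_params(barcode):
    """GET parameters asking for each sample's UUID and barcode together."""
    return {
        # Assumption: nested sample fields are available on the cases endpoint.
        "fields": "samples.sample_id,samples.submitter_id",
        "filters": json.dumps({
            "op": "in",
            "content": {"field": "submitter_sample_ids", "value": [barcode]},
        }),
        "format": "JSON",
        "size": "10",
    }


def uuid_to_barcode(response_json):
    """Pair each sample UUID with the barcode found in the same sample object."""
    pairs = {}
    for hit in response_json["data"]["hits"]:
        for sample in hit.get("samples", []):
            pairs[sample["sample_id"]] = sample["submitter_id"]
    return pairs


# Placeholder response with invented IDs, only to show the expected shape.
mock = {"data": {"hits": [{
    "id": "case-uuid-placeholder",
    "samples": [
        {"sample_id": "uuid-a", "submitter_id": "BARCODE-A"},
        {"sample_id": "uuid-b", "submitter_id": "BARCODE-B"},
    ],
}]}}
print(uuid_to_barcode(mock))
# prints: {'uuid-a': 'BARCODE-A', 'uuid-b': 'BARCODE-B'}
```

With the requests package, the live call would be `requests.get(CASES_ENDPT, params=build_params("TCGA-IN-7806-01A")).json()`, and its result could be passed to `uuid_to_barcode`.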
Does anyone know a better method of retrieving the legacy-style Sample ID given a file name or file_id?
Thank you.