Question

Trying to get the TCGA Sample ID (legacy barcode-style sample_id) associated with downloaded files


5.6 years ago

markddesimone ▴ 60

Hi. This summer I downloaded a few transcriptome files from TCGA. The file list contained the legacy barcode-style sample IDs, e.g. TCGA-IN-7806-11A.

For example:

| File ID | File Name | Data Category | Data Type | Project ID | Case ID | Sample ID | Sample Type |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4b33d026-fa23-4163-96ec-6c9b7afc91eb | b5e38453-763a-44cb-a328-13b322e0b6b4.htseq.counts.gz | Transcriptome Profiling | Gene Expression Quantification | TCGA-STAD | TCGA-IN-7806 | TCGA-IN-7806-01A | Primary Tumor |
| 2f780fee-cb3b-4d70-9f8c-c3b083e57526 | 16afa649-9060-4b95-8072-f018bd4854cc.htseq.counts.gz | Transcriptome Profiling | Gene Expression Quantification | TCGA-STAD | TCGA-IN-7806 | TCGA-IN-7806-11A | Solid Tissue Normal |

However, now when I download more data, the Sample ID is no longer included. I need the Sample ID so I can correlate these files with other data I have received, where the sample is also identified by the legacy-style Sample ID.

This legacy SampleID is known as submitter_sample_id in https://docs.gdc.cancer.gov/API/Users_Guide/Appendix_A_Available_Fields/

I have tried to retrieve it with the following query (based on the examples given here: https://docs.gdc.cancer.gov/API/Users_Guide/Python_Examples/#complex-filters):

import json
import requests

fields = [
    "file_name",
    "cases.submitter_id",
    "cases.submitter_sample_ids",
    "cases.samples.sample_type",
    "cases.disease_type",
    "cases.project.project_id"
    ]

fields = ",".join(fields)

files_endpt = "https://api.gdc.cancer.gov/files"

# This set of filters is nested under an 'and' operator.
filters = {
    "op": "and",
    "content":[
        {
        "op": "in",
        "content":{
            "field": "cases.submitter_id",
            "value": ["TCGA-IN-7806"]
            }
        },
        {
        "op": "in",
        "content":{
                "field": "files.experimental_strategy",
                "value": ["RNA-Seq"]
            }
        }
    ]
}

# A POST is used, so the filter parameters can be passed directly as a Dict object.
params = {
    "filters": filters,
    "fields": fields,
    "format": "JSON",
    "size": "2"
    }

# The parameters are passed to 'json' rather than 'params' in this case
response = requests.post(files_endpt, headers = {"Content-Type": "application/json"}, json = params)

print(json.dumps(response.json(), indent=2))

Everything is returned except cases.submitter_sample_ids:

{
  "data": {
    "hits": [
      {
        "file_name": "1fe789b0-77e7-481e-a4df-fe4c9517fb2c_gdc_realn_rehead.bam",
        "cases": [
          {
            "project": {
              "project_id": "TCGA-STAD"
            },
            "disease_type": "Adenomas and Adenocarcinomas",
            "samples": [
              {
                "sample_type": "Primary Tumor"
              }
            ],
            "submitter_id": "TCGA-IN-7806"
          }
        ],
        "id": "01e937b4-9d9b-45d3-b7eb-5665ae0900c0"
      },
      {
        "file_name": "b5e38453-763a-44cb-a328-13b322e0b6b4.htseq.counts.gz",
        "cases": [
          {
            "project": {
              "project_id": "TCGA-STAD"
            },
            "disease_type": "Adenomas and Adenocarcinomas",
            "samples": [
              {
                "sample_type": "Primary Tumor"
              }
            ],
            "submitter_id": "TCGA-IN-7806"
          }
        ],
        "id": "4b33d026-fa23-4163-96ec-6c9b7afc91eb"
      }
    ],
    "pagination": {
      "count": 2,
      "sort": "",
      "from": 0,
      "page": 1,
      "total": 8,
      "pages": 4,
      "size": 2
    }
  },
  "warnings": {}
}

I have been able to find the new UUID-style sample_id associated with the file_id/file_name, but I need the legacy style. I thought I had solved it with this search:

cases_endpt = 'https://api.gdc.cancer.gov/cases'
params = {
    "fields":'sample_ids,submitter_sample_ids',#'submitter_sample_id, samples_id, submitter_id',
    "filters": json.dumps({
        "op": "in",
        "content":{
            "field": "submitter_sample_ids",
            "value": ["TCGA-IN-7806-01A"]
            }
        }),
    "format":"JSON",
    "size": "1000000"
}
response = requests.get(cases_endpt, params = params)
print(json.dumps(response.json(), indent=2))

This results in:

{
  "data": {
    "hits": [
      {
        "submitter_sample_ids": [
          "TCGA-IN-7806-01A",
          "TCGA-IN-7806-01Z",
          "TCGA-IN-7806-11A",
          "TCGA-IN-7806-10A"
        ],
        "sample_ids": [
          "6f44b130-0206-4f2f-b47b-c587a8c1898b",
          "9b864c79-bf5c-4430-86e8-d8479ed90d25",
          "291c5592-1d7b-46d6-a626-e5450c4851e8",
          "bc3a2a90-b94b-4770-899e-19fb5f2e65c5"
        ],
        "id": "87c217d4-66f4-46ca-8244-7856ce658fd3"
      }
    ],
    "pagination": {
      "count": 1,
      "sort": "",
      "from": 0,
      "page": 1,
      "total": 1,
      "pages": 1,
      "size": 1000000
    }
  },
  "warnings": {}
}

But it turns out I can't rely on the order of submitter_sample_ids matching the order of sample_ids.

Does anyone know any better method of retrieving the legacy style SampleID given a file name or file_id?

Thank you.


score 1 · Answer 1 · 2020-01-07


5.6 years ago

markddesimone ▴ 60

I also received a response from TCGA support.

The biospecimen TSV download contains a file called samples.tsv, which contains the required IDs. I had missed downloading it this time because I used the GDC download tool with a manifest.
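For anyone taking the samples.tsv route, here is a minimal sketch of turning that file into a lookup from the UUID-style sample ID to the legacy barcode. The column names `sample_id` and `sample_submitter_id` are my assumption about the TSV header, so check them against your own download; the helper name `load_sample_barcodes` and the UUID values in the inline example are placeholders, not real GDC data.

```python
import csv
import io


def load_sample_barcodes(tsv_file):
    """Map sample UUID -> legacy barcode from an open samples.tsv handle.

    Assumes the header has 'sample_id' and 'sample_submitter_id' columns
    (an assumption; verify against your own biospecimen download).
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return {row["sample_id"]: row["sample_submitter_id"] for row in reader}


# Tiny in-memory stand-in for samples.tsv; the UUIDs are placeholders.
example_tsv = io.StringIO(
    "case_submitter_id\tsample_id\tsample_submitter_id\tsample_type\n"
    "TCGA-IN-7806\tsample-uuid-1\tTCGA-IN-7806-01A\tPrimary Tumor\n"
    "TCGA-IN-7806\tsample-uuid-2\tTCGA-IN-7806-11A\tSolid Tissue Normal\n"
)

mapping = load_sample_barcodes(example_tsv)
print(mapping)
```

With a real download you would pass `open("samples.tsv", newline="")` instead of the in-memory example.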


score 0 · Answer 2 · 2020-01-07

This blog post https://seandavi.github.io/post/2017-12-29-genomicdatacommons-id-mapping/ helped me identify that the field I'm actually looking for is:

cases.samples.submitter_id

not:

cases.submitter_sample_ids

Therefore the desired query is:

params = {
    "fields": "cases.samples.submitter_id",
    "format": "JSON",
    "size": "1000000"
    }

response = requests.get(files_endpt, params = params)
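Putting this together with the filter from the question, a fuller sketch might look like the following. Because `cases.samples.sample_id` and `cases.samples.submitter_id` come back inside the same nested sample object, the UUID and the barcode stay paired, which avoids the ordering problem seen with `submitter_sample_ids`. The helper names (`build_params`, `barcodes_by_file`, `fetch_barcodes`) are my own, and the default `size` of 100 is an arbitrary choice.

```python
import json

FILES_ENDPT = "https://api.gdc.cancer.gov/files"


def build_params(case_submitter_id, size=100):
    """Build GET parameters for the files endpoint, filtered to one case."""
    filters = {
        "op": "in",
        "content": {"field": "cases.submitter_id", "value": [case_submitter_id]},
    }
    return {
        # For a GET request the filter must be serialized to a JSON string.
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,"
                  "cases.samples.sample_id,cases.samples.submitter_id",
        "format": "JSON",
        "size": str(size),
    }


def barcodes_by_file(hits):
    """Map each file_name to the legacy sample barcodes nested under it."""
    mapping = {}
    for hit in hits:
        mapping[hit["file_name"]] = [
            sample["submitter_id"]
            for case in hit.get("cases", [])
            for sample in case.get("samples", [])
            if "submitter_id" in sample
        ]
    return mapping


def fetch_barcodes(case_submitter_id):
    """Query the GDC API (needs network access) and return the mapping."""
    # Imported here so the parsing helper works without the dependency.
    import requests

    response = requests.get(FILES_ENDPT, params=build_params(case_submitter_id))
    response.raise_for_status()
    return barcodes_by_file(response.json()["data"]["hits"])
```

Calling `fetch_barcodes("TCGA-IN-7806")` should then return a dict keyed by file name, with a list of legacy barcodes for each file.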