trying to get TCGA Sample ID (legacy barcode style sample_id) associated with downloaded files
4.9 years ago

Hi, this summer I downloaded a few transcriptome files from TCGA. The file list included the legacy barcode-style sample IDs (e.g. TCGA-IN-7806-11A), for example:

File ID File Name   Data Category   Data Type   Project ID  Case ID Sample ID   Sample Type
4b33d026-fa23-4163-96ec-6c9b7afc91eb    b5e38453-763a-44cb-a328-13b322e0b6b4.htseq.counts.gz    Transcriptome Profiling Gene Expression Quantification  TCGA-STAD   TCGA-IN-7806    TCGA-IN-7806-01A    Primary Tumor
2f780fee-cb3b-4d70-9f8c-c3b083e57526    16afa649-9060-4b95-8072-f018bd4854cc.htseq.counts.gz    Transcriptome Profiling Gene Expression Quantification  TCGA-STAD   TCGA-IN-7806    TCGA-IN-7806-11A    Solid Tissue Normal
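
For reference, the two rows above suggest how a sample-level barcode breaks down: the first three hyphen-separated fields form the Case ID, and the fourth field starts with a two-digit sample type code followed by a vial letter. A minimal sketch of that split, assuming this layout holds for every barcode; the function name `split_sample_barcode` and the `SAMPLE_TYPES` lookup are hypothetical and cover only the two codes shown in the table:

```python
# Only the two sample type codes that appear in the table above.
SAMPLE_TYPES = {"01": "Primary Tumor", "11": "Solid Tissue Normal"}


def split_sample_barcode(barcode):
    """Split a barcode like 'TCGA-IN-7806-01A' into (case_id, type_code, vial)."""
    parts = barcode.split("-")
    if len(parts) < 4 or len(parts[3]) < 2:
        raise ValueError(f"not a sample-level barcode: {barcode!r}")
    case_id = "-".join(parts[:3])   # e.g. TCGA-IN-7806
    type_code = parts[3][:2]        # e.g. 01
    vial = parts[3][2:]             # e.g. A
    return case_id, type_code, vial


case_id, type_code, vial = split_sample_barcode("TCGA-IN-7806-11A")
print(case_id, SAMPLE_TYPES.get(type_code, "unknown"), vial)
```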

However, now that I am downloading more data, the Sample ID is no longer included. I need it so that I can match these files against other data I have received, where each sample is also identified by the legacy-style Sample ID.

This legacy SampleID is known as submitter_sample_id in https://docs.gdc.cancer.gov/API/Users_Guide/Appendix_A_Available_Fields/

I have tried to retrieve it with the following query (based on the examples given here: https://docs.gdc.cancer.gov/API/Users_Guide/Python_Examples/#complex-filters):

import json
import requests

fields = [
    "file_name",
    "cases.submitter_id",
    "cases.submitter_sample_ids",
    "cases.samples.sample_type",
    "cases.disease_type",
    "cases.project.project_id"
    ]

fields = ",".join(fields)

files_endpt = "https://api.gdc.cancer.gov/files"

# This set of filters is nested under an 'and' operator.
filters = {
    "op": "and",
    "content":[
        {
        "op": "in",
        "content":{
            "field": "cases.submitter_id",
            "value": ["TCGA-IN-7806"]
            }
        },
        {
        "op": "in",
        "content":{
                "field": "files.experimental_strategy",
                "value": ["RNA-Seq"]
            }
        }
    ]
}

# A POST is used, so the filter parameters can be passed directly as a Dict object.
params = {
    "filters": filters,
    "fields": fields,
    "format": "JSON",
    "size": "2"
    }

# The parameters are passed to 'json' rather than 'params' in this case
response = requests.post(files_endpt, headers = {"Content-Type": "application/json"}, json = params)

print(json.dumps(response.json(), indent=2))

Everything is returned except cases.submitter_sample_ids:

{
  "data": {
    "hits": [
      {
        "file_name": "1fe789b0-77e7-481e-a4df-fe4c9517fb2c_gdc_realn_rehead.bam",
        "cases": [
          {
            "project": {
              "project_id": "TCGA-STAD"
            },
            "disease_type": "Adenomas and Adenocarcinomas",
            "samples": [
              {
                "sample_type": "Primary Tumor"
              }
            ],
            "submitter_id": "TCGA-IN-7806"
          }
        ],
        "id": "01e937b4-9d9b-45d3-b7eb-5665ae0900c0"
      },
      {
        "file_name": "b5e38453-763a-44cb-a328-13b322e0b6b4.htseq.counts.gz",
        "cases": [
          {
            "project": {
              "project_id": "TCGA-STAD"
            },
            "disease_type": "Adenomas and Adenocarcinomas",
            "samples": [
              {
                "sample_type": "Primary Tumor"
              }
            ],
            "submitter_id": "TCGA-IN-7806"
          }
        ],
        "id": "4b33d026-fa23-4163-96ec-6c9b7afc91eb"
      }
    ],
    "pagination": {
      "count": 2,
      "sort": "",
      "from": 0,
      "page": 1,
      "total": 8,
      "pages": 4,
      "size": 2
    }
  },
  "warnings": {}
}

I have been able to find the new UUID-style sample_id associated with each file_id/file_name, but I need the legacy style. I thought I had solved it with this search:

cases_endpt = 'https://api.gdc.cancer.gov/cases'
params = {
    "fields": "sample_ids,submitter_sample_ids",
    "filters": json.dumps({
        "op": "in",
        "content":{
            "field": "submitter_sample_ids",
            "value": ["TCGA-IN-7806-01A"]
            }
        }),
    "format":"JSON",
    "size": "1000000"
}
response = requests.get(cases_endpt, params = params)
print(json.dumps(response.json(), indent=2))

resulting in:

{
  "data": {
    "hits": [
      {
        "submitter_sample_ids": [
          "TCGA-IN-7806-01A",
          "TCGA-IN-7806-01Z",
          "TCGA-IN-7806-11A",
          "TCGA-IN-7806-10A"
        ],
        "sample_ids": [
          "6f44b130-0206-4f2f-b47b-c587a8c1898b",
          "9b864c79-bf5c-4430-86e8-d8479ed90d25",
          "291c5592-1d7b-46d6-a626-e5450c4851e8",
          "bc3a2a90-b94b-4770-899e-19fb5f2e65c5"
        ],
        "id": "87c217d4-66f4-46ca-8244-7856ce658fd3"
      }
    ],
    "pagination": {
      "count": 1,
      "sort": "",
      "from": 0,
      "page": 1,
      "total": 1,
      "pages": 1,
      "size": 1000000
    }
  },
  "warnings": {}
}

But it turns out I can't rely on the order of submitter_sample_ids matching the order of sample_ids, so the two lists cannot be paired by position.

Does anyone know a better method of retrieving the legacy-style Sample ID given a file name or file_id?

Thank you.

4.9 years ago

I also received a response from TCGA support.

The biospecimen TSV download contains a file called samples.tsv, which contains the required IDs. I had missed it this time because I downloaded the data with the GDC download tool from a manifest, which does not include the biospecimen files.
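
A minimal sketch of turning that file into a lookup from the UUID-style sample ID to the legacy barcode. The column names `sample_id` and `sample_submitter_id` are assumptions about the TSV header and should be checked against the actual file; `load_sample_map` is a hypothetical helper, and the UUID in the demo is made up:

```python
import csv
import io


def load_sample_map(tsv_file, id_col="sample_id", barcode_col="sample_submitter_id"):
    """Return {uuid-style sample id: legacy barcode} from an open TSV file.

    The default column names are assumptions; pass the real ones if they differ.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return {row[id_col]: row[barcode_col] for row in reader}


# In-memory demo with a made-up UUID; for a real file use
# open("samples.tsv", newline="") instead of io.StringIO.
demo = io.StringIO(
    "sample_id\tsample_submitter_id\n"
    "00000000-0000-0000-0000-000000000001\tTCGA-IN-7806-01A\n"
)
sample_map = load_sample_map(demo)
print(sample_map)
```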

4.9 years ago

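This blog post, https://seandavi.github.io/post/2017-12-29-genomicdatacommons-id-mapping/, helped me identify that the field I was actually looking for on the files endpoint is:

cases.samples.submitter_id

not:

cases.submitter_sample_ids

Because the ID is nested inside each sample record, it stays attached to its own sample. Therefore the desired query is: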

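import requests

files_endpt = "https://api.gdc.cancer.gov/files"

# Ask for the legacy sample barcode nested under each file's cases.
params = {
    "fields": "cases.samples.submitter_id",
    "format": "JSON",
    "size": "1000000"
    }

response = requests.get(files_endpt, params = params)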
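
A rough sketch of flattening such a response into a file-to-barcode lookup. It assumes the hits have the same nesting as the responses shown in the question (a `cases` list, each with a `samples` list, plus a top-level `id` holding the file ID); `file_to_sample_barcodes` is a hypothetical helper, and the canned response below stands in for `response.json()` so the example runs without network access:

```python
def file_to_sample_barcodes(response_json):
    """Return {file id: [legacy sample barcodes]} from a files-endpoint response."""
    mapping = {}
    for hit in response_json["data"]["hits"]:
        mapping[hit["id"]] = [
            sample["submitter_id"]
            for case in hit.get("cases", [])
            for sample in case.get("samples", [])
            if "submitter_id" in sample
        ]
    return mapping


# Canned response in the assumed shape; the file ID and barcode pair
# comes from the file list in the question.
canned = {
    "data": {
        "hits": [
            {
                "id": "4b33d026-fa23-4163-96ec-6c9b7afc91eb",
                "cases": [{"samples": [{"submitter_id": "TCGA-IN-7806-01A"}]}],
            }
        ]
    }
}
print(file_to_sample_barcodes(canned))
```

The values are lists because a file may be linked to more than one sample.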