How to retrieve metadata from the Genomic Data Commons (GDC) using the manifest file and UUIDs
5.5 years ago
a.james ▴ 240

Hello All,

I have downloaded exome datasets (aligned BAM files). I now need the metadata for the same samples. How can I download or extract it, given the manifest information for all samples?

I have read through the GDC API documentation; however, I am not clear how I could get the metadata as a TSV file.

I saw the following shell script using curl.

curl --request POST --header "Content-Type: application/json" --data @Payload.txt 'https://api.gdc.cancer.gov/files' > File_metadata.txt

However, I don't understand where to supply the list of UUIDs from the manifest file.

My questions:

  1. I have the UUIDs in my manifest files for all the samples/datasets I have downloaded; now I need the metadata in TSV format.
  2. How can I download it? Is there a shell or Python script for this?

Any help or suggestions are appreciated!

tcga exon mutations alignment next-gen • 3.0k views

What information exactly are you looking for in the metadata? I hope the following code gives you some ideas (make sure to have jq installed):

# One array element per file UUID taken from the manifest
UUIDs=(d853e541-f16a-4345-9f00-88e03c2dc0bc 74e522c6-0aad-4b9e-8d65-fe7b6da10046)

# Query the files endpoint for each UUID and print selected fields as one tab-separated row
for UUID in "${UUIDs[@]}"; do
  curl -s "https://api.gdc.cancer.gov/files/${UUID}" \
    | jq -r '.data | "\(.data_type)\t\(.file_name)\t\(.data_format)\t\(.data_category)\t\(.experimental_strategy)"'
done

Which returns:

Aligned Reads   0017ba4c33a07ba807b29140b0662cb1_gdc_realn.bam  BAM Sequencing Reads    WXS
Gene Expression Quantification  2d9744c1-0b8e-48e2-a4a5-0bbc7a637bbf.FPKM.txt.gz    TXT Transcriptome Profiling RNA-Seq

Other metadata fields are also returned (they vary from UUID to UUID):

{
  "data": {
    "data_release": "12.0 - 18.0",
    "data_type": "Aligned Reads",
    "updated_datetime": "2019-05-17T23:21:18.237724+00:00",
    "created_datetime": "2016-05-26T17:06:40.003624-05:00",
    "file_name": "0017ba4c33a07ba807b29140b0662cb1_gdc_realn.bam",
    "md5sum": "a08304b120c5df76b6532da0e9a35ced",
    "data_format": "BAM",
    "acl": [
      "phs000178"
    ],
    "access": "controlled",
    "platform": "Illumina",
    "state": "released",
    "version": "1",
    "file_id": "d853e541-f16a-4345-9f00-88e03c2dc0bc",
    "data_category": "Sequencing Reads",
    "file_size": 23650901931,
    "submitter_id": "c30188d7-be1a-4b43-9a17-e19ccd71792e",
    "type": "aligned_reads",
    "experimental_strategy": "WXS"
  },
  "warnings": {}
}
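
If you have many UUIDs, a single POST request with a filter on file_id may be simpler than looping. Below is a minimal Python sketch of that idea, not a definitive implementation. It assumes the manifest is the tab-separated file from the GDC portal with an `id` column, and that `format` set to `TSV` gives you the table you want. The helper names (read_manifest_ids, build_payload, fetch_metadata_tsv) and the chosen field list are only illustrative; adjust the fields to whatever you need.

```python
import csv
import json
import sys
import urllib.request

FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

# Illustrative field list (an assumption); swap in the fields you need.
DEFAULT_FIELDS = [
    "file_id",
    "file_name",
    "data_type",
    "experimental_strategy",
    "cases.submitter_id",
    "cases.project.project_id",
]


def read_manifest_ids(path):
    """Return the UUIDs from the 'id' column of a tab-separated GDC manifest."""
    with open(path, newline="") as handle:
        return [row["id"] for row in csv.DictReader(handle, delimiter="\t")]


def build_payload(uuids, fields):
    """Build the JSON body: an 'in' filter on file_id, with TSV output."""
    uuids = list(uuids)
    return {
        "filters": {
            "op": "in",
            "content": {"field": "file_id", "value": uuids},
        },
        "format": "TSV",
        "fields": ",".join(fields),
        # Ask for as many rows as there are UUIDs (the default page size is small).
        "size": str(len(uuids)),
    }


def fetch_metadata_tsv(payload):
    """POST the payload to the files endpoint and return the response text."""
    request = urllib.request.Request(
        FILES_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Runs only when called as: python script.py <manifest> <output.tsv>
if __name__ == "__main__" and len(sys.argv) == 3:
    manifest_path, output_path = sys.argv[1], sys.argv[2]
    ids = read_manifest_ids(manifest_path)
    tsv_text = fetch_metadata_tsv(build_payload(ids, DEFAULT_FIELDS))
    with open(output_path, "w") as out:
        out.write(tsv_text)
```

The dictionary returned by build_payload is the same kind of content that would go into Payload.txt for the curl command in the question, so the UUID list from the manifest ends up in the "value" entry of the filter.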

Thank you. I need the following information in the metadata file:

study   tcga_id     analysis_id  refassembly     mark_duplicates    exome_bed

I can also get these from the BAM header, but I was looking for a programmatic way.
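
For the BAM-header route, here is a rough Python sketch of pulling tags out of the header text. It assumes samtools is installed and that the header follows the usual SAM layout (tab-separated TAG:VALUE fields on @SQ, @RG and @PG lines). Which of the fields listed above are actually present depends on how the BAM was produced, so the mapping in summarize_header (sample from SM, assembly from AS, programs from PN/ID) is only an assumption, and both function names are made up for illustration.

```python
import subprocess


def read_bam_header(bam_path):
    """Return the header text of a BAM file via `samtools view -H`."""
    result = subprocess.run(
        ["samtools", "view", "-H", bam_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_sam_header(header_text):
    """Group header records by type, e.g. {'RG': [{'ID': ..., 'SM': ...}]}."""
    records = {}
    for line in header_text.splitlines():
        # Skip non-header lines and free-text comment records.
        if not line.startswith("@") or line.startswith("@CO"):
            continue
        record_type, *fields = line.split("\t")
        # Split on the first colon only, since values may contain colons.
        tags = dict(field.split(":", 1) for field in fields if ":" in field)
        records.setdefault(record_type[1:], []).append(tags)
    return records


def summarize_header(records):
    """Collect sample names, assembly labels and program names, if present."""
    return {
        "samples": sorted({rg["SM"] for rg in records.get("RG", []) if "SM" in rg}),
        "assemblies": sorted({sq["AS"] for sq in records.get("SQ", []) if "AS" in sq}),
        "programs": [pg.get("PN", pg.get("ID", "")) for pg in records.get("PG", [])],
    }
```

Calling summarize_header(parse_sam_header(read_bam_header(path))) for each downloaded BAM gives one dictionary per file, which could then be written out as rows of a TSV.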


Thanks for the cross-reference, Kevin!
