I am retrieving RNA-seq data from TCGA (https://gdc-portal.nci.nih.gov/). I can download the expression data and the clinical data, but I am struggling to map the two: I don't know how to match a patient's ID in the mRNA data to the corresponding ID in the clinical data. Is there a way to sort this out? Thanks in advance.
The thing is, for the clinical data all I have are XML files, from which I can still extract the patient barcode. The problem is the expression data: each file comes with a sample ID that does not seem to correspond to the barcode.
I got it in the end. I had not looked at the metadata file. Once downloaded, it gives the one-to-one correspondence between sample ID and file name.
May I ask how you mapped a patient's ID in the mRNA data to the corresponding ID in the clinical data? I've seen the MANIFEST as you mentioned, but the IDs in the clinical data and the mRNA data are not consistent...
When you download the respective files (expression, clinical, biospecimen, etc.), there is also an option to download the metadata file along with them. That is a JSON file, and each entry has a field 'entity_submitter_id' holding a barcode (TCGA-..-...) that identifies the patient and is common to all the respective metadata files. For mRNA the barcode is an extended form that also tells you the sample type (normal or cancer), so you have to cut it down to the patient part before mapping it to a patient. This is also why the mRNA data can have more entries than there are patients: the same patient may appear two or more times, for example once for the tumor and once for the adjacent normal tissue sample. For the clinical and biospecimen files, the total number of entries equals the total number of patients.
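To make the "break the barcode" step concrete, here is a minimal sketch in Python. It assumes the standard TCGA barcode layout, where the first three dash-separated fields (12 characters) are the patient barcode and the first two digits of the fourth field are the sample type code (01 to 09 tumor, 10 to 19 normal). The function names `split_barcode` and `tissue_class` are made up for this example.

```python
def split_barcode(barcode):
    """Split an extended TCGA barcode into (patient_id, sample_type_code).

    Assumes the usual layout: TCGA-<site>-<participant>-<sample+vial>-...
    """
    parts = barcode.split("-")
    patient_id = "-".join(parts[:3])   # e.g. TCGA-02-0001
    sample_code = int(parts[3][:2])    # e.g. "01C" -> 1
    return patient_id, sample_code


def tissue_class(sample_code):
    """Map a sample type code to a coarse class (assumed TCGA code ranges)."""
    if 1 <= sample_code <= 9:
        return "tumor"
    if 10 <= sample_code <= 19:
        return "normal"
    return "other"


patient, code = split_barcode("TCGA-02-0001-01C-01D-0182-01")
print(patient, code, tissue_class(code))
```

The patient part can then be compared directly with the patient barcode extracted from the clinical XML files.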
For more information about barcodes follow the link posted by @russhh above.
Just in case anyone needs it, here is some example code in Python because JSON is fiddly.
If you download the metadata JSON file along with your FPKM data, you can match your files to their info like this:
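import json

# Metadata JSON downloaded alongside the data files (name as in the original post)
fileName = 'metadata.cart.2017-06-RESTOFID.json'
with open(fileName) as data_file:
    data = json.load(data_file)

# Print each file name next to the barcode of its first associated entity
for entry in data:
    print(entry['file_name'], entry['associated_entities'][0]['entity_submitter_id'])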
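If you want a lookup table rather than printed lines, something along these lines might work. It is only a sketch: it assumes the metadata has the same shape as in the loop above (a list of entries with 'file_name' and 'associated_entities'), and that the patient barcode is the first three dash-separated fields of 'entity_submitter_id'. The function name `files_by_patient` and the two example entries are invented for illustration and are not real GDC records.

```python
from collections import defaultdict


def files_by_patient(metadata):
    """Group (file_name, full_barcode) pairs by patient barcode."""
    grouped = defaultdict(list)
    for entry in metadata:
        for entity in entry.get("associated_entities", []):
            barcode = entity["entity_submitter_id"]
            # Assumption: patient barcode = first three fields, e.g. TCGA-XX-0001
            patient_id = "-".join(barcode.split("-")[:3])
            grouped[patient_id].append((entry["file_name"], barcode))
    return dict(grouped)


# Hypothetical entries shaped like the metadata JSON (not real data)
example = [
    {"file_name": "tumor.FPKM.txt.gz",
     "associated_entities": [{"entity_submitter_id": "TCGA-XX-0001-01A-11R-0000-07"}]},
    {"file_name": "normal.FPKM.txt.gz",
     "associated_entities": [{"entity_submitter_id": "TCGA-XX-0001-11A-11R-0000-07"}]},
]

for patient, files in files_by_patient(example).items():
    print(patient, files)
```

With real data you would pass the list returned by json.load instead of the example list. A patient with both tumor and normal samples ends up with several files under one key, which you can then join to the clinical records by patient barcode.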
Have they changed the structure of the TCGA barcodes?