I recently downloaded the TCGA colorectal clinical data from the GDC portal and got the following files:
nationwidechildrens.org_clinical_patient_coad.txt
nationwidechildrens.org_clinical_patient_read.txt
I combined both files, which gives data for 628 patients in total. Among them I see:
563 - Alive
65 - Dead
For example:
times bcr_patient_barcode patient.vital_status
49 TCGA-5M-AAT4 Dead
290 TCGA-5M-AAT6 Dead
154 TCGA-3L-AA1B Alive
1200 TCGA-5M-AATE Alive
648 TCGA-A6-2671 Alive
All 628 patients have a value for Days_to_Last_followup.
I also checked the cBioPortal TCGA Provisional colorectal clinical data (cbioportal colorectal). There the patient_vital_status counts are different:
502 - Alive
130 - Dead
8 - NA
In this dataset, almost 60 patients have NA for Days_to_Last_followup. I'm interested in doing survival analysis, and I am now confused about which dataset to use.
For example:
times bcr_patient_barcode patient.vital_status
NA TCGA-5M-AAT4 Dead
NA TCGA-5M-AAT6 Dead
154 TCGA-3L-AA1B Alive
1200 TCGA-5M-AATE Alive
648 TCGA-A6-2671 Dead
So, for the patients above, GDC and cBioPortal show different information.
It looks like the cBioPortal clinical data is the more up-to-date one, as it shows more patients as Dead. But why do some patients in the cBioPortal clinical data not have Days_to_Last_followup? Which of the two is the right one for the analysis?
Thank you.
The GDC should be the most up to date, as it is the primary source of TCGA data. cBioPortal is a third party (developed at MSKCC) that is not part of the NIH. The issue is that the clinical data may be referencing different samples / aliquots. cBioPortal may also have imputed missing values that they encountered in the original data that they pulled from the GDC.
I would always go by the data at the GDC because it is the primary source. Discrepancies between the GDC and third-party websites are a common finding. You will be fine as long as you quote the exact source and version of your data. If no version is available, then date-stamp it in your methods.
Obviously patients cannot come back to life, so there are logical reasons behind the discrepancies that you observe.
If the GDC is the more up-to-date source compared to cBioPortal, why do I see 65 Dead in GDC and 130 Dead in cBioPortal? That is not a small difference.
Should I ask the GDC community about this?
They could simply be referencing different patients from the same cancer - I am not sure. I have also heard that the GDC clinical data contains errors. It would be interesting to also see how the patient numbers appear on the GDC Legacy Archive. I would contact both cBioPortal (MSKCC) and GDC.
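To check whether both sources really cover the same patients, something like the following minimal Python sketch could help. It only uses the five barcodes quoted in the question as stand-in data; the dictionaries `gdc` and `cbio` and the function `compare_sources` are hypothetical names, and in practice you would fill the dictionaries from the downloaded clinical files.

```python
# Stand-in data: barcode -> vital status, copied from the rows quoted in the question.
# In a real analysis these would be read from the GDC and cBioPortal clinical files.
gdc = {
    "TCGA-5M-AAT4": "Dead",
    "TCGA-5M-AAT6": "Dead",
    "TCGA-3L-AA1B": "Alive",
    "TCGA-5M-AATE": "Alive",
    "TCGA-A6-2671": "Alive",
}
cbio = {
    "TCGA-5M-AAT4": "Dead",
    "TCGA-5M-AAT6": "Dead",
    "TCGA-3L-AA1B": "Alive",
    "TCGA-5M-AATE": "Alive",
    "TCGA-A6-2671": "Dead",
}


def compare_sources(a, b):
    """Return shared barcodes, barcodes unique to each source, and status mismatches."""
    shared = sorted(set(a) & set(b))
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    # Among shared patients, keep those whose vital status disagrees.
    mismatched = [barcode for barcode in shared if a[barcode] != b[barcode]]
    return shared, only_a, only_b, mismatched


shared, only_gdc, only_cbio, mismatched = compare_sources(gdc, cbio)
print("shared patients:", len(shared))
print("only in GDC:", only_gdc)
print("only in cBioPortal:", only_cbio)
print("vital status differs:", mismatched)
```

If the "only in" lists are empty, the two portals describe the same patients and the difference in Dead counts has to come from the status values themselves.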
As the analyst, in certain situations, the best we can do is just date-stamp and version control the data that's given to us, i.e., in order to protect our own butts.
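One way to date-stamp a download, sketched here in Python as an assumption rather than any official GDC or cBioPortal procedure, is to store a checksum of each file together with the source name and the date you recorded it. The function name `provenance_record` and the field names are hypothetical.

```python
import hashlib
from datetime import date


def provenance_record(path, source):
    """Build a small record describing a downloaded file for the methods section."""
    sha = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            sha.update(chunk)
    return {
        "file": path,
        "source": source,
        # Date on which this record was created, not necessarily the release date.
        "recorded_on": date.today().isoformat(),
        "sha256": sha.hexdigest(),
    }
```

Calling `provenance_record("nationwidechildrens.org_clinical_patient_coad.txt", "GDC")` right after downloading gives you something concrete to quote, and the checksum lets you confirm later that the file has not changed.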
I second your suggestion (y)
Yes, there may be different patients in cBioPortal and GDC, but in my question there is one patient, TCGA-A6-2671, who is Alive in GDC and Dead in cBioPortal.
The information / paper trail for the patient may be difficult to find. Another option: just set all discrepancies between the GDC and cBioPortal to NA, although then you reduce your sample n.
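As a rough illustration of the set-to-NA option, here is a minimal Python sketch using the five example rows from the question, with `None` standing in for NA. The function `mask_discrepancies` and the `(time, status)` tuple layout are assumptions made for this example only.

```python
# Stand-in data: barcode -> (times, vital status), copied from the question.
gdc = {
    "TCGA-5M-AAT4": (49, "Dead"),
    "TCGA-5M-AAT6": (290, "Dead"),
    "TCGA-3L-AA1B": (154, "Alive"),
    "TCGA-5M-AATE": (1200, "Alive"),
    "TCGA-A6-2671": (648, "Alive"),
}
cbio = {
    "TCGA-5M-AAT4": (None, "Dead"),
    "TCGA-5M-AAT6": (None, "Dead"),
    "TCGA-3L-AA1B": (154, "Alive"),
    "TCGA-5M-AATE": (1200, "Alive"),
    "TCGA-A6-2671": (648, "Dead"),
}


def mask_discrepancies(a, b):
    """For patients present in both sources, keep a field only if both sources agree."""
    merged = {}
    for barcode in sorted(set(a) & set(b)):
        time_a, status_a = a[barcode]
        time_b, status_b = b[barcode]
        time = time_a if time_a == time_b else None
        status = status_a if status_a == status_b else None
        merged[barcode] = (time, status)
    return merged


merged = mask_discrepancies(gdc, cbio)
# Patients that still have both a time and a status after masking.
usable = [bc for bc, (time, status) in merged.items() if time is not None and status is not None]
print(len(usable), "of", len(merged), "patients remain usable")
```

On these five rows only two patients keep both fields, which shows how quickly the usable n can shrink with this approach.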
I see that the patients are the same in both portals.
From the same place where I downloaded patient clinical data for both colon and rectal in GDC, I have also downloaded the following files
I see that the vital status in these files differs from the patient clinical data. What are these follow_up files?
Those follow up files may be defined here: https://gdc.cancer.gov/about-data/data-harmonization-and-generation/clinical-data-harmonization
Getting the most out of the clinical data from the TCGA is indeed difficult, I admit. It has a high level of missingness.