I have recently discovered a potential biomarker and would like to validate its prognostic value in the TCGA dataset on late-stage melanoma. I realized that one can make survival curves from the days_to_last_followup and days_to_death tabs, but the problem is that those survival data do not fully correspond to the related sequencing data. For instance, a patient staged as stage I melanoma can have a submitted_tumor_site of "Regional Lymph Node", which is incompatible with stage I. In other words, the staging was done at the time of the original (earliest) diagnosis, while the submitted sample came from a relapsing tumor at a later date (and most likely a higher stage). If I were to apply my biomarker to this set, such a sample would in my opinion be mis-staged: the sequenced tumor has stage III/IV characteristics but is recorded as stage I.
An alternative approach would be to select samples based on the site of the submitted biopsy (for instance including only tumors that have spread into the regional lymph node), taking into account the fact that the biopsy was taken a number of days after the earliest diagnosis (the days_to_submitted_specimen_dx would provide me with that number). The problem with this is that (again) the staging should be taken into account, as staging obviously is a major determinant of outcome. Therefore, my question is whether the stage at the time the submitted biopsy was taken is available, and if so where I can find that (I have checked https://tcga-data.nci.nih.gov/docs/dictionary/ but did not find it there). If not, could anyone suggest to me what would be a fair alternative for coupling sequencing data to survival?
Thanks!
PS: Sorry for being verbose, but I found that survival- and staging-related questions about the TCGA database are underrepresented, and other Biostarrers might benefit from a slightly longer version of this post.
I don't have any great suggestions here. TCGA tumors were cobbled together from whoever was willing to provide samples, and the clinical data is generally pretty lacking.
Thanks for your answer; I have sent an email to TCGA to check whether I am missing something, but I fear that this kind of data is indeed not available. If anything comes of it I will update the post.
Hi, for others who may also need to figure out this information: explanations of these terms can be found in the Clinical and Biospecimen section of the GDC documentation viewer.
I have submitted my question to the TCGA, and I am pasting their entire answer below:
We may not have the exact time interval with corresponding staging as
you request, but below is an explanation of each clinical variable. I
hope that you can use this information for your analysis:
The only overall stage that TCGA collected for SKCM is the
"ajcc_pathologic_tumor_stage" in the clinical_patient_skcm.txt file.
As you indicated, this reflects the stage at initial pathologic
diagnosis and this diagnosis is not necessarily the event that yielded
the biospecimen sent to the BCR. Unfortunately, TCGA did not collect
the stage specifically at the time that the specimen sent to the BCR
was obtained.
The "days_to_initial_pathologic_diagnosis" indicates the date of
initial melanoma diagnosis. The "submitted_tumor_dx_days_to" indicates
the date of diagnosis for the sample submitted to the BCR (actually
days from the initial melanoma diagnosis).
There is also a "days_to_sample_procurement" in the
nationwidechildrens.org_ssf_tumor_samples_skcm.txt file. This
indicates the days to cancer sample procurement for the sample
submitted to the BCR for TCGA in relation to the date of initial
melanoma diagnosis.
If you filter "days_to_sample_procurement" for 0 (or within a number
of days) and use primary tumor (submitted_tumor_site) samples, the
"ajcc_pathologic_tumor_stage" should reflect the stage at the time the
submitted biopsy was taken.
Indeed, as suggested by the TCGA, the days_to_sample_procurement is the more accurate tab to define the date that the tumor was obtained (rather than the days_to_submitted_specimen_dx I mentioned in my original post).
Without wanting to dive into the pathology reports (yet), I see a number of possibilities:
1. Filter based on the site of the biopsy. For instance, if submitted_tumor_site is "Distant Metastasis", the sample is by definition from a stage IV tumor; if it is "Regional Lymph Node", it should be stage III or stage IV. In this case, the number of days to use for the survival curves is last_contact_days_to - days_to_sample_procurement (censored) or death_days_to - days_to_sample_procurement (not censored).
2. Filter for days_to_sample_procurement around 0 days. As suggested in the reply by the TCGA team, the stage obtained from ajcc_pathologic_tumor_stage should then reflect the stage at the time the biopsy was taken. The formulas above for calculating the days for the survival curves can be used here too.
3. Ignore the mis-staging of samples (not my favourite option!).
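For anyone who wants to try the first two options, here is a minimal sketch in plain Python. It assumes the clinical and biospecimen tables have already been merged into one dict per patient. The field names are the ones mentioned in this thread. The helper names, the "patient" key, the example rows, and the placeholder strings for missing values (such as "[Not Available]") are my own assumptions for illustration and should be checked against the actual files.

```python
from typing import Optional, Tuple


def to_days(value: Optional[str]) -> Optional[int]:
    """Parse a day count; anything non-numeric (e.g. a placeholder) becomes None."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def survival_from_procurement(row: dict) -> Optional[Tuple[int, int]]:
    """Return (days, event) with time zero at sample procurement.

    event is 1 for death (not censored) and 0 for last contact (censored).
    """
    start = to_days(row.get("days_to_sample_procurement"))
    death = to_days(row.get("death_days_to"))
    last = to_days(row.get("last_contact_days_to"))
    if start is None:
        return None
    if death is not None:
        days, event = death - start, 1
    elif last is not None:
        days, event = last - start, 0
    else:
        return None
    # A negative interval means follow-up ended before the sample was taken.
    return (days, event) if days >= 0 else None


def select(rows, sites=None, max_offset=None):
    """Keep rows by biopsy site and/or by |days_to_sample_procurement| <= max_offset."""
    out = []
    for row in rows:
        if sites is not None and row.get("submitted_tumor_site") not in sites:
            continue
        start = to_days(row.get("days_to_sample_procurement"))
        if max_offset is not None and (start is None or abs(start) > max_offset):
            continue
        result = survival_from_procurement(row)
        if result is not None:
            out.append((row["patient"], *result))
    return out


# Hypothetical rows, not real TCGA patients.
rows = [
    {"patient": "P1", "submitted_tumor_site": "Regional Lymph Node",
     "days_to_sample_procurement": "200", "death_days_to": "950",
     "last_contact_days_to": "[Not Applicable]"},
    {"patient": "P2", "submitted_tumor_site": "Primary Tumor",
     "days_to_sample_procurement": "0", "death_days_to": "[Not Applicable]",
     "last_contact_days_to": "1200"},
    {"patient": "P3", "submitted_tumor_site": "Distant Metastasis",
     "days_to_sample_procurement": "[Not Available]", "death_days_to": "400",
     "last_contact_days_to": "[Not Applicable]"},
]

# Option 1: select by biopsy site.
print(select(rows, sites={"Regional Lymph Node"}))
# Option 2: sample taken within 30 days of the initial diagnosis.
print(select(rows, sites={"Primary Tumor"}, max_offset=30))
```

The resulting (days, event) pairs can then be passed to whatever survival package you normally use. Patients without a usable procurement day are dropped rather than guessed.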
Hi, I was looking through this and apparently there is a field days_to_collection, with the exact same definition as days_to_sample_procurement given in CDE. Furthermore the definition of the latter is different in GDC and says that it's the time interval between the collection and procurement date. I contacted TCGA and was told that days_to_collection is the right field for interval between diagnosis and sample collection.
Unfortunately, the values for that field span several hundred days, so filtering on it does not seem like a feasible way to go.
Hi, I am trying to validate some genes using RNA-seq levels. I am facing the same questions and feel confused. I am used to working with samples obtained at diagnosis, but a lot of samples in the SKCM dataset were obtained at a later date. As stated above, the range is huge.
Rachel gave an interesting example, and an explanation of the sampling process, in her answer to "Consistency of clinical data from TCGA (melanoma)":
So, my interpretation of patient TCGA-ER-A2NE (from my original question), is: The patient originally presented in 2007 with a melanoma in situ with no metastasis, and that original melanoma was located on the extremities. 567 days later the patient had a metastasis in the ileum. This metastasis in the ileum is the sample that was submitted to TCGA. The patient died of the metastasis 613 days after diagnosis.
My question is what should be taken as time zero. Because the sample obtained at 567 days might relate to the death 46 days later and not to the death 613 days after diagnosis. So, t.kuilman's proposal sounds great, but is it the state-of-the-art for that dataset or such a situation?
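To make the two candidate time origins concrete, here is the arithmetic for that patient as a tiny Python sketch. The numbers come from the example above; the function name is just an illustrative choice of mine.

```python
def follow_up_days(days_to_death: int, days_to_origin: int = 0) -> int:
    """Days from the chosen time zero to death; both inputs count from diagnosis."""
    return days_to_death - days_to_origin


days_to_death = 613   # death, counted from the initial diagnosis
days_to_sample = 567  # metastasis sampled, counted from the initial diagnosis

print(follow_up_days(days_to_death))                  # time zero = diagnosis
print(follow_up_days(days_to_death, days_to_sample))  # time zero = sampling
```

With time zero at diagnosis the survival time is 613 days; with time zero at sampling it is 46 days.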
I would try looking directly into the pathology report PDFs, which should contain data based on the biopsy.
That is indeed something I did not think of yet, but it seems only feasible if you are working on medium-size cohorts.