I'm using TCGA gene expression data. At some point in my work I need to do survival analysis. Is there any way to get information from TCGA to do survival analysis on the samples for which I have gene expression data?
I'm currently in the middle of something similar - the TCGA Bioinformatics team very kindly helped me out.
If you want to get the raw data yourself, it is in the "Clinical" data. These can be downloaded as text or XML - I've mostly looked at the XML files. I believe there is normally one file for the patient, and one file for every sample taken. (Normally there's just one sample, obtained at time of surgery.)
The problem is that dates in the clinical data, such as date of death, have been redacted to preserve patient privacy. I think that all dates have been replaced with values giving the number of days since original diagnosis.
If you just want to do a survival curve, you are looking for the number under the XML tag "days_to_death".
The day the particular sample was taken is under "days_to_sample_procurement" (i.e. number of days between diagnosis and sample procurement). I think you could find other useful numbers by just doing a find for "days_to".
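The "find for days_to" idea can be sketched in Python with the standard library. The XML below is a made-up, heavily shortened stand-in (the namespace URI and values are hypothetical); real TCGA clinical files are far larger and use several namespaces, which is why the sketch strips the namespace prefix before comparing tag names.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for a TCGA clinical XML file.
EXAMPLE_XML = """<tcga_bcr xmlns:shared="http://example.org/shared">
  <patient>
    <shared:vital_status>Dead</shared:vital_status>
    <shared:days_to_death>512</shared:days_to_death>
    <shared:days_to_last_followup></shared:days_to_last_followup>
    <shared:days_to_sample_procurement>14</shared:days_to_sample_procurement>
  </patient>
</tcga_bcr>
"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def find_days_to_fields(xml_text):
    """Return {tag name: int or None} for every element whose name starts with 'days_to'."""
    root = ET.fromstring(xml_text)
    fields = {}
    for element in root.iter():
        name = local_name(element.tag)
        if name.startswith("days_to"):
            text = (element.text or "").strip()
            fields[name] = int(text) if text else None  # empty element -> missing value
    return fields


days = find_days_to_fields(EXAMPLE_XML)
for name, value in days.items():
    print(name, value)
```

If a file repeats a tag (for example in several follow-up records), this sketch keeps only the last occurrence, so a real parser would need to decide which record to trust.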
Hope this helps,
Stephanie
It's easy to fetch those data with R.
TCGA-Assembler is a very good tool for you to get those data easily.
This assumes that you are familiar with R.
First, download the tool and unpack it.
Second, load its functions:
source("Module_A.R")
Third, run the following call.
DownloadClinicalData(
  traverseResultFile = "./DirectoryTraverseResult_Jul-08-2014.rda",
  saveFolderName = "./UserManualExampleData/RawData.TCGA-Assembler",  # set the output directory
  cancerType = "BLCA",  # choose the cancer type
  clinicalDataType = c("patient", "drug", "follow_up", "radiation")  # choose the types of clinical data to download
)
If you only want the data for survival analysis, you can choose just follow_up, then use the days_to_death and days_to_last_follow_up columns in the file as the death and censoring times for survival analysis.
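To make the death/censoring logic concrete, here is a minimal sketch in plain Python (not part of TCGA-Assembler) that turns the two columns into (time, event) pairs and computes a Kaplan-Meier estimate by hand. The rows are made up, and the function names are hypothetical; in R you would normally hand the same pairs to a survival package instead.

```python
def to_time_and_event(days_to_death, days_to_last_follow_up):
    """A recorded death is an observed event; otherwise censor at last follow-up."""
    if days_to_death is not None:
        return days_to_death, True
    if days_to_last_follow_up is not None:
        return days_to_last_follow_up, False
    return None  # no usable survival information for this patient


def kaplan_meier(observations):
    """observations: list of (time, event). Returns [(time, survival)] at each death time."""
    survival = 1.0
    curve = []
    death_times = sorted({time for time, event in observations if event})
    for t in death_times:
        at_risk = sum(1 for time, _ in observations if time >= t)
        deaths = sum(1 for time, event in observations if event and time == t)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Made-up rows: (days_to_death, days_to_last_follow_up)
rows = [(100, None), (None, 200), (300, None), (None, 400), (None, None)]
observations = [
    obs for obs in (to_time_and_event(*row) for row in rows) if obs is not None
]
curve = kaplan_meier(observations)
for time, survival in curve:
    print(time, survival)
```

The last row has neither value and is dropped, which is the same filtering you would need before fitting any survival model.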
Or you can just get the clinical data from this weblink.
good luck~
I have a strong opinion against using TCGA data for survival analysis; please correct me if I am wrong.
If you check days_to_death or days_to_last_contact, you will find values that reach back 2000 days or more, to well before TCGA even started. My suspicion is that these patients came from other programs and were diagnosed before the TCGA project. If I am correct, there is a huge bias here: only patients who were still alive were later recruited into TCGA, while those from these legacy programs who had already died are hidden and never show up in TCGA. I guess most people who have used TCGA data for survival analysis never thought about this.
So these dates need to be adjusted to the TCGA dates by subtracting either days_to_collection or days_to_procuration of the samples. The new problem is that the second is almost always empty, while the first is about 80% empty. This means that, starting with a 500-patient project, you get about 400 patients with either days_to_death or days_to_last_contact available, and drop to fewer than 100 with days_to_collection. That number is not enough for any kind of survival comparison by, say, biomarker or clinical category.
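As a rough illustration of the subtraction I am describing, here is a small Python sketch. The patient records and the function name are made up; the field names follow the ones mentioned above. It moves the time origin from diagnosis to sample collection and drops anyone whose collection day is missing, which is exactly where the sample size collapses.

```python
def adjust_to_collection(patients):
    """Re-express follow-up relative to sample collection instead of diagnosis."""
    adjusted = []
    for p in patients:
        died = p["days_to_death"] is not None
        end = p["days_to_death"] if died else p["days_to_last_contact"]
        start = p["days_to_collection"]
        if end is None or start is None:
            continue  # cannot place this patient on the new time axis
        if end < start:
            continue  # follow-up ends before collection; treated as unusable here
        adjusted.append({"id": p["id"], "time": end - start, "event": died})
    return adjusted


# Made-up patients; None marks an empty field.
patients = [
    {"id": "A", "days_to_death": 900, "days_to_last_contact": None, "days_to_collection": 700},
    {"id": "B", "days_to_death": None, "days_to_last_contact": 2500, "days_to_collection": None},
    {"id": "C", "days_to_death": None, "days_to_last_contact": 1200, "days_to_collection": 1000},
    {"id": "D", "days_to_death": None, "days_to_last_contact": None, "days_to_collection": 50},
]
adjusted = adjust_to_collection(patients)
print(len(patients), "patients before,", len(adjusted), "after adjustment")
for row in adjusted:
    print(row)
```

Comparing the counts before and after on a real cohort is a quick way to see how many patients survive the adjustment.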
Try the Synapse platform (you need to register, but you can sign in with a Google account).
https://www.synapse.org/#!Synapse:syn300013
For example, here you can find survival data for Lung Squamous Cell Carcinoma.
Even if this is a little late, you can analyze survival by using the example here:
http://bioinformatics.mdanderson.org/Supplements/ResidualDisease/Reports/osCurves.html
That's the main part about overall survival (in ovarian cancer), but it also has links on how to build the dataset and set up your own analysis for your preferred tumor type.
This should be the easiest way: you can select datasets from PROGgene or upload your own. FYI: it also has datasets from TCGA.
http://watson.compbio.iupui.edu/chirayu/proggene/database/?url=proggene
Reference: http://www.biomedcentral.com/1471-2407/14/970/abstract
You can also check previous posts explaining how to download Clinical data from TCGA.
accidentally posted in wrong comment section, sorry!
A website for breast cancer survival curves by subtype: luminal A, luminal B, basal, HER2 and normal-like. http://tumorsurvival.org/
Thanks, but which value should I take out? I've looked at its XML file and found the line with the tag days_to_death. It looks like this:
That's interesting. I presume the XML file works like an HTML file, so you want the value in between the two tags. (I've replaced the angle brackets with square ones because Biostar is interpreting them as HTML.)
e.g. (tags shortened a bit)
I've had a look at an example file, and it looks to me like if there is a missing value the file contains the start tag but not the end tag. In this case, you are missing the days_to_death, which suggests the patient is still alive.
If you look at the example below, the days_to_death value is also missing, but the vital status is "Alive" and there is a value for days to last followup.
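Here is a minimal Python sketch of that rule, assuming an empty value appears as an element with no text (the XML is a made-up, shortened example without the real namespaces, and the function names are hypothetical). A patient with no days_to_death is treated as censored at days_to_last_followup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened example: days_to_death is present but empty.
EXAMPLE_XML = """<patient>
  <vital_status>Alive</vital_status>
  <days_to_death/>
  <days_to_last_followup>731</days_to_last_followup>
</patient>
"""


def read_text(root, name):
    """Return the stripped text of the first element called `name`, or None if empty or absent."""
    for element in root.iter():
        if element.tag.rsplit("}", 1)[-1] == name:  # ignore any namespace prefix
            text = (element.text or "").strip()
            return text or None
    return None


def survival_record(xml_text):
    """Build a time/event record: death if days_to_death is filled in, else censored."""
    root = ET.fromstring(xml_text)
    death = read_text(root, "days_to_death")
    followup = read_text(root, "days_to_last_followup")
    record = {"vital_status": read_text(root, "vital_status")}
    if death is not None:
        record.update(time=int(death), event=True)
    else:
        # time stays None if the follow-up value is missing as well
        record.update(time=int(followup) if followup is not None else None, event=False)
    return record


record = survival_record(EXAMPLE_XML)
print(record)
```

It is worth checking that vital_status agrees with the derived event flag, since a record marked "Dead" with no days_to_death cannot be used as it stands.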
Thanks, but what is xsd_ver="1.12"?
Hey dirigible2012 & Stephanie, is there a file that explains the XML tags for the clinical data? I am also doing survival analysis and looking at the XML files, and they seem really large and convoluted. It's taking time to understand them, so I was wondering whether there is a guide describing the XML tags; then I could parse out the necessary information. I might need other clinical data as well in the future.
thanks so much
We (at SolveBio) have actually gone through the individual clinical patient information files for each TCGA cancer type and parsed out some of this information. See https://www.solvebio.com/library/TCGA/1.2.0-2015-02-11/PatientInformation for more information about the data, and this ipython notebook for an example of how to access the data (SolveBio is free for academic/noncommercial use, so sign up and try it out). It was kind of a mess, but I think we've done a decent job. ICGC is quite a bit easier to work with and includes a lot of TCGA.