Hi all, I'm interested in doing an association study using germline SNPs in specific pathways, tumor mutations, and clinical data from a patient population in TCGA. I'm currently in the process of getting access to the datasets. If anyone has experience with this type of study, any advice on how I could go about it would be greatly appreciated.
Thanks in advance!
Thanks a lot, Kevin. We have already done a study where we looked at the association between SNPs/certain haplotypes and increased risk for BrCa. We want to see whether our findings hold in another, similar dataset. Since we are in the process of getting access to restricted TCGA data, I thought of analyzing the tumor mutations to look for any association with the SNPs/haplotypes of interest.
Hey, I have done a lot of research myself on breast cancer, including with TCGA data. If you expect to get access to the restricted data, and assuming that you have some bioinformatics expertise, then I think it would be useful to re-analyse from the BAM file stage (to produce variant listings as VCFs) in order to ensure that you're calling germline and somatic variants in the same way across all samples. As I mentioned, the publicly available TCGA mutation data is MAF-formatted (Mutation Annotation Format), and different centers called somatic variants in different ways.
If you have the capacity to do the above at your institute/dept. (including personnel, compute power, etc.), then a very interesting analysis would be to convert the VCF data into PLINK format and conduct your association analysis there. I recently posted a tutorial on how to convert VCF to PLINK: "Produce PCA for 1000 Genomes Phase III in VCF format"
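To give a rough idea of the conversion step, here is a minimal Python sketch that only assembles the PLINK 1.9 command line (--vcf, --make-bed, --out, and optionally --double-id) and runs it on request. The file names, the output prefix, and the helper names are hypothetical, and it assumes a `plink` binary is on your PATH; the linked tutorial remains the reference for the full workflow.

```python
import shutil
import subprocess


def build_plink_vcf_command(vcf_path, out_prefix, plink_binary="plink",
                            double_id=False):
    """Return the argument list that converts a VCF into PLINK binary files
    (.bed/.bim/.fam). Nothing is executed here."""
    cmd = [plink_binary, "--vcf", vcf_path, "--make-bed", "--out", out_prefix]
    if double_id:
        # Use the VCF sample ID as both family ID and individual ID.
        cmd.insert(3, "--double-id")
    return cmd


def convert_vcf_to_plink(vcf_path, out_prefix, dry_run=True):
    """Print the command; run it only when dry_run is False and plink exists."""
    cmd = build_plink_vcf_command(vcf_path, out_prefix)
    print(" ".join(cmd))
    if not dry_run:
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError("plink executable not found on PATH")
        subprocess.run(cmd, check=True)
    return cmd


# Hypothetical file names, shown as a dry run only.
example_cmd = convert_vcf_to_plink("cohort.germline.vcf.gz", "cohort_germline")
```

Keeping the command construction separate from execution makes it easy to log exactly what was run for each batch of samples.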
When you get the metadata for the breast cancer samples, you could additionally format it as a phenotype file for PLINK and do all sorts of cool analyses, such as adjusting for different traits and BrCa sub-types.
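As a sketch of what that formatting could look like: PLINK 1.9 phenotype files are whitespace-delimited with FID and IID as the first two columns, binary traits coded 1 (control) / 2 (case), and -9 as the default missing value. The clinical field names (`sample_id`, `er_status`, `age`) and the records below are made up for illustration; the real TCGA metadata columns will differ.

```python
import csv
import io

MISSING = "-9"  # PLINK's default missing-phenotype code


def binary_code(value, case_label, control_label):
    """Map a two-level clinical label to PLINK's 1 = control / 2 = case coding."""
    if value == case_label:
        return "2"
    if value == control_label:
        return "1"
    return MISSING


def write_pheno(records, handle):
    """Write a tab-delimited PLINK phenotype file with a header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["FID", "IID", "ER_POS", "AGE"])
    for rec in records:
        sample = rec["sample_id"]  # used as both FID and IID here
        er = binary_code(rec.get("er_status"), "Positive", "Negative")
        age = rec.get("age")
        writer.writerow([sample, sample, er, MISSING if age is None else str(age)])


# Made-up records standing in for the clinical metadata.
toy_records = [
    {"sample_id": "S1", "er_status": "Positive", "age": 54},
    {"sample_id": "S2", "er_status": "Negative", "age": None},
    {"sample_id": "S3", "er_status": "Indeterminate", "age": 61},
]
buffer = io.StringIO()
write_pheno(toy_records, buffer)
pheno_text = buffer.getvalue()
```

A file like this can then be passed to PLINK with --pheno, selecting a column with --pheno-name; a similarly laid out file passed with --covar is one way to adjust for other traits.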
There are undoubtedly many types of analyses that one could do. I've just outlined the one that I think matches best what you're aiming to do. One other that may be of interest is to conduct a lasso regression of all mutation data to find the best predictors of an end-point of interest. This has been done and published in the past by the Caldas group at Cambridge, I believe, but they may not have looked at all end-points.
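To illustrate the lasso idea only (this is not the published method mentioned above), here is a small pure-Python sketch of lasso linear regression fitted by coordinate descent on a made-up binary mutation matrix. It assumes a continuous end-point; for a binary or survival end-point you would use a penalised logistic or Cox model from an established library instead, and choose the penalty `lam` by cross-validation.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * ||y - b0 - X b||^2 + lam * ||b||_1.

    X is a list of rows (samples x features); the intercept is unpenalised
    and handled by centering X and y.
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    residual = [v - y_mean for v in y]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            z = sum(Xc[i][j] ** 2 for i in range(n)) / n
            if z == 0.0:
                continue  # constant feature carries no information
            rho = sum(Xc[i][j] * (residual[i] + Xc[i][j] * beta[j])
                      for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / z
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= Xc[i][j] * delta
                beta[j] = new_beta
    intercept = y_mean - sum(col_means[j] * beta[j] for j in range(p))
    return intercept, beta


# Toy data: rows are samples, columns are mutated (1) / wild-type (0) genes.
# The end-point depends only on the first gene by construction.
X_toy = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [1, 1, 1],
         [0, 0, 1], [0, 1, 0], [0, 0, 0], [0, 1, 1]]
y_toy = [3.0, 3.0, 3.0, 3.0, 0.0, 0.0, 0.0, 0.0]
intercept, beta = lasso_coordinate_descent(X_toy, y_toy, lam=0.1)
selected = [j for j, b in enumerate(beta) if b != 0.0]
```

The features left with non-zero coefficients in `selected` are the candidate predictors; the L1 penalty is what pushes the uninformative ones to exactly zero.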
Thanks for the information! I really appreciate it. I have some experience in analyzing datasets, but this will be my first venture into TCGA data. However, the clinical correlations with the SNPs will be done by our collaborator. I'd really appreciate it if you could direct me to any similar study that you or anyone else has done, so I can get a better idea.