Dear all,
I am looking into genes of interest affected by CNVs using TCGA data. I am very confused about the immensely different results I get depending on the data source I use:
The GDC data portal (also available via the TCGAbiolinks R package) provides a simple data.frame (genes by patients, with -1 for losses, 0 for no change and 1 for gains). This is how GDC CNV data was computed. This is hg19. Here, I tend to get VERY few CNVs.
The Xena browser provides gistic2 thresholded files, which again is a simple table (genes / patients with -2,-1,0,1,2, for homozygous deletion, single copy deletion, diploid normal copy, low-level copy number amplification, or high-level copy number amplification). This is, however, hg18. Here, I get a lot of CNVs.
Finally, when I manually intersect the Masked Copy Number Segment file (GDC data, but at the CNV segment level, downloaded via the TCGAbiolinks R package) with gene annotations and apply the same noise cutoff as suggested in the link above, I tend to get slightly fewer CNVs than from the Xena data but still many more than stated on the GDC portal. This is hg19.
So I am confused. Is the GDC gene-level data computed differently? Or are these just homozygous losses / high-level copy number amplifications? I would very much appreciate input, as I do not know which data to use.
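One way to test the second possibility is to count events in a 5-level thresholded table twice: once with any non-zero call, and once with only the -2 / 2 calls. The sketch below uses invented calls for a single gene, and the function name `altered_fraction` is made up for illustration. If the "deep only" fraction lands near the GDC figure, that would be consistent with the GDC table reporting only high-level events, although it would not prove it.

```python
from typing import List


def altered_fraction(calls: List[int], min_level: int) -> float:
    """Fraction of samples whose absolute call is at least min_level."""
    if not calls:
        raise ValueError("no calls given")
    return sum(1 for c in calls if abs(c) >= min_level) / len(calls)


# Hypothetical thresholded calls for one gene across 10 patients
# (-2 deep deletion, -1 shallow deletion, 0 neutral, 1 gain, 2 amplification).
toy_calls = [-1, -1, 0, 1, -2, 0, 1, 0, -1, 2]

any_event = altered_fraction(toy_calls, min_level=1)  # counts -2, -1, 1, 2
deep_only = altered_fraction(toy_calls, min_level=2)  # counts -2 and 2 only

print(f"any event: {any_event:.0%}, deep events only: {deep_only:.0%}")
```

With these toy values, 7 of 10 samples carry some event but only 2 of 10 carry a deep one, which shows how strongly the definition of "altered" can move the reported frequency.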
Thanks so much!
Thanks for your input!!
Yes - these are then the Masked Copy Number Segment files. And then they use these files to compute the gene-level data.
I absolutely agree, but what bugs me is that if I use the GDC Masked Copy Number Segment files and overlap them with gene annotations, as you have done in PART III (A: How to extract the list of genes from TCGA CNV data), I get completely different results compared with the gene-level data from GDC, even though this is the same data source. I have not applied the other steps you described, as I just want a vector for each of my genes of interest with a status (loss, none, gain) across the patients. So I downloaded the data via TCGAbiolinks, overlapped it with the annotation, applied a noise cutoff of 0.3 on the absolute segment mean, removed all segments with fewer than 300 probes, and computed a status. These results are actually more similar to the gene-level results from Xena (gistic2 thresholded files) or Firebrowse (CopyNumber_Gistic2.Level_4 - all_data_by_genes.txt files). There are differences, but far fewer: I get an overlap of about 80 %, which I find OK, especially as Xena and Firebrowse report more. Depending on the data source, I find 40-60 % losses/gains for a gene of interest, for example, which is also reported in a publication. But the gene-level GDC data says there are only 3 % losses/gains, and that's what I find strange. That's an immense difference.
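To make the procedure above concrete, here is a minimal Python sketch of the per-sample, per-gene status call (the actual analysis was done in R). Several things are assumptions of mine and not from the post or from GDC: the `Segment` class and `gene_status` function are hypothetical names, the coordinates and values are invented, and when several qualifying segments overlap a gene the sketch simply takes the most extreme one, which is only one of several possible tie-breaking rules. The fields mirror the Chromosome, Start, End, Num_Probes and Segment_Mean columns of the segment files.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    chrom: str
    start: int
    end: int
    num_probes: int
    segment_mean: float  # log2 ratio relative to two copies


def gene_status(segments: List[Segment], chrom: str, gene_start: int,
                gene_end: int, cutoff: float = 0.3,
                min_probes: int = 300) -> int:
    """Return -1 (loss), 0 (none) or 1 (gain) for one gene in one sample."""
    best = None
    for seg in segments:
        # Skip segments that do not overlap the gene.
        if seg.chrom != chrom or seg.end < gene_start or seg.start > gene_end:
            continue
        # Skip short segments and segments inside the noise band.
        if seg.num_probes < min_probes or abs(seg.segment_mean) <= cutoff:
            continue
        # Assumption: keep the most extreme qualifying segment.
        if best is None or abs(seg.segment_mean) > abs(best.segment_mean):
            best = seg
    if best is None:
        return 0
    return 1 if best.segment_mean > 0 else -1


# Invented segments for a single sample.
segments = [
    Segment("9", 1_000_000, 30_000_000, 5200, -0.62),
    Segment("9", 30_000_001, 30_050_000, 12, 1.40),   # too few probes
    Segment("17", 35_000_000, 40_000_000, 900, 0.45),
    Segment("17", 40_000_001, 80_000_000, 8000, 0.05),  # within noise band
]

loss_call = gene_status(segments, "9", 21_900_000, 22_000_000)
gain_call = gene_status(segments, "17", 37_800_000, 37_900_000)
neutral_call = gene_status(segments, "17", 50_000_000, 50_100_000)
filtered_call = gene_status(segments, "9", 30_010_000, 30_020_000)
print(loss_call, gain_call, neutral_call, filtered_call)
```

Running the same function over every sample gives the loss/none/gain vector per gene. The `cutoff` and `min_probes` defaults match the values quoted above and are worth varying when comparing against the other sources.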
Hence my question: Is the GDC gene-level data computed differently? Or are these just homozygous losses / high-level copy number amplifications? Or can I really expect such large differences?
Thanks so much for your input!
PS: I wanted to stick with TCGAbiolinks, as the rest of the analysis is based on that and I would like to stick to the same data source.
Well, TCGA gives the exact GISTIC 2.0 command that they used in the link. The last time I obtained copy number data directly from the GDC, this extra GISTIC step was not implemented.
So, the steps for the harmonized data appear to be (starting with the raw signal data from the Affymetrix SNP 6.0 chip):
I'm aware that it's frustrating. The consortium (TCGA) got their Nature publications and then moved on to other areas. The data generated runs into petabytes. As funding dried up, there were fewer resources to maintain the data. Some of the open-access third-level data, though, is 'dangerous', in my opinion, as it contains so many inconsistencies and so much bias. They could have just made the raw data available to everybody.
TCGAbiolinks (and other third-party sources) add extra confusion to this because they utilise this third-level data, which itself is constantly evolving, as we can see.
As long as you document your steps and version control everything, you will be fine.