Question

TCGA Segment Mean, GISTIC and CNVs

9.8 years ago

Jimbou ▴ 960

Hi,

I have questions regarding the CNV calls calculated from TCGA.

As I understand it, they used a CBS (circular binary segmentation) algorithm to find segments that are changed relative to a reference, and the segment mean measures this change: in general, it is the mean log2 ratio of the probe intensities within the segment.

Segments can then be called as deletions or duplications when they fall beyond a threshold, which you define yourself (several papers used +/-0.2).

Sample           Chromosome  Start    End       Number_of_probes  Segment_Mean
TCGA-CC-A8HV-01  chr1        51598    5999008   100               -0.0325
TCGA-CC-A8HV-01  chr1        6001979  6002289   153               -2.1264
TCGA-CC-A8HV-01  chr1        6002874  14443436  2                 -0.0923
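
To make the thresholding idea concrete, here is a minimal Python sketch that applies the +/-0.2 cutoff mentioned above to the three example rows. The function names are made up for illustration, the rows are re-typed as tab-separated text, and the copy-number conversion 2 * 2^(segment mean) assumes a pure, diploid sample, which tumour samples usually are not. It is not the TCGA pipeline, only an illustration of the rule.

```python
import csv
import io

# The three example rows from the post, re-typed as tab-separated text.
SEG_TEXT = (
    "Sample\tChromosome\tStart\tEnd\tNumber_of_probes\tSegment_Mean\n"
    "TCGA-CC-A8HV-01\tchr1\t51598\t5999008\t100\t-0.0325\n"
    "TCGA-CC-A8HV-01\tchr1\t6001979\t6002289\t153\t-2.1264\n"
    "TCGA-CC-A8HV-01\tchr1\t6002874\t14443436\t2\t-0.0923\n"
)


def classify_segment(segment_mean, threshold=0.2):
    """Label a segment using a symmetric, user-chosen log2-ratio cutoff."""
    if segment_mean >= threshold:
        return "gain"
    if segment_mean <= -threshold:
        return "loss"
    return "neutral"


def estimated_copy_number(segment_mean):
    """Convert a log2 ratio to copy number, assuming a pure diploid sample."""
    return 2 * 2 ** segment_mean


def call_segments(text, threshold=0.2):
    """Return (chromosome, start, end, call, estimated copy number) per segment."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    calls = []
    for row in reader:
        mean = float(row["Segment_Mean"])
        calls.append(
            (
                row["Chromosome"],
                int(row["Start"]),
                int(row["End"]),
                classify_segment(mean, threshold),
                round(estimated_copy_number(mean), 2),
            )
        )
    return calls


if __name__ == "__main__":
    for call in call_segments(SEG_TEXT):
        print(call)
```

With this cutoff only the middle segment is called as a loss; the other two stay within the noise band.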

Afterwards, TCGA recalculated (to improve?) the CNV detection results in cancer samples by running GISTIC2 on the segmentation data. Is this right?

I compared some of the segment mean data with the GISTIC2 results (estimates) for cancer samples and found differences at the gene and sample level.

If GISTIC2 provides better results, do I have to use a similar algorithm for non-cancer (healthy) samples and germline CNVs? Which tools would those be? Can I use GISTIC for that as well?

Thanks.

TCGA CNV Affy GISTIC2 • 17k views

Updated 2.5 years ago by Ram 44k • written 9.8 years ago by Jimbou ▴ 960

Hi Jimbou! Were you able to find the answer to your question? I would like to know which tools are used by TCGA to analyze SNP array data for copy number analysis.

8.8 years ago by Dataman ▴ 380

Ram · Accepted Answer · 2015-04-13

Hi Jimbou,

I'm struggling with similar questions, since the threshold used is often stated in the literature without any explanation or reasoning. This is what I have found so far concerning GISTIC (see http://www.cbioportal.org/faq.jsp):

What is GISTIC? What is RAE?

Copy number data sets within the portal are generated by GISTIC or RAE algorithms. Both algorithms attempt to identify significantly altered regions of amplification or deletion across sets of patients. Both algorithms also generate putative gene/patient copy number specific calls, which are then input into the portal.

For TCGA studies, the table in all_thresholded.by_genes.txt (which is the part of the GISTIC output that is used to determine the copy-number status of each gene in each sample in cBioPortal) is obtained by applying both low- and high-level thresholds to the gene copy levels of all the samples. The entries with value +/- 2 exceed the high-level thresholds for amps/dels, and those with +/- 1 exceed the low-level thresholds but not the high-level thresholds. The low-level thresholds are just the 'amp_thresh' and 'del_thresh' noise threshold input values to GISTIC (typically 0.1 or 0.3) and are the same for every sample.

By contrast, the high-level thresholds are calculated on a sample-by-sample basis and are based on the maximum (or minimum) median arm-level amplification (or deletion) copy number found in the sample. The idea, for deletions anyway, is that this level is a good approximation for hemizygous loss given the purity and ploidy of the sample. The actual cutoffs used for each sample can be found in a table in the output file sample_cutoffs.txt. All GISTIC output files for TCGA are available at: gdac.broadinstitute.org.
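
The two-level scheme described above can be sketched in a few lines of Python. This is only an illustration of the rule as quoted, not GISTIC's own code: the function name, the sample names, and the cutoff values are hypothetical, and the per-sample high-level cutoffs are taken as given inputs (as they would be read from sample_cutoffs.txt) rather than derived from arm-level medians.

```python
def threshold_call(value, *, amp_thresh, del_thresh, high_amp, high_del):
    """Map a continuous gene-level copy value to a thresholded call in -2..2.

    amp_thresh and del_thresh are positive noise thresholds shared by all
    samples; high_amp (positive) and high_del (negative) are the per-sample
    high-level cutoffs. The sketch assumes the high-level cutoffs lie beyond
    the low-level ones and refuses to run otherwise.
    """
    if high_amp < amp_thresh or high_del > -del_thresh:
        raise ValueError("high-level cutoffs must lie beyond the low-level ones")
    if value > high_amp:
        return 2
    if value > amp_thresh:
        return 1
    if value < high_del:
        return -2
    if value < -del_thresh:
        return -1
    return 0


# Hypothetical per-sample cutoffs: sample -> (high_del, high_amp).
SAMPLE_CUTOFFS = {
    "sample_A": (-0.6, 0.9),
    "sample_B": (-1.1, 0.4),
}

# Hypothetical gene-level values: (sample, gene) -> continuous copy value.
GENE_VALUES = {
    ("sample_A", "GENE1"): -0.8,
    ("sample_B", "GENE1"): -0.8,
    ("sample_A", "GENE2"): 0.05,
}


def threshold_table(gene_values, sample_cutoffs, amp_thresh=0.1, del_thresh=0.1):
    """Apply the shared low-level and per-sample high-level cutoffs."""
    calls = {}
    for (sample, gene), value in gene_values.items():
        high_del, high_amp = sample_cutoffs[sample]
        calls[(sample, gene)] = threshold_call(
            value,
            amp_thresh=amp_thresh,
            del_thresh=del_thresh,
            high_amp=high_amp,
            high_del=high_del,
        )
    return calls


if __name__ == "__main__":
    for key, call in threshold_table(GENE_VALUES, SAMPLE_CUTOFFS).items():
        print(key, call)
```

In this toy input the same value of -0.8 gets a different call in the two samples because their high-level deletion cutoffs differ, which is one reason thresholded GISTIC calls need not match a fixed cutoff applied to segment means.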

Hope this helps, though I have not yet managed to obtain a copy of 'sample_cutoffs.txt' for my cancer cohort. If you find any more information, please share.

Cheers