Hi all,
I'm trying to understand TCGA's Level 3 copy number data. Specifically, I found two tables that appear to be made via ASCAT, and I want to know what the column names mean and how the data has been processed. I've read the GDC copy number pipeline documentation (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/), but it doesn't mention these tables or ASCAT.
The gene-level copy number data looks like this (full tables here, image link here):
Columns: gene ID, gene name, chromosome, start, end, copy_number, min_copy_number, max_copy_number.
I'd like to use it, but I need to know:
- Are these absolute copy number calls? (not comparisons to the germline or something)
- What do 'min_copy_number' and 'max_copy_number' mean?
- What do NAs in the copy number columns mean? There are about 500 of them in this particular BRCA table (~1%)
I found another table there which has ASCAT allele-specific segment copy number calls and looks like this (image here):
Columns: GDC_Aliquot, chromosome, start, end, copy_number, major_copy_number, minor_copy_number
I thought the gene-level copy number calls might come from intersecting this table with gene locations, but some (~0.3%) of the gene-level calls don't match the ones in this table. I thought the major allele copy number and minor allele copy number here might relate to min copy number and max copy number in the gene-level table, but they often don't match and are probably different things given they don't add up to the total copy number (they're usually just equal).
In general, I would like to know what processing steps were taken to arrive at these tables. (I'm honestly just guessing the gene-level copy numbers come from ASCAT because they show up when you tick 'ASCAT2' in TCGA filters...) For example:
- are they filtered to remove segments that are copy-number altered in the patient's germline, or that are frequent CNVs in the population? (I ask because I think ASCAT assumes the matched normal is diploid.)
- did they use Circular Binary Segmentation or ASCAT's own ASPCF method?
- did they do GC correction?
- how do these ASCAT tables relate to the other copy number tables available in that section of TCGA, e.g. the copy number segment table? Are they derived from there or computed independently from CEL files/SNP6 arrays?
- how do the two ASCAT tables relate to each other?
Is there documentation anywhere that explains in detail what processing steps were done? Or the code for each step of the pipeline?
Thanks a million for any help.
Hello. Unfortunately, like many large projects, the TCGA Consortium has always struggled to make its methods clear. I searched just now and could not find any information on the ASCAT2 data processing steps. It can be inferred from THIS page that a Sanger Institute workflow called ascatNGS produced the data; the results may then simply have been uploaded to the GDC for long-term archival without any thought given to explaining how the data was produced.
It is indicated that the ASCAT2 gene-level and allele-specific copy number data are generated from the Affymetrix SNP 6.0 array data. I know this Affymetrix data very well and can infer that min and max relate to the minimum and maximum copy number across each gene body (multiple array probes target each gene, each potentially giving back a different signal).
I am unsure about the allele-specific 'major' and 'minor' (edit - see Zhenyu Zhang's comment, below).
To get more specific information, you may very well have to contact the TCGA, or [hopefully] one of the analysts who produced the data may pick up this question (but this is unlikely).
To be frank, if I need to use TCGA copy number data, I follow Tiago's steps in his F1000 workflow, with my adapted version here (sorry, it's a bit messy): How to extract the list of genes from TCGA CNV data. This involves taking the TCGA array data from the Broad Institute and deriving recurrent somatic copy number alterations (sCNA).
Thanks a million, Kevin!
Firstly, thanks for the explanation re: max/min copy number, that makes sense and it just helped me figure out what’s going on with the occasional mismatches between the segment data and the copy number data—the cases where a segment doesn’t match the total copy number of a gene overlapping it occur when there are multiple segments overlapping a gene, and these are the same cases where the gene’s max and min copy number differ. So for example a particular gene overlaps two segments, with total CNs 8 (seg A) and 2 (seg B). The gene’s CN = 8 with a max of 8 and a min of 2, like you said, and that produces one mismatch from seg B. (Although, having expected the averaging to be done using something like segment length or number of SNPs, I’m surprised to see that segment A is only 8% as long as segment B.)
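The multi-segment case described above can be sketched in a few lines of Python. This is only an illustration of the overlap logic, not the GDC's actual code: the `Segment` and `Gene` types, the function name `gene_cn_range`, and all coordinates are made up for the example. It deliberately reports only the min and max, since how the single gene-level `copy_number` value is chosen is not documented in this thread.

```python
from typing import List, NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    chromosome: str
    start: int
    end: int
    copy_number: int


class Gene(NamedTuple):
    name: str
    chromosome: str
    start: int
    end: int


def gene_cn_range(gene: Gene, segments: List[Segment]) -> Optional[Tuple[int, int, int]]:
    """Return (min_cn, max_cn, n_segments) over all segments overlapping the gene.

    Returns None when no segment overlaps the gene.
    """
    overlapping = [
        s.copy_number
        for s in segments
        if s.chromosome == gene.chromosome
        and s.start <= gene.end      # segment begins before the gene ends
        and s.end >= gene.start      # segment ends after the gene begins
    ]
    if not overlapping:
        return None
    return min(overlapping), max(overlapping), len(overlapping)


# Toy data (invented coordinates): seg A has CN 8, seg B has CN 2.
segments = [
    Segment("chr1", 1_000, 9_000, 8),      # seg A
    Segment("chr1", 9_001, 109_000, 2),    # seg B
    Segment("chr2", 1, 50_000, 3),
]
genes = [
    Gene("GENE_X", "chr1", 8_000, 12_000),   # straddles seg A and seg B
    Gene("GENE_Y", "chr2", 10_000, 20_000),  # inside a single segment
    Gene("GENE_Z", "chr3", 100, 200),        # no segment on this chromosome
]

results = {g.name: gene_cn_range(g, segments) for g in genes}
for name, value in results.items():
    print(name, value)
```

Genes where min and max differ are the ones that span a breakpoint, and those are the rows to expect as apparent mismatches against the segment table. A gene with no overlapping segment comes back as `None` here; whether that is what produces the NAs in the real table is a guess on my part.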
As for the major/minor alleles, I think those could be useful for inferring genes with LOH even if they’re not consistent between genes.
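As a rough sketch of that LOH idea, assuming the segment table's `major_copy_number` and `minor_copy_number` columns carry the usual allele-specific meaning (copies of the more and less abundant parental allele), a segment with a minor copy number of 0 has lost one allele. The function name `loh_status`, the labels, and the toy rows below are invented for illustration.

```python
def loh_status(major_copy_number: int, minor_copy_number: int) -> str:
    """Label one segment from its allele-specific copy numbers."""
    total = major_copy_number + minor_copy_number
    if total == 0:
        return "homozygous deletion"
    if minor_copy_number == 0:
        # One parental allele is absent: loss of heterozygosity.
        return "copy-neutral LOH" if total == 2 else "LOH"
    return "heterozygous"


# Toy rows using the column names from the segment table (values invented).
segments = [
    {"chromosome": "chr1", "start": 1, "end": 1000, "major_copy_number": 1, "minor_copy_number": 1},
    {"chromosome": "chr1", "start": 1001, "end": 2000, "major_copy_number": 2, "minor_copy_number": 0},
    {"chromosome": "chr2", "start": 1, "end": 5000, "major_copy_number": 1, "minor_copy_number": 0},
    {"chromosome": "chr3", "start": 1, "end": 3000, "major_copy_number": 0, "minor_copy_number": 0},
]

labels = [
    loh_status(s["major_copy_number"], s["minor_copy_number"]) for s in segments
]
print(labels)
```

Genes could then inherit the label of the segment(s) they overlap, with the same caveat as before for genes that span a breakpoint.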
That’s interesting about ascatNGS—that link does seem promising because it mentions both copy number estimates and segment copy numbers, but I don’t understand how it can have used ascatNGS when the ascatNGS paper says it only works on WGS data whereas these tables on TCGA come from SNP6 arrays. Perhaps it means WGS as opposed to WES, not as opposed to SNP arrays?
I don’t think there’s data available for normal samples, since when I select ASCAT2 on TCGA for a given patient I just see one file for segments and one for gene-level, presumably both for the tumour sample. My guess is that it is actual absolute copy numbers because I saw median CN ~3 in a few samples I checked from commonly-WGD tumour types and median CN = 2 in a typically-diploid tumour type sample, but I’ll have to find out for sure somehow. I might ask on the ASCAT Github and then try emailing TCGA/GDC.
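For reference, the median check I did amounts to something like the sketch below. It assumes a tab-separated gene-level table with a `copy_number` column in which missing values are blank or `NA`; the function name `median_copy_number` and the toy rows are made up. Note that this is a median over genes, not weighted by genomic length, and a plausible median is only a sanity check, not proof that the calls are absolute.

```python
import csv
import io
import statistics


def median_copy_number(tsv_text: str):
    """Median of the copy_number column, skipping blank and NA entries."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    values = []
    for row in reader:
        raw = (row.get("copy_number") or "").strip()
        if raw in ("", "NA"):
            continue  # ignore genes without a call
        values.append(int(raw))
    return statistics.median(values) if values else None


# Toy table (invented values) standing in for one sample's gene-level file.
rows = [
    ["gene_name", "chromosome", "start", "end", "copy_number"],
    ["GENE_A", "chr1", "100", "200", "3"],
    ["GENE_B", "chr1", "300", "400", "3"],
    ["GENE_C", "chr1", "500", "600", "NA"],
    ["GENE_D", "chr2", "100", "200", "2"],
    ["GENE_E", "chr2", "300", "400", "4"],
    ["GENE_F", "chr3", "100", "200", "3"],
]
toy = "\n".join("\t".join(r) for r in rows) + "\n"

print(median_copy_number(toy))
```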
At least for the moment, I’m not trying to find recurrent sCNAs. I just need genome-wide absolute copy numbers, so it would be very handy if I could take them directly from these tables.
No problem and, yes, there are still a few uncertainties about what the data really is. It seems that you may want the gene-level data. I would also honestly consider contacting Peter van Loo (ASCAT main developer) directly. His group was re-processing all of the TCGA samples around the time when I was at UCL Cancer Inst. Perhaps it was those re-processed samples that are now on the GDC, but I don't know.
I'll do that, cheers :)