Question

Analysing gene CNV from TCGA using TCGAbiolinks

12 weeks ago

Matt • 0

Hi all, I am trying to analyse copy number changes for genes in the TCGA-LUAD dataset. Specifically, I'm looking for cases with high copy number for genes such as MET and ERBB2. Using TCGAbiolinks, I can download the CNV data for this purpose, but I've become a bit confused about the different types of copy number data it contains.

Using GDCquery(project = "TCGA-LUAD", data.category = 'Copy Number Variation') gets all data, which seems to come from Affymetrix and Illumina platforms. The data types here are gene level copy number, allele-specific copy number segment, copy number segment, etc. The workflows include ABSOLUTE and ASCAT.

I did another query using GDCquery(project = "TCGA-LUAD", data.category = 'Copy Number Variation', data.type = 'Gene Level Copy Number'), and this only gave Affymetrix data. However, the data here seems to include some paired normal-tumour samples (these samples have an ID made of a tumour barcode and a normal tissue/blood barcode separated by a semicolon). It also had a lot of duplicates, which needed to be removed before the GDCprepare function would work.

Finally, I used GDCquery(project = "TCGA-LUAD", data.category = 'Copy Number Variation', data.type = 'Gene Level Copy Number', sample.type = c('Primary Tumor')). This avoided the issue with the sample IDs having two barcodes and the issue with duplicates.

However, the copy numbers for specific cases are very different between the datasets. For example, in the one without sample.type specified, there are several samples with ERBB2 > 10 copies, but not in the 'Primary Tumor' one.

I know that CNV data often uses normal tissue as a comparison, but I'm confused about why, in the TCGA data here, only some samples seem to be matched. I've read a few previous posts about this and the GDC bioinformatics data guide, but I can't work out which data I should be using. Should I only be looking at the Primary Tumor samples, or should I specifically be looking at the ones which are matched to a normal sample? And why are the values so different for the same patient ID?

Thanks for the help!

TCGA • 822 views

score 3 · Answer 1 · 2025-04-17

I cannot answer your TCGAbiolinks question, but I can answer your GDC CNV data question. There are two types of segment data:

floating-point values, i.e. the segment mean (segmean)
integer absolute copy numbers, derived from the floating-point values using tools like ASCAT or ABSOLUTE

So first you need to decide which type of segment data you want to use, based on your purpose. Then, for the integer segments, there are gene-level copy numbers that inherit their copy number from the segments.
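To make "inherit" concrete, here is a minimal base R sketch of how a gene can pick up a copy number from the integer segments it overlaps. The coordinates are made up, `gene_copy_number` is a hypothetical helper, and the summarising rule (a length-weighted median plus the minimum and maximum) is an assumption for illustration; check the GDC documentation for the rule each workflow actually applies.

```r
# Toy integer copy-number segments on one chromosome (made-up coordinates).
segments <- data.frame(
  start       = c(1, 1001, 3001),
  end         = c(1000, 3000, 5000),
  copy_number = c(2L, 6L, 3L)
)

# Hypothetical helper: summarise the segments that overlap one gene.
gene_copy_number <- function(gene_start, gene_end, segments) {
  hit <- segments$start <= gene_end & segments$end >= gene_start
  if (!any(hit)) {
    return(c(copy_number = NA, min_copy_number = NA, max_copy_number = NA))
  }
  seg <- segments[hit, ]
  # Number of gene bases covered by each overlapping segment.
  overlap <- pmin(seg$end, gene_end) - pmax(seg$start, gene_start) + 1
  # Length-weighted median (assumed rule): the first copy number, in
  # ascending order, whose cumulative overlap reaches half of the total.
  ord <- order(seg$copy_number)
  cum <- cumsum(overlap[ord])
  weighted_median <- seg$copy_number[ord][which(cum >= sum(overlap) / 2)[1]]
  c(copy_number     = weighted_median,
    min_copy_number = min(seg$copy_number),
    max_copy_number = max(seg$copy_number))
}

# A gene spanning the boundary between the 2-copy and 6-copy segments.
res <- gene_copy_number(900, 2100, segments)
print(res)
```

A gene that straddles a breakpoint, as in this toy case, can have different minimum and maximum values, so it is worth checking both before calling a gene amplified.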

There are multiple CNV pipelines in the GDC (because CNV inference is hard, especially for integer copy numbers). If you see more than one, here is my ranking, from most to least preferred (you do need to make sure your entire cohort uses the same workflow).

1. ABSOLUTE liftover: derived from SNP6 arrays, but human-curated
2. ASCAT3: also from SNP6, semi-curated
3. ASCAT2: from SNP6, not curated
4. ascatNGS: from WGS, not curated
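As a rough illustration of applying this ranking while keeping one workflow for the whole cohort, here is a small base R sketch. The `files` table and the `pick_cohort_workflow` helper are hypothetical, and the workflow labels are copied from the list above; the exact strings in the GDC metadata may be spelled differently.

```r
# Preference order taken from the ranking above (labels may differ in GDC metadata).
workflow_rank <- c("ABSOLUTE liftover", "ASCAT3", "ASCAT2", "ascatNGS")

# Hypothetical listing of available files: one row per case/workflow pair.
files <- data.frame(
  case     = c("case_A", "case_A", "case_B", "case_B", "case_C"),
  workflow = c("ASCAT2", "ASCAT3", "ASCAT3", "ascatNGS", "ASCAT3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the best-ranked workflow that covers every case,
# or NA if no single workflow covers the whole cohort.
pick_cohort_workflow <- function(files, workflow_rank) {
  n_cases <- length(unique(files$case))
  for (wf in workflow_rank) {
    covered <- unique(files$case[files$workflow == wf])
    if (length(covered) == n_cases) {
      return(wf)
    }
  }
  NA_character_
}

chosen <- pick_cohort_workflow(files, workflow_rank)
# Keep only files from the chosen workflow so the cohort is consistent.
cohort_files <- files[files$workflow == chosen, ]
print(chosen)
print(cohort_files)
```

Requiring full coverage is only one way to read "same workflow for the entire cohort"; an alternative is to take the top-ranked workflow and drop the cases it does not cover.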

In the future, there will be WGS-derived ABSOLUTE calls, either curated or not. You should always aim for the curated ones if available.