Analysing gene CNV from TCGA using TCGAbiolinks
21 days ago
Matt • 0

Hi all, I am trying to analyse copy number changes for genes in the TCGA-LUAD dataset. Specifically, I want to find cases with high copy number for genes such as MET and ERBB2. Using TCGAbiolinks I can download the CNV data for this, but I've become a bit confused about the different types of copy number data it contains.

Using GDCquery(project = "TCGA-LUAD", data.category = 'Copy Number Variation') returns all the data, which seems to come from Affymetrix and Illumina platforms. The data types include gene level copy number, allele-specific copy number segment, copy number segment, etc. The workflows include ABSOLUTE and ASCAT.

I did another query using GDCquery(project = "TCGA-LUAD", data.category = 'Copy Number Variation', data.type = 'Gene Level Copy Number'), and this only gave Affymetrix data. However, the data here seem to include some paired tumour-normal samples (the sample ID contains a tumour barcode and a normal tissue/blood barcode separated by a semicolon). There were also a lot of duplicates, which had to be removed before the GDCprepare function would work.
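To see which barcode in a paired ID is the tumour and which is the normal, the sample type can be read from the barcode itself. The following is a minimal sketch in plain Python rather than TCGAbiolinks: it assumes the standard TCGA barcode layout, where the first two characters of the fourth dash-separated field encode the sample type, and the barcodes shown are made up for illustration.

```python
# TCGA sample-type codes (first two characters of the fourth barcode field).
# Only a few codes are listed here; others fall through to "other".
SAMPLE_TYPES = {
    "01": "Primary Tumor",
    "10": "Blood Derived Normal",
    "11": "Solid Tissue Normal",
}


def sample_type(barcode):
    """Return the sample type encoded in a TCGA barcode."""
    fields = barcode.strip().split("-")
    code = fields[3][:2]
    return SAMPLE_TYPES.get(code, "other (" + code + ")")


def describe(sample_id):
    """Split a possibly paired ID on ';' and label each barcode."""
    return [(b.strip(), sample_type(b)) for b in sample_id.split(";")]


# Made-up barcodes, for illustration only.
paired = "TCGA-XX-0001-01A-11D-0000-01;TCGA-XX-0001-10A-01D-0000-01"
for barcode, kind in describe(paired):
    print(barcode, "->", kind)
```

An ID that yields two entries from describe() is a tumour-normal pair; an ID that yields one is a single sample.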

Finally, I used GDCquery(project = "TCGA-LUAD", data.category = 'Copy Number Variation', data.type = 'Gene Level Copy Number', sample.type = c('Primary Tumor')). This avoided the issue with the sample IDs having two barcodes and the issue with duplicates.

However, the copy numbers for specific cases are very different between the datasets. For example, in the one without sample.type specified there are several samples with ERBB2 > 10 copies, but none in the 'Primary Tumor' one.

I know that CNV data often use normal tissue as a comparison, but I'm confused about why the TCGA data here only seem to be matched for some samples. I've read a few previous posts about this, as well as the GDC bioinformatics data guide, but I can't work out which data I should be using. Should I only look at the Primary Tumor samples, or specifically at the ones that are matched to a normal sample? And why are the values so different for the same patient ID?

Thanks for help!

TCGA • 520 views
20 days ago
Zhenyu Zhang ★ 1.3k

I cannot answer your TCGAbiolinks question, but I can answer your GDC CNV data question. There are two types of segment data:

  • floating-point values, i.e. segment means (segmean)
  • integer absolute copy numbers, derived from the floating-point values using tools like ASCAT or ABSOLUTE
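To make the difference between the two value types concrete, here is a minimal Python sketch that turns a segment mean into a rough copy-number estimate. It assumes the segment mean is log2(copy number / 2), which is how the GDC documentation describes its copy number segment files, and it ignores tumour purity and ploidy. Tools like ASCAT and ABSOLUTE model those, so their integer calls will generally not match this naive conversion.

```python
def segmean_to_copies(segmean):
    """Naive conversion, assuming segmean = log2(copies / 2)."""
    return 2 * 2 ** segmean


# A segmean of 0 corresponds to the diploid baseline of 2 copies.
for value in (-1.0, 0.0, 1.0, 2.0):
    print(value, "->", segmean_to_copies(value))
```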

So first you need to decide which type of segment data to use, based on your purpose. For the integer segments, there are also gene-level copy numbers, which inherit their values from the segments.

There are multiple CNV pipelines in GDC (because CNV inference is hard, especially for integer copy numbers). If you see more than one, here is my ranking, most preferred first (you do need to make sure your entire cohort uses the same workflow).

  • ABSOLUTE liftover: although it's from SNP6, the calls are human-curated
  • ASCAT3: also from SNP6, and semi-curated
  • ASCAT2: from SNP6, not curated
  • ascatNGS: from WGS, not curated
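One way to apply this ranking while keeping the whole cohort on a single workflow is sketched below in plain Python. The records are made up, the field names (case, workflow) are hypothetical, and the workflow label strings are assumptions that should be checked against the names in your own query results.

```python
# Ranking from the list above, most preferred first.
# The exact label strings are assumptions, not confirmed GDC values.
RANKING = ["ABSOLUTE LiftOver", "ASCAT3", "ASCAT2", "AscatNGS"]


def pick_workflow(records, ranking=RANKING):
    """Return the highest-ranked workflow that covers every case, or None."""
    all_cases = {r["case"] for r in records}
    for workflow in ranking:
        covered = {r["case"] for r in records if r["workflow"] == workflow}
        if covered == all_cases:
            return workflow
    return None


# Made-up file records: case A lacks ASCAT2, case B lacks ABSOLUTE LiftOver.
records = [
    {"case": "A", "workflow": "ABSOLUTE LiftOver"},
    {"case": "A", "workflow": "ASCAT3"},
    {"case": "B", "workflow": "ASCAT3"},
    {"case": "B", "workflow": "ASCAT2"},
]

chosen = pick_workflow(records)
cohort = [r for r in records if r["workflow"] == chosen]
print(chosen, len(cohort))
```

Here the top-ranked workflow is skipped because it does not cover case B, so the sketch falls back to the best workflow shared by every case. Dropping the uncovered cases instead is an equally reasonable design choice if curated calls matter more than cohort size.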

In the future, there will be WGS-derived ABSOLUTE calls, either curated or not. You should always aim for the curated ones if available.


Great, thank you, that's very helpful. It seems that if I restrict the GDC query to 'Primary Tumor' samples, all the samples use the ABSOLUTE liftover workflow. And if I don't restrict the query but then filter my results to only include ABSOLUTE liftover, they are all Primary Tumor. This answers the main question, as it removes the 'double sample' entries where both a normal blood/tissue sample and a tumour sample are referenced on one line of the data frame.

In summary, it seems the human-curated ABSOLUTE data use both the tumour and normal data to infer integer copy numbers for genes in the tumour. The copy numbers for the genes I was looking at are lower when using ABSOLUTE only, but I gather this is expected because the calls are curated?

Finally, can I treat these ABSOLUTE liftover values like any other CNV data? For example, can I take a copy number > 10 as clinically relevant for my genes, given that the literature reports benefit from inhibitors of the relevant pathway when copy number is > 10?
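Whether that cutoff carries over to ABSOLUTE-derived integers is the open question here, but the filtering step itself is mechanical. A minimal Python sketch on made-up rows follows; the field names (case, gene, copy_number) are hypothetical, and the threshold of 10 is simply the value quoted above, not a validated clinical cutoff.

```python
GENES = {"MET", "ERBB2"}
THRESHOLD = 10  # value quoted in the post; an assumption, not a validated cutoff


def high_copy_cases(rows, genes=GENES, threshold=THRESHOLD):
    """Map each gene of interest to the cases strictly above the threshold."""
    hits = {gene: [] for gene in sorted(genes)}
    for row in rows:
        if row["gene"] in genes and row["copy_number"] > threshold:
            hits[row["gene"]].append(row["case"])
    return {gene: sorted(cases) for gene, cases in hits.items()}


# Made-up gene-level rows, for illustration only.
rows = [
    {"case": "case-1", "gene": "ERBB2", "copy_number": 14},
    {"case": "case-1", "gene": "MET", "copy_number": 2},
    {"case": "case-2", "gene": "ERBB2", "copy_number": 10},
    {"case": "case-3", "gene": "MET", "copy_number": 12},
    {"case": "case-3", "gene": "TP53", "copy_number": 20},
]

print(high_copy_cases(rows))
```

Note that the comparison is strict, so a case with exactly 10 copies is not flagged.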
