Hi all, I am trying to analyse copy number changes for genes in the TCGA-LUAD dataset. Specifically, I'm looking for cases with high copy number for genes such as MET and ERBB2. Using TCGAbiolinks, I am able to download the CNV data for this purpose, but I've become a bit confused about the different types of copy number data it contains.
Using GDCquery(project = "TCGA-LUAD", data.category = 'Copy Number Variation') gets all the data, which seems to come from Affymetrix and Illumina platforms. The data types here include gene level copy number, allele-specific copy number segment, copy number segment, etc. The workflows include ABSOLUTE and ASCAT.
I did another query using GDCquery(project = "TCGA-LUAD", data.category = 'Copy Number Variation', data.type = 'Gene Level Copy Number'), and this only gave Affymetrix data. However, the data here seem to include some paired tumour-normal samples (these samples have an ID made of a tumour barcode and a normal tissue/blood barcode separated by a semicolon). It also had a lot of duplicates, which needed to be removed before the GDCprepare function would work.
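To make the two issues above concrete, here is a minimal base R sketch of splitting semicolon-joined IDs and dropping duplicate rows. The table is toy data, and the column names `cases` and `file_name` are assumptions for illustration, not guaranteed TCGAbiolinks output. The only TCGA-specific fact used is that characters 14-15 of a TCGA barcode hold the sample-type code (01 = primary tumour, 10 = blood-derived normal, 11 = solid tissue normal).

```r
# Toy stand-in for a query results table (hypothetical barcodes and
# column names; not real TCGA-LUAD records).
results <- data.frame(
  cases = c(
    "TCGA-AA-0001-01A-11D-0000-01;TCGA-AA-0001-10A-01D-0000-01",
    "TCGA-AA-0001-01A-11D-0000-01;TCGA-AA-0001-10A-01D-0000-01",
    "TCGA-BB-0002-01A-11D-0000-01",
    "TCGA-CC-0003-01A-11D-0000-01;TCGA-CC-0003-11A-01D-0000-01"
  ),
  file_name = c("a.tsv", "a_copy.tsv", "b.tsv", "c.tsv"),
  stringsAsFactors = FALSE
)

# Characters 14-15 of a TCGA barcode are the sample-type code.
sample_type_code <- function(barcode) substr(barcode, 14, 15)

# Split composite IDs into their individual barcodes.
barcodes <- strsplit(results$cases, ";", fixed = TRUE)

# Flag rows that reference more than one sample (tumour plus normal).
results$paired_with_normal <- lengths(barcodes) > 1

# Pull out the primary tumour barcode (code "01") from each row.
results$tumour_barcode <- vapply(
  barcodes,
  function(b) {
    hit <- b[sample_type_code(b) == "01"]
    if (length(hit) > 0) hit[1] else NA_character_
  },
  character(1)
)

# Keep the first row for each composite ID and drop exact repeats.
deduplicated <- results[!duplicated(results$cases), ]
print(deduplicated[, c("tumour_barcode", "paired_with_normal")])
```

Keeping the first row per ID is an arbitrary choice here; with real data it is worth checking why the duplicates exist before deciding which one to keep.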
Finally, I used GDCquery(project = "TCGA-LUAD", data.category = 'Copy Number Variation', data.type = 'Gene Level Copy Number', sample.type = c('Primary Tumor')). This avoided the issue with the sample IDs having two barcodes and the issue with duplicates.
However, the copy numbers for specific cases are very different between the datasets, e.g. in the one without sample.type specified, there are several samples with ERBB2 > 10 copies, but none in the 'Primary Tumor' one.
I know that CNV data often uses normal tissue as a comparison, but I'm confused about why the TCGA data here only seems to be matched for some samples. I've read a few previous posts about this and the GDC bioinformatics data guide, but I can't work out which data I should be using - should I only be looking at the Primary Tumor samples, or should I be specifically looking at the ones with a matched normal, and why are the values so different for the same patient ID?
Thanks for your help!
Great, thank you, that's very helpful. It seems that if I filter the query to only the 'Primary Tumor' samples from GDC, all the samples use the ABSOLUTE liftover workflow. And if I don't filter the query but then filter my results to only include ABSOLUTE liftover, they are all Primary Tumor. This answers the main question, as it removes the 'double sample' references where both normal blood/tissue and tumour samples are referenced on one line of the dataframe.
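The check described above can be sketched as a cross-tabulation of workflow against sample type. This is toy data invented to mirror what I saw, not real query output. The column names `analysis_workflow_type` and `sample_type`, the workflow labels, and the way paired rows are written are all assumptions here.

```r
# Toy results table (hypothetical values; column names and labels are
# assumptions, not confirmed TCGAbiolinks output).
results <- data.frame(
  analysis_workflow_type = c("ABSOLUTE LiftOver", "ABSOLUTE LiftOver",
                             "ASCAT2", "ASCAT2"),
  sample_type = c("Primary Tumor", "Primary Tumor",
                  "Primary Tumor;Blood Derived Normal",
                  "Primary Tumor;Solid Tissue Normal"),
  stringsAsFactors = FALSE
)

# Count rows for each workflow / sample type combination.
workflow_by_sample <- table(results$analysis_workflow_type, results$sample_type)
print(workflow_by_sample)

# Keep only the ABSOLUTE rows and confirm they are all primary tumour.
absolute_only <- results[results$analysis_workflow_type == "ABSOLUTE LiftOver", ]
all_primary <- all(absolute_only$sample_type == "Primary Tumor")
print(all_primary)
```

Running the same table on the real, unfiltered results should show directly whether the workflow and sample type line up the way they appear to.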
In summary, it seems the human-curated ABSOLUTE data uses both the tumour and normal data to infer integer copy numbers for genes in the tumour. The CNV numbers for the genes I was looking at are lower when using ABSOLUTE only, but I gather this is normal as they are curated?
Finally, can I treat these ABSOLUTE liftover datapoints like any other CNV, e.g. infer that the clinically relevant CN for my genes is >10, as the literature reports that cases with CN >10 for these genes benefit from inhibitors of that gene's pathway?
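For reference, the kind of cutoff I mean could be applied roughly like this. It is a base R sketch on a made-up gene-by-sample matrix. The values, the sample names, and the matrix layout are hypothetical, and the threshold of 10 is just the cutoff mentioned above, not a validated clinical rule.

```r
# Hypothetical gene-by-sample copy number matrix (invented values).
cn <- matrix(
  c(2, 12,  3,
    4,  2, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("MET", "ERBB2"),
                  c("sample_A", "sample_B", "sample_C"))
)

# Cutoff taken from the question; treat it as an assumption.
threshold <- 10

# Logical matrix: TRUE where copy number exceeds the cutoff.
high <- cn > threshold

# For each gene, list the samples above the cutoff (which() skips NAs).
high_samples <- lapply(rownames(cn), function(g) colnames(cn)[which(high[g, ])])
names(high_samples) <- rownames(cn)
print(high_samples)
```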