Meaning of TCGA CNV data
7.7 years ago
akij ▴ 190

I need to do some statistical calculations on CNV data that are publicly available on the TCGA website. I come from a computer science background and have no idea what these data mean. I tried searching for the meaning of the files and how they are structured, but nothing was helpful. It would be nice if someone could give a short overview of the meaning of the data in each column. Here is a sample of the data in a CNV file:

Sample                                                     Chromosome  Start   End      Num_Probes  Segment_Mean
DEBUT_p_TCGAb45_81_wRedosSNP_N_GenomeWideSNP_6_G03_729530  1           61735   415164   28          -0.0504
DEBUT_p_TCGAb45_81_wRedosSNP_N_GenomeWideSNP_6_G03_729530  1           462793  629241   4           1.822
DEBUT_p_TCGAb45_81_wRedosSNP_N_GenomeWideSNP_6_G03_729530  1           668210  2138242  350         -0.0311

So I want to know the underlying meaning of Sample and how samples are named; the sample name always seems to be very long. I understand the columns Chromosome, Start and End, but I don't understand the meaning of Num_Probes and Segment_Mean.


Very nice answer. I have a couple of questions. Now that the data are on GRCh38, how can we run GISTIC, since we need a new marker file etc.? Any input is appreciated. Also, sample names look like e.g. CYANS_p_TCGAb_422_423_424_NSP_GenomeWideSNP_6_B03_1513914. How can we convert them into TCGA barcodes? Thanks


You can try the function posted in this thread to convert a filename to a TCGA barcode: "C: problem in matching the names between file names and patients Id in TCGA"

It will not work for all filenames, though. Be acutely aware of the legacy = TRUE/FALSE parameter that is passed to the function.

7.7 years ago
Mattias Aine ▴ 640

Sample should be the unique identifier for this SNP-array experiment; it corresponds to a single sample run by TCGA, which can be a normal or a tumor sample. It will map to a unique ID like the ones described here: https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode

Each row corresponds to a contiguous chunk of the chromosome with the same DNA copy number. Chromosome is the chromosome, which can be 1-22, X or Y (see human genome). Start is the physical start location of the segment along said linear chromosome, and End is the end coordinate. Num_Probes is the number of SNP-array probes falling within the segment (these were used to call copy numbers). Segment_Mean is the estimated copy number for that particular segment, expressed as a log2 ratio.

So this sample has a (log2) copy number of -0.0504 on chromosome 1 from base 61735 to 415164. At 462793 on the same chromosome the copy number changes to 1.822 (i.e. a genomic gain), and this segment ends at 629241. The next segment for which you have information runs from 668210 to 2138242, where the copy number has returned to a basically normal level. These segments should keep coming for chr1 until about 249M (the end of the chromosome), after which the chr2 segments should start.
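To make the column layout concrete, here is a minimal Python sketch that reads the three rows quoted in the question. The `Segment` class and `parse_segments` function are names made up for this illustration, and the sketch assumes the file is whitespace- or tab-delimited with a single header line, as in the sample shown.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    sample: str
    chromosome: str
    start: int
    end: int
    num_probes: int
    segment_mean: float  # log2 ratio, not an absolute copy number


def parse_segments(text):
    """Parse whitespace-delimited segment rows, skipping the header line."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    return [
        Segment(r[0], r[1], int(r[2]), int(r[3]), int(r[4]), float(r[5]))
        for r in rows[1:]
    ]


# The three rows quoted in the question.
EXAMPLE = """\
Sample Chromosome Start End Num_Probes Segment_Mean
DEBUT_p_TCGAb45_81_wRedosSNP_N_GenomeWideSNP_6_G03_729530 1 61735 415164 28 -0.0504
DEBUT_p_TCGAb45_81_wRedosSNP_N_GenomeWideSNP_6_G03_729530 1 462793 629241 4 1.822
DEBUT_p_TCGAb45_81_wRedosSNP_N_GenomeWideSNP_6_G03_729530 1 668210 2138242 350 -0.0311
"""

segments = parse_segments(EXAMPLE)
for seg in segments:
    length = seg.end - seg.start + 1
    print(f"chr{seg.chromosome}:{seg.start}-{seg.end} "
          f"({length} bp, {seg.num_probes} probes) log2 ratio {seg.segment_mean}")
```

The chromosome is kept as a string on purpose, since X and Y appear alongside 1-22.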

The data should basically be log2(intensity of sample / intensity of a reference with normal copy number). This means that segments with a normal diploid copy number are around log2(2/2) = 0, single-copy losses are at log2(1/2) = -1, and homozygous deletions are at log2(0/2) = -Inf. Correspondingly, gains go from log2(3/2) ≈ 0.58 and upward, with experimental noise leading to deviations from these discrete levels.
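The arithmetic above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a diploid reference and a pure tumor sample (real samples contain normal cells, which pulls the values toward 0). The function names and the 0.3 cutoff in `classify` are hypothetical choices for this example, not a TCGA or GISTIC standard.

```python
import math


def log2_ratio(copies, reference_copies=2):
    """Expected Segment_Mean for a given integer copy number."""
    if copies == 0:
        return float("-inf")  # homozygous deletion: log2(0/2)
    return math.log2(copies / reference_copies)


def estimated_copies(segment_mean, reference_copies=2):
    """Invert the log2 ratio back to an (unrounded) copy number."""
    return reference_copies * 2 ** segment_mean


def classify(segment_mean, threshold=0.3):
    """Label a segment; the threshold is an arbitrary illustrative cutoff."""
    if segment_mean > threshold:
        return "gain"
    if segment_mean < -threshold:
        return "loss"
    return "neutral"


# Expected log2 ratios for copy numbers 0 through 4.
for copies in range(5):
    print(copies, log2_ratio(copies))

# The three Segment_Mean values from the question.
for value in (-0.0504, 1.822, -0.0311):
    print(value, round(estimated_copies(value), 2), classify(value))
```

With any reasonable cutoff the 1.822 segment comes out as a gain and the other two as neutral, matching the walkthrough of the sample rows.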

For some light reading about the concept of copy-number analysis check out these old papers: https://www.ncbi.nlm.nih.gov/pubmed/16899659 and https://genomebiology.biomedcentral.com/articles/10.1186/gb-2008-9-9-r136 and maybe http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-8-382


Because every patient has one run on normal (usually blood) DNA as well as one on tumor tissue. This gets us to 2 files per patient. Running a blood normal alongside the tumor sample allows you to filter out the patient's private, naturally occurring copy-number changes, so you know what is really a tumor-related change.

If you check the full names of the files you refer to, you can see that they contain either "grch38.seg" or "nocnv.grch38.seg". The first is a segmented (see the papers in my answer) copy-number profile mapped to the hg38 build of the human genome, and the second is the same but with CNVs filtered out (normally occurring Copy Number Variants, as opposed to tumor-specific changes). This gets us to 4 files per unique patient.

If you use legacy data you will double this, as there both the normal and the tumor are mapped to hg18 and hg19, and each of these has a nocnv as well as a "cnv" file.

If you check the file metadata for the sample you linked to, you will find that one of the files links to a biospecimen that is a blood-derived normal and the other is from a primary tumor.

Tumor: https://portal.gdc.cancer.gov/files/50a00a11-4a5c-4cfe-97d3-381229658433
Blood: https://portal.gdc.cancer.gov/files/adae0cef-63dd-4d54-86e3-f5cce3386f21


Thanks for such a nice reply.


@Mattias Aine, how do we plot these CNV data?
