5.8 years ago
jrlarsen
I am looking at focal and broad events in CNVs from the TCGA. I’ve noticed that the segmented CNV data contains segments supported by as few as 2 probes, and these segments look like possible noise (especially in normal profiles). Is there a minimum number-of-probes threshold, preferably with a reference, for filtering out segments supported by few probes? I know the “join segment” option in GISTIC2.0 has a default of 4, but I can’t find a reference or justification for this choice of minimum number of probes. Any advice or assistance is welcome, and thank you!
Can you clarify the exact data that you have obtained? The TCGA data, in particular copy number, has been processed and re-processed by many sources.
You can see the exact processing steps here: https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/#copy-number-segmentation
Hi Kevin, thank you for responding. Your link to the processing steps really helped, and I may need to repost with a new question. I followed the link to the processing steps, then to the DNAcopy reference manual, and found the default command used by the TCGA for CBS (segmentation). It appears my question is really about the “min.width” argument of segment(), which is set to 2. Would it be incorrect to filter the output of segment() to remove segments with very few markers, such as 2? If not, how many markers are normally considered appropriate for copy number data? Currently the CNVs look noisy, especially the normals, since there is an abundance of segments with very few markers.
2 does seem quite low, as the probes are extremely small relative to the potential size of copy number variants, and also considering the marker spacing on the chip. Indeed, the vast majority of the TCGA copy number segment data is derived by applying the CBS algorithm to data from the Affymetrix SNP 6.0 array, which has probes for both SNPs and CNVs. My PhD project involved processing samples from this same array type but, back then, CBS wasn't really well known. I recall deriving copy number segments from a hidden Markov model with a minimum number of markers of 50, I believe. Indeed, I just pulled it from my thesis:
To reprocess the data, you'd have to obtain the controlled access CEL files. Given just the level 3 data, you can probably just filter out events that are below a certain marker threshold.
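As a rough sketch of that kind of filter on level 3 segment data: the snippet below assumes a tab-separated segment table with a `Num_Probes` column, as in the GDC segment files. The sample rows are made up for illustration, and `filter_segments` and the cutoff of 10 are hypothetical, not a recommended value.

```python
import csv
import io

# Made-up rows in the layout assumed for a level 3 segment file
# (tab-separated; the column names are an assumption).
SEG_TEXT = (
    "Sample\tChromosome\tStart\tEnd\tNum_Probes\tSegment_Mean\n"
    "S1\t1\t3218610\t95674710\t53225\t0.0055\n"
    "S1\t1\t95676511\t95676518\t2\t-1.6636\n"
    "S1\t1\t95680124\t167057183\t24886\t0.0053\n"
    "S1\t2\t484222\t1012345\t8\t0.4120\n"
)


def filter_segments(rows, min_probes):
    """Keep only segments supported by at least `min_probes` markers."""
    return [r for r in rows if int(r["Num_Probes"]) >= min_probes]


rows = list(csv.DictReader(io.StringIO(SEG_TEXT), delimiter="\t"))

# The cutoff is illustrative only; choose it from your own data.
kept = filter_segments(rows, min_probes=10)

for r in kept:
    print(r["Sample"], r["Chromosome"], r["Start"], r["End"],
          r["Num_Probes"], r["Segment_Mean"])
```

Note that dropping a short segment leaves a gap between its neighbours. Whether to leave the gap or merge the neighbours is a separate decision that this sketch does not make.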
Note that Broad Institute have already re-processed this data with GISTIC 2.0. I have elaborated a pipeline here, which is annoyingly spread across different threads: A: How to extract the list of genes from TCGA CNV data
Thank you so much Kevin, your thesis does seem to address this issue. I would be happy to read your thesis, but maybe it’ll be a quicker read if this was applied in an article you published? I do enjoy a good Materials & Methods section haha
The thesis is available via the British Library: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.657536
It may not be informative, though. The related published work is Genomic analysis of circulating cell-free DNA infers breast cancer dormancy
Back then, I was not even using R, you have to realise. R was not that common back then.
----------------------------
Your best bet is to start with the other thread that I mentioned: A: How to extract the list of genes from TCGA CNV data
The published work related to that is: Racial differences in endometrial cancer molecular portraits in The Cancer Genome Atlas. My affiliation with the University of Leicester goes back to 2010, but I have been an External Collaborator since 2013.