Question

Number of Probes per Segment for TCGA CNVs

5.8 years ago

jrlarsen • 0

I am looking at focal and broad events in CNVs from the TCGA. I’ve noticed that the post-segmentation CNV data includes segments supported by as few as 2 probes, and these segments look like possible noise (especially in normal profiles). Is there a minimum number-of-probes threshold, preferably with a reference, for filtering out segments with few probes? I know the “join segment” option in GISTIC2.0 has a default of 4, but I can’t find a reference or justification for this choice of minimum number of probes. Any advice or assistance is welcome, and thank you!

TCGA CNV CBS • 2.9k views

ADD COMMENT • link updated 5.8 years ago by Biostar 20 • written 5.8 years ago by jrlarsen • 0

Can you clarify the exact data that you have obtained? The TCGA data, in particular copy number, has been processed and re-processed by many sources.

You can see the exact processing steps here: https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/#copy-number-segmentation

ADD REPLY • link 5.8 years ago by Kevin Blighe 88k

Hi Kevin, thank you for responding. Your link to the processing steps really helped, and I may need to repost this as a new question. I followed the link to the pipeline description, then to the DNAcopy reference manual, and found the default command used by the TCGA for CBS (segmentation). It appears my question is really about “min.width” in segment(), which is set to 2. Would it be incorrect to filter the output of segment() to remove segments with very few markers, such as 2? Either way, how many markers are normally considered appropriate for copy number data? Currently the CNVs look noisy, especially the normals, since there is an abundance of segments with very few markers.
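Before picking a cut-off, it may help to see how many segments each candidate threshold would actually remove. Below is a minimal Python sketch, assuming a tab-separated segment table with a `Num_Probes` column; the column name, the toy values, and the helper `fraction_below` are illustrative assumptions, not part of any TCGA or DNAcopy tooling. The thresholds 4 and 50 are simply the two values mentioned in this thread, and 10 is an arbitrary intermediate value.

```python
import csv
import io

# Toy segment table in the layout assumed here (tab-separated, with a
# Num_Probes column). The values are made up for illustration only.
EXAMPLE_SEG = """Sample\tChromosome\tStart\tEnd\tNum_Probes\tSegment_Mean
S1\t1\t100\t5000\t2\t0.41
S1\t1\t5001\t900000\t350\t0.02
S1\t2\t100\t2000\t3\t-0.55
S1\t2\t2001\t800000\t120\t-0.01
S1\t3\t100\t70000\t12\t0.30
"""


def fraction_below(handle, thresholds, probes_col="Num_Probes"):
    """Return {threshold: fraction of segments with fewer probes than it}."""
    counts = [int(row[probes_col]) for row in csv.DictReader(handle, delimiter="\t")]
    if not counts:
        return {t: 0.0 for t in thresholds}
    return {t: sum(c < t for c in counts) / len(counts) for t in thresholds}


# Candidate thresholds: 4 and 50 come from this thread, 10 is arbitrary.
summary = fraction_below(io.StringIO(EXAMPLE_SEG), thresholds=[4, 10, 50])
for threshold, fraction in summary.items():
    print(f"< {threshold} probes: {fraction:.0%} of segments")
```

On real data, replace the `io.StringIO` handle with an opened segment file and adjust `probes_col` to match its header.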

ADD REPLY • link 5.8 years ago by jrlarsen • 0

2 does seem quite low, as the probes are extremely small relative to the potential size of copy number variants, and also considering the marker spacing on the chip. Indeed, the vast majority of the TCGA copy number segment data is derived by applying the CBS algorithm to data from the Affymetrix SNP 6.0 array, which has probes for both SNPs and CNVs. My PhD project involved processing samples from this same array type but, back then, CBS wasn't really well known. I recall deriving copy number segments from a hidden Markov model with a minimum of 50 markers per segment, I believe. Indeed, I just pulled it from my thesis:

...minimum of 50 markers per segment, p-value cut-off of <0.0001, and a signal-to-noise ratio of 0.5 (low ratios result in many break-points being reported; higher results in fewer). Each patient/control’s own lymphocyte sample was used as the reference during segmentation, with minimum segment sizes of 1, 50, 100, and 1,000Kb being used for viewing different-sized amplifications and deletions across different samples.

To reprocess the data, you'd have to obtain the controlled access CEL files. Given just the level 3 data, you can probably just filter out events that are below a certain marker threshold.
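As a rough illustration of filtering level 3 data by marker count, here is a minimal Python sketch. It assumes a tab-separated segment file with a `Num_Probes` column; the column name, the toy rows, the helper `filter_segments`, and the threshold of 10 are all assumptions for the example, not a recommended cut-off.

```python
import csv
import io

# Toy level 3 style segment table (assumed layout; values are invented).
EXAMPLE_SEG = """Sample\tChromosome\tStart\tEnd\tNum_Probes\tSegment_Mean
S1\t1\t100\t5000\t2\t0.41
S1\t1\t5001\t900000\t350\t0.02
S1\t2\t100\t2000\t3\t-0.55
S1\t2\t2001\t800000\t120\t-0.01
"""


def filter_segments(handle, min_probes, probes_col="Num_Probes"):
    """Keep only segments supported by at least `min_probes` markers."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if int(row[probes_col]) >= min_probes]


# min_probes=10 is an arbitrary illustrative value, not a published standard.
kept = filter_segments(io.StringIO(EXAMPLE_SEG), min_probes=10)
for seg in kept:
    print(seg["Chromosome"], seg["Start"], seg["End"], seg["Num_Probes"])
```

Note that dropping segments leaves gaps in each sample's profile. Whether to leave those gaps or merge the removed regions into neighbouring segments is a separate decision that this sketch does not address.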

Note that Broad Institute have already re-processed this data with GISTIC 2.0. I have elaborated a pipeline here, which is annoyingly spread across different threads: A: How to extract the list of genes from TCGA CNV data

ADD REPLY • link 5.8 years ago by Kevin Blighe 88k

Thank you so much Kevin, your thesis does seem to address this issue. I would be happy to read your thesis, but maybe it’ll be a quicker read if this was applied in an article you published? I do enjoy a good Materials & Methods section haha

ADD REPLY • link 5.8 years ago by jrlarsen • 0

The thesis is available via the British Library: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.657536

It may not be informative, though. The related published work is Genomic analysis of circulating cell-free DNA infers breast cancer dormancy

Back then, I was not even using R, you have to realise. R was not that common back then.

----------------------------

Your best bet is to start with the other thread that I mentioned: A: How to extract the list of genes from TCGA CNV data

The related published work to that is: Racial differences in endometrial cancer molecular portraits in The Cancer Genome Atlas. My affiliation with the University of Leicester goes back to 2010, but has been as External Collaborator since 2013.

ADD REPLY • link 5.8 years ago by Kevin Blighe 88k