I have CNV SNP array data from TCGA that looks like this:
Sample Chromosome Start End Num_Probes Segment_Mean
Sample1 1 61735 757469 46 0.5909
Sample1 1 757923 12852748 6470 -0.1666
Sample1 1 12857863 13776072 94 0.2141
Sample1 1 13776828 16149915 1792 -0.1672
Sample1 1 16153497 16155010 10 1.1636
Sample1 1 16165661 17012422 355 -0.1473
Sample1 1 17012456 17247727 81 0.1974
Sample1 1 17247845 25583341 5292 -0.1525
Sample1 1 25593128 25611452 14 -2.5747
and I'd like to convert it into a format that looks like this:
Gene1 0.2729
Gene2 -0.5803
Gene3 0.9857
In the result, '0.2729', '-0.5803', and '0.9857' are the degrees of deletion or amplification, and 'Gene1', 'Gene2', and 'Gene3' should be named according to the HUGO standard.
Where can I find tools that can do this kind of annotation?
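One way to picture the conversion being asked for is as an interval overlap: for each gene, take the segments that overlap it and average their Segment_Mean values, weighted by the number of overlapping bases. The sketch below is only an illustration of that idea, not a specific tool's method. The gene names and coordinates (GeneA, GeneB, GeneC) are made-up placeholders; real coordinates and HUGO symbols would have to come from a gene annotation on the same genome build as the segment file.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    chrom: str
    start: int
    end: int
    seg_mean: float


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


def gene_level_means(segments, genes):
    """Return {gene name: overlap-weighted mean of Segment_Mean}.

    Genes that no segment overlaps are left out of the result.
    Coordinates are treated as 1-based and inclusive.
    """
    result = {}
    for gene in genes:
        weighted_sum = 0.0
        covered_bases = 0
        for seg in segments:
            if seg.chrom != gene.chrom:
                continue
            overlap = min(seg.end, gene.end) - max(seg.start, gene.start) + 1
            if overlap > 0:
                weighted_sum += overlap * seg.seg_mean
                covered_bases += overlap
        if covered_bases:
            result[gene.name] = weighted_sum / covered_bases
    return result


# A few Sample1 segments from the table above.
segments = [
    Segment("1", 61735, 757469, 0.5909),
    Segment("1", 757923, 12852748, -0.1666),
    Segment("1", 12857863, 13776072, 0.2141),
    Segment("1", 16153497, 16155010, 1.1636),
]

# Hypothetical genes: names and coordinates are placeholders, not real annotation.
genes = [
    Gene("GeneA", "1", 100000, 200000),      # inside the first segment
    Gene("GeneB", "1", 16150000, 16160000),  # contains the short fourth segment
    Gene("GeneC", "1", 12850000, 12860000),  # straddles segments two and three
    Gene("GeneD", "2", 1000, 2000),          # no segment on this chromosome
]

result = gene_level_means(segments, genes)
for name, value in result.items():
    print(name, round(value, 4))
```

Weighting by overlap is a design choice; taking the value of the segment covering most of the gene, or flagging genes that span a breakpoint, would be equally reasonable alternatives.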
Num_Probes is the number of consecutive probes that make up the segment, and Segment_Mean is the mean value of those probes. See the documentation for the R DNAcopy package for more details.
Do you mean that Segment_Mean stands for "log2(Detected Number/2)"? So values > 0 are amplifications and values < 0 are deletions? It seems that Num_Probes isn't needed for CNV analysis.
You might not need the number of probes, but for filtering and QC purposes it can be invaluable, because a) probes are not evenly spaced and b) segments defined by larger numbers of probes are generally higher-confidence calls.
It's not quite that simple - if your segment_mean is 0.07 (~= 2.1 copies), it's not particularly accurate to call that an amplification. The difference from 2 is usually just a result of noise. Setting reasonable thresholds for gain and loss is a hard problem, especially when you take into account things like subclonal copy number events in cancer.
Thanks so much! I really appreciate your help with my PhD research!
How did you calculate that there are ~=2.1 copies if the segment_mean is 0.07?
2^0.07*2 = 2.099433
I assume that I rounded :)
(edit - I screwed up while typing in a meeting earlier. You raise 2 to the nth power, then multiply by two, since the assumption is that the normal sample is diploid.)
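The arithmetic above can be written as a small function, shown here as a minimal sketch. The function name and the `normal_ploidy` parameter are just illustrative choices, and the calculation assumes, as stated, that the normal sample is diploid.

```python
def segment_mean_to_copies(segment_mean: float, normal_ploidy: int = 2) -> float:
    """Convert a log2 ratio (Segment_Mean) into an estimated copy number.

    Assumes the reference (normal) sample has `normal_ploidy` copies,
    so copies = normal_ploidy * 2 ** segment_mean.
    """
    return normal_ploidy * 2 ** segment_mean


# A few Segment_Mean values from the thread and the table above.
for value in (0.07, 0.5909, -2.5747):
    print(value, round(segment_mean_to_copies(value), 4))
```

A Segment_Mean of 0 maps back to exactly 2 copies, which is why small values such as 0.07 sit so close to the diploid baseline.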