I ran CNVkit as usual in batch mode for >100 whole-exome mouse samples. Then I generated bed files (one per sample) to get the integer copy number of each aberrant segment in each sample, as follows:
cnvkit.py export bed -x male WholeExomeMouseSample_1.cns -o WholeExomeMouseSample_1.cns.bed
The output bed file for a given sample is something like this:
2 87071181 90429432 WholeExomeMouseSample_1 3
2 90429932 111291758 WholeExomeMouseSample_1 3
2 111292258 111646005 WholeExomeMouseSample_1 4
3 29357078 91552512 WholeExomeMouseSample_1 3
3 92014572 114061589 WholeExomeMouseSample_1 3
3 114206302 159934364 WholeExomeMouseSample_1 3
5 3344361 14678781 WholeExomeMouseSample_1 3
5 145365571 146184973 WholeExomeMouseSample_1 3
6 15324588 18681705 WholeExomeMouseSample_1 13
7 34218228 34911854 WholeExomeMouseSample_1 3
...
Now, I want to find the total genomic length (in base pairs) of all segments with an aberrant copy number in a given sample (let's call it L_alter_CNA). In other words, I need the total length of the portion of the genome altered by copy number changes. We can simply calculate this (I think!) by summing end - start over all lines in the above bed file.
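To make the calculation concrete, here is a minimal sketch of that sum in Python. It assumes the whitespace-separated five-column layout shown above (chromosome, start, end, sample name, copy number); the function name `total_altered_length` is my own and is not part of CNVkit.

```python
def total_altered_length(bed_lines):
    """Sum (end - start) over BED rows laid out as:
    chrom, start, end, label, copy number (whitespace-separated)."""
    total = 0
    for line in bed_lines:
        fields = line.split()
        # Skip blank lines, placeholders, and BED header lines.
        if len(fields) < 3 or fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        start, end = int(fields[1]), int(fields[2])
        total += end - start
    return total


# First three rows of the example output above.
example = """\
2 87071181 90429432 WholeExomeMouseSample_1 3
2 90429932 111291758 WholeExomeMouseSample_1 3
2 111292258 111646005 WholeExomeMouseSample_1 4
"""

L_alter_CNA = total_altered_length(example.splitlines())
print(L_alter_CNA)  # 24573824
```

For a real file, something like `total_altered_length(open("WholeExomeMouseSample_1.cns.bed"))` should work the same way. Those three rows alone already add up to about 24.6 Mb.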
However, for most samples, L_alter_CNA is several-fold larger than the real exonic length of the sample.
Why is this? What am I missing here? Or maybe I misunderstand the bed files generated by CNVkit?
Thank you!
You're right! I have no idea how CNVkit came up with this large contig. I followed their procedure exactly. I hope @Eric Talevich (developer of CNVkit) comments on this.