Problem: I want to analyze copy number for whole-genome next-generation sequencing data from non-human samples. Ideally, I'd like copy number information displayed in something like the UCSC genome browser, which means I need a copy number value for every window (e.g., assume a window size of 1000 bp) along the genome.
Attempted methods:
I start with BAM files.
I've looked into a variety of tools (CNVnator, ReadDepth, cn.mops, etc.), but either 1) I've had trouble getting them to compile, or 2) they don't work well for non-human samples.
I have been using this thread as a base algorithm, i.e., as Irsan points out (my rough R version of these steps is sketched below the quote):
- Define the genomic windows:
bedtools makewindows ... | sort-bed - > yourGenomicWindows.bed
- Count how many reads with mapping quality greater than 35 map to each genomic window, for each sample. With the bedops suite do:
bam2bed < yourSample.bam | awk '{if($5>35)print $0}' | bedmap --count yourGenomicWindows.bed - > yourSample.count
- Put the count files into one big matrix
- Import into R (or any other environment where you can perform numerical operations)
- Normalize for library size: divide the count in each genomic window by the number of millions of mapped reads for that sample
- Get baseline: calculate the median value for each genomic window across all samples (or some other set of samples that you are sure are copy-number neutral). You will notice that for some genomic windows the median read count will be 0. This means the region is hard to sequence/map. Usually these windows are located near centromeres and telomeres; just delete those windows, they are not informative.
- For each sample, divide the count of each window by the baseline count of that window and log2-transform. Be careful with samples that have homozygous deletions: they will have count 0, so when you calculate log2(tumor/baseline) you will get -Infinity. As a solution, cap the values in your data at, for example, -5 and 5.
- Segmentation ..."
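For reference, here is my rough attempt at the matrix/normalization steps in R. This is only a sketch of my own understanding, not the thread's actual code: the file name counts.txt, its layout (chrom/start/end plus one raw-count column per sample), and the sample names normal1-3 / tumor1 are all made up.

# Sketch of the recipe above. Assumptions (mine, not the thread's):
# counts.txt is tab-separated with columns chrom, start, end, then one
# raw read-count column per sample; the windows came from something
# like "bedtools makewindows -g genome.txt -w 1000".
counts  <- read.table("counts.txt", header = TRUE, sep = "\t")
windows <- counts[, 1:3]                    # chrom, start, end
raw     <- as.matrix(counts[, -(1:3)])      # one column per sample

# Normalize for library size: counts per million mapped reads.
cpm <- sweep(raw, 2, colSums(raw) / 1e6, "/")

# Baseline: per-window median across copy-number-neutral samples
# ("normal1".."normal3" are assumed names).
neutral  <- c("normal1", "normal2", "normal3")
baseline <- apply(cpm[, neutral, drop = FALSE], 1, median)

# Drop windows whose median is 0 (hard to map, e.g. near centromeres
# and telomeres).
keep     <- baseline > 0
cpm      <- cpm[keep, ]
windows  <- windows[keep, ]
baseline <- baseline[keep]

# Per-sample log2 ratio, capped at +/-5 so homozygous deletions
# (count 0) don't become -Inf.
log2ratio <- pmin(pmax(log2(sweep(cpm, 1, baseline, "/")), -5), 5)

# One sample written as bedGraph for UCSC-style display ("tumor1" is
# an assumed sample name).
bg <- data.frame(windows, score = log2ratio[, "tumor1"])
write.table(bg, "tumor1.log2ratio.bedGraph", quote = FALSE, sep = "\t",
            row.names = FALSE, col.names = FALSE)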
However:
- What is segmentation, and why is it useful? (The closest thing to a lead I've found is the DNAcopy package; my guess at the call is sketched after these questions.)
- How does this help me deal with large-scale deletions or inversions?
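On segmentation, the only lead I've turned up is Bioconductor's DNAcopy package (circular binary segmentation). Below is my guess at what the call would look like, reusing the windows/log2ratio objects from the sketch above ("tumor1" is still a made-up name); I'd appreciate an explanation of what this is actually doing for me:

# Guess at the segmentation step with Bioconductor's DNAcopy (CBS),
# continuing from the windows/log2ratio objects in the sketch above.
library(DNAcopy)

cna <- CNA(genomdat  = log2ratio[, "tumor1"],  # assumed sample name
           chrom     = windows[, 1],
           maploc    = windows[, 2],           # window start as position
           data.type = "logratio",
           sampleid  = "tumor1")

seg <- segment(smooth.CNA(cna))  # smooth outliers, then run CBS
head(seg$output)  # per segment: chrom, start, end, num windows, mean log2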
ALSO, this doesn't have to be particularly precise, i.e., I know highly-accurate CNV algorithms are the stuff papers are made of :) I'm just looking for a somewhat-accurate starting point.
Thanks!