I have a question regarding Copy Number analysis on Exome sequencing data.(NGS data)
I have multiple BAM files (around 30) and I have some target regions which I want to check if there is any Copy number gain.
What would be the best way to do it? I am a newbie in this field and it seems there are lot of tools that do CNV analysis and I have no clue how to choose one and do the analysis.
It is really not that difficult, you can easily get genome-wide copy number estimates yourself. Then perform CBS segmentation on those copy number estimates in case you have tumor samples, if you have non-tumor samples you could also use an HMM-based segmentation program.
In order to get copy number estimates based on read depth you have to compare genomic windows across samples. You cannot compare genomic windows within a sample unless you perform some smart normalization trained on the behaviour of the baits of your sample prep kit. So you need samples to serve as a baseline for each of your 30 cases. In the best scenario, you have matched data (e.g. tumor-normal pairs). If you don't have matched data you have to create a baseline for each genomic window based on the median of your 30 samples or better, a pool of samples from which you are sure they are copy number stable and sequenced with the same procedures.
Count how much reads with mapping quality bigger than 35 map to the genomic windows for each sample. With the bedops suite do: bam2bed < yourSample.bam | awk '{if($5>35)print $0}' | bedmap --count yourGenomicWindows.bed - > yourSample.count
put the count files in one big matrix
import in R (or any other environment where you can perform numerical operations)
Normalize for library size: divide the count in each genomic window by the amount of million mapped reads of that sample
Get baseline: calculate the median value for each genomic window across all samples (or some other samples of which you are sure they are copy number neutral). You will notice that for some genomic windows the median read count will be 0. This means this is a genomic region that is hard to sequence/map. Usually these windows are located near centromeres and telomerers, just deleted those windows, they are not informative.
For each sample, divide the count of each window by the baseline count of that window and log2-transform. Be careful with samples that have homozygous deletions. They will have count 0 so when you calculate the log2(tumor/baseline) you will get -Infinity. As a solution, make sure the minimum and maximum numbers in your data are for example -5 and 5
Segmentation ...
ADD COMMENT
• link
updated 4.9 years ago by
Ram
44k
•
written 10.6 years ago by
Irsan
★
7.8k
0
Entering edit mode
Thanks Irsan for your detailed answer. Let me figure out all the subparts of your answer :) as I am kind of learning these all tools these days.
Indeed, copy number estimates obtained from whole genome sequencing are more accurate. Still, you can do good copy number and LOH analysis with exome data described above. I have done it for tons of samples.
They all suck - the data is too noisy. The folks at the Broad claims XHMM is the best, but I didn't get great results from it. I liked CoNIFER (was able to use it sucessfully here), but watch out - the code has a few bugs (If I remeber correctly, the depth data and probe locations can become missaligned).
when we started caring about CNVs we knew that CNV detection through NGS data was not completely trustworthy, so we've decided to go for the algorithm that convinced us the most through its paper, and that has been exomeDepth. we have experienced great results using exomeDepth, but unfortunately we don't have any success rate at exome level to share. since we do exome sequencing because it's more cost-effective that sequencing a bunch of genes related to a pathology, we then focus on those genes only for clinical purposes. the false positive rate seems to be quite high I must admit, but the feeling we are getting (testing promising ones through MLPA and confirming some of them) is that the false negative rate is so low that we aren't missing the valuable ones. this is critical for clinical purposes, but maybe a high positive rate may not be that useful for proper full exome analysis, without being able to limit the scope of your region of interest in advance.
Got good results with exomeCopy, with a few tweaks even on aneuploid samples. But its a tricky problem. Its immensly helpful to have a large cohort of samples run on the same platform for better baseline estimation and filtering for recurrent events.
Hi Hersman, Jorge and Fred .. thanks for your comments. I will try to play around with those softwares you guyz mentioned.
Lots of answers in these previous questions. (If there weren't a bunch of answers on this question already, I would have closed it as a duplicate)