Question

CNV analysis tool on exome data for NGS

9

11.1 years ago

subhajit06 ▴ 110

Dear all,

I have a question regarding copy number analysis on exome sequencing data (NGS data).

I have multiple BAM files (around 30) and some target regions that I want to check for copy number gain.

What would be the best way to do it? I am a newbie in this field, and it seems there are a lot of tools that do CNV analysis. I have no clue how to choose one and do the analysis.

thanks,

--Subhajit

bam exome next-gen cnv • 15k views

ADD COMMENT • link updated 3.6 years ago by Ram 45k • written 11.1 years ago by subhajit06 ▴ 110

0

Hi Hershman, Jorge and Fred, thanks for your comments. I will try out the software you guys mentioned.

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 11.1 years ago by subhajit06 ▴ 110

0

Lots of answers in these previous questions. (If there weren't a bunch of answers on this question already, I would have closed it as a duplicate)

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 11.1 years ago by Chris Miller 22k

Ram · Answer 1 · 2014-04-18

It is really not that difficult: you can easily get genome-wide copy number estimates yourself. Then perform CBS (circular binary segmentation) on those copy number estimates if you have tumor samples; if you have non-tumor samples, you could also use an HMM-based segmentation program.

To get copy number estimates based on read depth, you have to compare genomic windows across samples. You cannot compare genomic windows within a sample unless you perform some smart normalization trained on the behaviour of the baits of your sample prep kit. So you need samples to serve as a baseline for each of your 30 cases. In the best scenario, you have matched data (e.g. tumor-normal pairs). If you don't have matched data, you have to create a baseline for each genomic window based on the median of your 30 samples or, better, on a pool of samples that you are sure are copy number stable and that were sequenced with the same procedures.

So what you need to do is:

1. Define the genomic windows: bedtools makewindows ... | sort-bed - > yourGenomicWindows.bed
2. Count how many reads with mapping quality greater than 35 map to each genomic window for each sample. With the BEDOPS suite do: bam2bed < yourSample.bam | awk '{if($5>35)print $0}' | bedmap --count yourGenomicWindows.bed - > yourSample.count
3. Put the count files in one big matrix.
4. Import the matrix into R (or any other environment where you can perform numerical operations).
5. Normalize for library size: divide the count in each genomic window by the number of millions of mapped reads of that sample.
6. Get the baseline: calculate the median value for each genomic window across all samples (or across some other samples that you are sure are copy number neutral). You will notice that for some genomic windows the median read count is 0. This means the region is hard to sequence or map. These windows are usually located near centromeres and telomeres; just delete them, they are not informative.
7. For each sample, divide the count of each window by the baseline count of that window and log2-transform. Be careful with samples that have homozygous deletions: they will have count 0, so log2(tumor/baseline) will be -Infinity. As a solution, clamp the values to a fixed range, for example -5 to 5.
8. Segmentation ...
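
The numerical steps above (library-size normalization, median baseline, log2 ratio with clamping) could be sketched roughly as follows. This is only an illustration in plain Python rather than R. The function name `log2_ratios`, its arguments and the toy counts are made up for the example, and it assumes the per-window counts of every sample are already loaded in the same window order.

```python
import math
from statistics import median


def log2_ratios(counts, mapped_millions, clip=5.0):
    """Hypothetical helper illustrating the normalization steps.

    counts: dict mapping sample name -> list of read counts per window
            (all samples use the same window order).
    mapped_millions: dict mapping sample name -> millions of mapped reads.
    Returns (kept_window_indices, dict sample -> clamped log2 ratios).
    """
    samples = list(counts)
    n_windows = len(counts[samples[0]])

    # Normalize for library size.
    norm = {s: [c / mapped_millions[s] for c in counts[s]] for s in samples}

    # Baseline: per-window median across all samples.
    baseline = [median(norm[s][i] for s in samples) for i in range(n_windows)]

    # Drop windows whose baseline is 0 (hard to sequence or map).
    keep = [i for i, b in enumerate(baseline) if b > 0]

    ratios = {}
    for s in samples:
        row = []
        for i in keep:
            value = norm[s][i]
            if value == 0:
                # log2(0) would be -Infinity, so use the lower bound directly.
                row.append(-clip)
            else:
                ratio = math.log2(value / baseline[i])
                row.append(max(-clip, min(clip, ratio)))
        ratios[s] = row
    return keep, ratios


# Toy data: three samples, four windows (made-up numbers).
toy_counts = {
    "sampleA": [200, 0, 400, 0],
    "sampleB": [100, 0, 100, 50],
    "sampleC": [100, 0, 100, 50],
}
toy_mapped = {"sampleA": 2.0, "sampleB": 1.0, "sampleC": 1.0}
kept, result = log2_ratios(toy_counts, toy_mapped)
```

In the toy data the second window has a baseline of 0 and is dropped, and the last window of sampleA has count 0, so it is set to the lower bound instead of -Infinity. The resulting per-sample log2 ratios are what would go into the segmentation step.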

Ram · Answer 2 · 2014-04-18

4

11.1 years ago

hershman ▴ 40

They all suck - the data is too noisy. The folks at the Broad claim XHMM is the best, but I didn't get great results from it. I liked CoNIFER (was able to use it successfully here), but watch out - the code has a few bugs (if I remember correctly, the depth data and probe locations can become misaligned).

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 11.1 years ago by hershman ▴ 40

Ram · Answer 3 · 2014-04-18

When we started caring about CNVs, we knew that CNV detection from NGS data was not completely trustworthy, so we decided to go for the algorithm whose paper convinced us the most, and that was ExomeDepth. We have had great results using ExomeDepth, but unfortunately we don't have any success rate at the exome level to share. Since we do exome sequencing because it's more cost-effective than sequencing a bunch of genes related to a pathology, we then focus on those genes only for clinical purposes. I must admit the false positive rate seems to be quite high, but the feeling we are getting (testing promising calls through MLPA and confirming some of them) is that the false negative rate is so low that we aren't missing the valuable ones. This is critical for clinical purposes, but a high false positive rate may make the tool less useful for proper full-exome analysis, where you cannot limit the scope to a region of interest in advance.

score 2 · Answer 4 · 2014-09-24

2

10.6 years ago

Christian ★ 3.1k

Got good results with exomeCopy, with a few tweaks even on aneuploid samples. But it's a tricky problem. It's immensely helpful to have a large cohort of samples run on the same platform for better baseline estimation and filtering for recurrent events.

ADD COMMENT • link 10.6 years ago by Christian ★ 3.1k

Ram · Answer 5 · 2014-04-18

1

11.1 years ago

Fred ▴ 790

Maybe you can try Control FREEC, which handles exome sequencing data.

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 11.1 years ago by Fred ▴ 790

0

Anyone here with experience running Control FREEC? How are the results?

ADD REPLY • link 11.1 years ago by Christian ★ 3.1k

0

It only works for exome data when you have matched samples. But when you have matched samples, copy number analysis is much easier anyway.

ADD REPLY • link 10.8 years ago by Irsan ★ 7.8k

Ram · Answer 6 · 2014-04-18

1

11.1 years ago

Chris Whelan ▴ 590

This paper offers an evaluation of four commonly used tools (XHMM, ExomeDepth, CoNIFER, and CONTRA). The results show how hard a problem it is:

http://onlinelibrary.wiley.com/doi/10.1002/humu.22537/abstract

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 11.1 years ago by Chris Whelan ▴ 590