Question

Comparing Read Depth In Sampes

1

Entering edit mode

11.9 years ago

bari1 ▴ 40

Hi all,

I have depth of coverage (read counts at each base ) from 8 samples, and i am interested to compare them for finding the regions with differences in the depth (CNVs, deletions etc). I have first normalized these samples and now want to apply some method to compare and see the significant differences. I am using perl script. The problem is I am new to statistical analysis and need little guidance. I know some softwares can do the same job like CNVseq (but they are using the statistics that i donot understand :() What are your suggestions, what is the most simple way to compare these ? Could fold change be applied? I applied z-normalization, is it applicable? Any literature, scripts, and suggestions would work. (Please note i am new to statistical analysis)

EDIT: Please note the data is from resequencing of whole genome and not RNAseq. The samples represents different dog populations.

coverage normalization illumina cnv • 6.9k views

ADD COMMENT • link updated 11.9 years ago by Irsan ★ 7.8k • written 11.9 years ago by bari1 ▴ 40

2

Entering edit mode

I would suggest learning the CNVseq algorithm. Even if you want to write your own version, you need to understand the existing works. This is science.

ADD REPLY • link 11.9 years ago by lh3 33k

0

Entering edit mode

Thanks lh3, I will learn CNVSeq to see how it works, but for my current question it might not work, as i already have information on deletions and inversions using split read and paired end. My interest is to see the depth at these regions by comparing the samples mentioned above.

ADD REPLY • link 11.9 years ago by bari1 ▴ 40

0

Entering edit mode

I have only looked at the homepage and the first manual page of CNVseq. I could be wrong, but... doesn't CNVseq do exactly what you want: comparing the read depth like what you do with arrayCGH?

ADD REPLY • link 11.9 years ago by lh3 33k

1

Entering edit mode

I completely agree with you. I think i misinterpreted the information due to lack of my knowledge.

ADD REPLY • link 11.9 years ago by bari1 ▴ 40

0

Entering edit mode

Could you elaborate more on the data that was sequenced ? Is it genome-seq ? I hope it's not RNA-seq.

ADD REPLY • link 11.9 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

Its re-sequencing data (genomic) using illumina paired end.

ADD REPLY • link 11.9 years ago by bari1 ▴ 40

0

Entering edit mode

Are these tumor samples? Do you have matched controls for these samples? Is this DNA/RNA seq, if it is DNA-seq, is it captured or whole genome, shallow seq or deep seq?

ADD REPLY • link 11.9 years ago by Irsan ★ 7.8k

0

Entering edit mode

sorry, for my poor explanation of data. Its DNA sequence data.

ADD REPLY • link 11.9 years ago by bari1 ▴ 40

score 0 · Answer 1 · 2013-01-28

0

Entering edit mode

11.9 years ago

Sean Davis 27k

There is quite a literature on structural variation. I'd suggest starting to read some of that if you are interested in the statistics. If you are more interested in interpreting your results in a biological context, then there are MANY software packages you can try. They can be divided into three camps, basically:

Read-depth-based methods
Split-read-based methods
hybrid methods that incorporate both read-depth and split-read approaches

ADD COMMENT • link 11.9 years ago by Sean Davis 27k

0

Entering edit mode

thanks Sean, it was very useful. Infact I have already tried a couple of softwares using split read and paired end approach to find the breakpoints. But now i was to validate those marks by using read depth and coverage information. This was the reason to use depth of coverage and look into the regions.

ADD REPLY • link 11.9 years ago by bari1 ▴ 40

2

Entering edit mode

I use my own simple python code for this: https://github.com/seandavi/ngCGH. But, as others have pointed out, there are plenty of tools that are capable of similar analyses and I will not hold my approach (very simple) as better than anyone else's.

ADD REPLY • link 11.9 years ago by Sean Davis 27k

0

Entering edit mode

I wonder, the code takes single tumor.bam and normal.bam for comparisons. Or we could give more than one tumor.bam files and normal.bam files at a time?

ADD REPLY • link 11.9 years ago by bari1 ▴ 40

0

Entering edit mode

The tool is used to compare two samples, typically from the same patient. If you have more than one pair, just run it for each pair. If you have unmatched BAM files (not tumor/normal), you'll have to decide what to compare with. If you want to compare to a "pool" of BAM files, simply merge them all into a single BAM file first.

ADD REPLY • link 11.9 years ago by Sean Davis 27k

0

Entering edit mode

Thanks, this was very useful

ADD REPLY • link 11.9 years ago by bari1 ▴ 40

score 0 · Answer 2 · 2013-01-28

0

Entering edit mode

11.9 years ago

Irsan ★ 7.8k

If you are interested in copy number estimates from DNA-seq data you can easily calcluate RRKM-values for pre-defined genomic windows and calculate the ratio of case/control where case can be for example a tumor sample and control a matched control sample or the median of all your samples

use bedtools makewindows to make (overlapping) genomic windows and bedtools map or bedmap from the bedops suite to count the amount of reads overlapping your window. Divide each number by the total number of million mapped reads in your sample (you can get that with samtools flagstat sample.bam) to correct for coverage effects and divide by window size to correct for fragment sizes (if all your windows are of same length this is not needed but depending on how you make the windows not 100% of your windows might have same length)

ADD COMMENT • link 11.9 years ago by Irsan ★ 7.8k

0

Entering edit mode

Irsan i have windows of equal sizes of 1kbp. Can you explain little more RRKM? Are you explaining in RRKM by using bedtools and bedmap and then dividing by million mapped reads? My samples also have varing depth, could we correct these also ? because if i take the median of my case samples (it could be baised due to not normalized/scaled)?

ADD REPLY • link 11.9 years ago by bari1 ▴ 40

1

Entering edit mode

RPKM = reads per kilobase per million mapped reads so when you have a sample with 100M reads in total and 50.000 reads mapping to a window of 10k, the reads per kilobase = 50.000 / 10 = 5000. Divide that by 100 (100M reads) you get 5000 / 100 = RPKM = 50 (now you have also corrected for the fact that some sample have more depth). Do that for each sample for each region. Then afterwards, take the median of each region across your samples, that will be your baseline. Divide each RPKM for each region for each sample by the corresponding baseline value. Then you have RPKM ratio, then take the log2 of the RPKM ratio and you have Log R Ratio.

Because this way you express the copy number status as a relative measure where each region is compared with the median of that region across all your samples you take away lots of the regional biases. If you have a set of control samples (that should have ploidy 2) that might be even better. The best case is of course to have a matched control.

ADD REPLY • link 11.9 years ago by Irsan ★ 7.8k

0

Entering edit mode

here I am taking about the bias due to varing depth between samples (one sample may have on average 3x coverage while the others 20x coverage).

ADD REPLY • link 11.9 years ago by bari1 ▴ 40