Question

Genome coverage against normalized average depth

0

Entering edit mode

8.0 years ago

konstantinkul ▴ 110

Hello!

Does anybody dealt with this type of graph "A plot of genome coverage against normalized average depth" http://prntscr.com/df9wmc. I really need it for optimization of procedures of WGS. I was trying to figure it out how to generate it, but I couldn't. The description in methods section was not obvious for me (http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-13-1, Methods, Genome coverage).

"Mapped reads for each replicate data sets were normalized by average genome coverage to 11× and 21× for clinical and non-clinical samples respectively, and then pooled or analyzed independently depending on analysis metrics used. For the clinical samples, reads originating from host contamination were filtered out (by aligning all reads to P. falciparum 3D7 reference genome) prior to analysis, hence the relatively low average genome coverage depth for these samples. Evaluation of coverage depth and evenness was based on cumulative distributions over a normalized overall average depth. A measurement of low-coverage index lci (d) is defined as the integration of the cumulative distribution C(x) from 0 to d giving an overall assessment of the coverage at the low end of distribution: lci(d)=∫d0C(x)dx The value lci (0.5) that gives a measurement of the coverage below half of the average depth in the distribution was used to compare evenness of coverage for each data set."

Especially I can't understand the method of normalization "... a normalized overall average depth". It just df$norm_depth <- with(df, (depth/median(over_all_avr_depth))) or not???

I'll be grateful for any links, suggestions, and scripts in Python or R.

R next-gen • 4.8k views

ADD COMMENT • link updated 8.0 years ago by Brian Bushnell 20k • written 8.0 years ago by konstantinkul ▴ 110

score 0 · Answer 1 · 2016-12-04

0

Entering edit mode

8.0 years ago

Brian Bushnell 20k

It looks like the graph is trying to show that as you increase average depth (in other words, the number of reads sequenced), a higher percentage of the genome is covered by at least 1 read. But some preparations have biased distributions and thus need more depth to achieve a given level of genome coverage. You could certainly generate such a graph by mapping your reads, then subsampling the sam file to various depths, and calculating the genome coverage for each subsample (for example, BBMap's pileup.sh tool will calculate the percent genome coverage, and reformat.sh can subsample a sam file to various numbers of reads).

I find this particular graph confusing, though; there does not seem to me to be any point in using normalized rather than raw depth (other than to obfuscate), and the "theoretical" line that is vertical at 1.0 does not make any sense to me.

ADD COMMENT • link 8.0 years ago by Brian Bushnell 20k

0

Entering edit mode

Thank you for your reply. I have similar doubts about "theoretical" line. Let me explain my pipeline 1. I create coverage column for each position of ref genome $ bedtools coverage -abam some.bam -b fasta.tab -d > cov.txt extrat last column cut -f 6 cov.txt > genome_depth 2. In R tab <- read.table("genome_depth", header = FALSE)

Then I created frequency table of different depths freq_tab <- as.data.frame(table(tab$V1))

and calculate percent of each length with specific depth value in this table freq_tab$perc <- with(freq_tab, (100*Freq)/length(tab$V1))

Then I apply simple normalisation by dividing each depth by median depth dc <- as.numeric(levels(freq_tab$Var1)) and freq_tap$scale_depth <- with(freq_tab, (dc/median(dc)))

And plot this data plot(freq_tab$scale_depth, cumsum(freq_tab$perc),type="l")

http://prntscr.com/dfawi2

What do you think is this correct way to obtain such graph? When I add the second sample I get this plot http://prntscr.com/dfb05s

They are similar to what I want, but graphics from different samples in the example picture intersect at a certain point which conveniently display slight differences in the slope.

ADD REPLY • link 8.0 years ago by konstantinkul ▴ 110