Suppose I create a plot representing a genomic region with two axes: say, depth of coverage on x and GC content on y. Since the region is big, I have to work in windows, where each window corresponds to the mean depth of coverage over 10,000 base pairs.
If I then decide to create another set of 10 kbp windows, but this time starting at position 5 kbp, so that each new window effectively overlaps two neighbouring old windows by 5 kbp, what kind of transformation should I apply to my data, given that each region is now effectively represented twice? Just normalize it?
Can you refer me to papers that use this kind of region-wise exploration and show different ways to best capture the true variation within a genomic region while minimizing the errors that overlapping windows may introduce (such as signal-to-noise approaches)?
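To make the setup concrete, here is a minimal sketch of the two window grids I have in mind (numpy-based; the per-base coverage and GC arrays are simulated placeholders, and all names and sizes are just for illustration):

```python
import numpy as np

# simulated per-base data standing in for the real region (placeholders only)
region_len = 10_000_000
coverage = np.random.poisson(30, region_len).astype(float)
gc = np.random.rand(region_len)  # per-base GC fraction (placeholder)

win = 10_000

def window_means(values, window, offset=0):
    """Mean of `values` over consecutive windows of `window` bp, starting at `offset`."""
    trimmed = values[offset:]
    n_win = len(trimmed) // window
    return trimmed[:n_win * window].reshape(n_win, window).mean(axis=1)

# first grid: windows starting at 0, 10k, 20k, ...
cov_a = window_means(coverage, win)
gc_a = window_means(gc, win)

# second grid: same window size, shifted by 5 kb, so each new window
# overlaps two neighbouring windows of the first grid by 5 kb
cov_b = window_means(coverage, win, offset=win // 2)
gc_b = window_means(gc, win, offset=win // 2)
```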
What you suggest sounds like the sliding window approach, which has been used for inspecting variation before. Here is a reference:
Rozas J, Rozas R. DnaSP, DNA sequence polymorphism: an interactive program for estimating population genetics parameters from DNA sequence data. Comput Appl Biosci 1995, 11:621-625.
We wrote a tool for genome-wide analysis of polymorphisms a while ago, VariScan, which implements two kinds of sliding windows: one with a fixed genomic stretch per window, and one with a fixed number of polymorphisms per window.
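This is not VariScan's actual code, just a rough sketch of the difference between the two window schemes (the SNP position array and the parameter values are hypothetical):

```python
import numpy as np

def fixed_length_windows(chrom_len, window=10_000):
    """Windows covering a fixed genomic stretch each: 0-10k, 10k-20k, ..."""
    starts = np.arange(0, chrom_len, window)
    return [(int(s), int(min(s + window, chrom_len))) for s in starts]

def fixed_polymorphism_windows(snp_positions, snps_per_window=50):
    """Windows holding a fixed number of polymorphisms; their genomic length varies."""
    snp_positions = np.sort(np.asarray(snp_positions))
    windows = []
    for i in range(0, len(snp_positions), snps_per_window):
        chunk = snp_positions[i:i + snps_per_window]
        windows.append((int(chunk[0]), int(chunk[-1])))  # start and end of this window
    return windows
```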
Incidentally, what made me ask this question in the first place was seeing Rozas question a PhD candidate during his thesis defence about exactly how he had constructed his sliding windows, since the candidate himself wasn't fully aware of the normalization issue.
I am not entirely sure if this answers your question, but I think a sliding window approach would work best here. For every base x you calculate the mean coverage of the region from x - w/2 to x + w/2. This is how we usually assess over-represented coverage sections in our genome sequencing projects: it filters out the noise, and depending on the window size you will be able to detect significant deviations (i.e. more than 2 SD in either direction) for features of a size roughly comparable to the chosen window size. Our formula is basically this:
$$u(i) = \frac{1}{N + 1} \sum_{m=-N/2}^{+N/2} x_{i+m}$$
where u(i) is the average of the window centred at position i;
N is the chosen window size (the window extends N/2 before and N/2 after position i, so the centre base adds the +1);
x_{i+m} is the coverage X at position i + m, with the offset m running from -N/2 to +N/2 (i.e. from half a window before i to half a window after).
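As a sketch (numpy-based, not from any particular tool), the same centred moving average can be computed for every base with a simple convolution:

```python
import numpy as np

def centred_moving_average(x, n_window):
    """u(i) = 1/(N+1) * sum of x over i-N/2 .. i+N/2, for an even window size N."""
    kernel = np.ones(n_window + 1) / (n_window + 1)
    # mode="same" keeps one value per base; the first and last N/2 positions
    # are effectively zero-padded and therefore underestimated
    return np.convolve(x, kernel, mode="same")

# e.g. smooth per-base coverage with a 10 kb window:
# u = centred_moving_average(coverage, 10_000)
```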
So this deals only with the coverage axis. You can probably do the same for GC content and plot the two against each other, for instance in a 3D graph over position i.
PS: sorry for complex explanation. Just simply cannot fit a proper Sigma function in here...
PPS: please note that for circular genomes you need to make sure the window wraps around to the other end when it approaches the start or the end within half a window size!
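For the circular case, one possible sketch (same assumptions as above) is to pad the array with wrapped-around values before averaging, so that windows near the start and end see the other end of the genome:

```python
import numpy as np

def circular_moving_average(x, n_window):
    """Centred moving average with wrap-around, for circular genomes (even N)."""
    half = n_window // 2
    kernel = np.ones(n_window + 1) / (n_window + 1)
    padded = np.concatenate([x[-half:], x, x[:half]])  # wrap both ends by N/2
    return np.convolve(padded, kernel, mode="valid")   # exactly one value per original base
```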
So are you making a 3D graph, where x and y are GC content and coverage, and z is every 10 kb? I don't quite get how the graph is set up.
Maybe you mean to plot the coverage/GC% ratio per base? Or indeed a 3D graph of coverage, GC, and position?
I meant the coverage/GC% ratio per base, but since plotting all 10 million points would result in a slow-to-render plot full of indistinguishable peaks, I instead wanted to smooth it and work out where the regions of excess variation might actually be, by reducing chunks of 10k points into a single one.
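In code terms, something along these lines (just a sketch; `ratio` is a hypothetical per-base coverage/GC array):

```python
import numpy as np

def reduce_by_chunks(ratio, chunk=10_000):
    """Collapse each block of `chunk` consecutive per-base values into a single point."""
    n = len(ratio) // chunk
    blocks = np.asarray(ratio[:n * chunk]).reshape(n, chunk)
    # keep the mean for plotting, and the standard deviation to flag
    # chunks with excess variation
    return blocks.mean(axis=1), blocks.std(axis=1)

# means, stds = reduce_by_chunks(ratio)
# chunks whose stds sit well above the genome-wide level are candidates
# for regions with excess variation
```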