Question

Genome Wide Plots In [R]

4

12.7 years ago

Zev.Kronenberg 12k

Greetings,

I am trying to plot Fst values against genomic position across a 1.3 Gb genome. Currently I have ~20,000,000 data points to plot. Our collaborators want a GWAS-style plot.

[R] will do this; however, the output PDF is too large to work with.

What approaches do you use to make these figures?

Here are some approaches I have been thinking about:

1) subsample data.

2) average across regions - kernel smoothing / mean.

3) Draw it by hand!

plot r • 11k views

ADD COMMENT • link updated 12.7 years ago by bdemarest ▴ 460 • written 12.7 years ago by Zev.Kronenberg 12k

0

Can you break it down into one file per chromosome or scaffold? Did you try svg() instead of pdf(), with one file per scaffold?

ADD REPLY • link 12.7 years ago by Michael 55k
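
A minimal sketch of the one-file-per-chromosome idea, using PNG output. The data frame `fst_df`, its column names (`chrom`, `pos`, `fst`), and the output directory are hypothetical, and the data are simulated; this is not the commenter's code.

```r
# Simulated stand-in for real data: one row per site (hypothetical column names).
set.seed(1)
n_per_chrom <- c(5000, 3000, 2000)
fst_df <- data.frame(
  chrom = rep(c("chr1", "chr2", "chr3"), times = n_per_chrom),
  pos   = unlist(lapply(n_per_chrom, function(n) sort(sample.int(1e7, n)))),
  fst   = runif(sum(n_per_chrom))
)

# Hypothetical output location.
out_dir <- file.path(tempdir(), "fst_by_chrom")
dir.create(out_dir, showWarnings = FALSE)

# Split by chromosome and write one raster image per chromosome.
by_chrom <- split(fst_df, fst_df$chrom)
png_files <- vapply(names(by_chrom), function(chr) {
  d <- by_chrom[[chr]]
  out <- file.path(out_dir, paste0(chr, ".png"))
  png(out, width = 11, height = 4.25, units = "in", res = 150)
  plot(d$pos, d$fst, pch = ".", col = "#00000040",
       xlab = paste(chr, "position (bp)"), ylab = "Fst", main = chr)
  dev.off()
  out
}, character(1))
```

Swapping `png()` for `svg()` or `pdf()` keeps the same loop, but a vector format will still store every point, so the file size grows with the number of sites.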

0

Check this post: http://biostar.stackexchange.com/questions/18285/r-manhattan-plot-of-fst-values-instead-of-logp

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 12.7 years ago by Maxime Lamontagne ★ 2.4k

0

Yeah, so that only works when you are looking at smaller datasets.

ADD REPLY • link 12.7 years ago by Zev.Kronenberg 12k

Ram · Answer 1 · 2012-04-02

6

12.7 years ago

bdemarest ▴ 460

I suggest:

PNG output.
Alpha transparency for raw data.
Smoothing/summary plotted on top of raw data.
For x-value, use the index. Genomic position is not visible or relevant at this resolution.

Here is some R code.

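# Simulate 1e6 noisy values with two peaks of different width and height.
fake_data <- rnorm(1e6) + c(rep(0, 290000),
                            dnorm(seq(-10, 10, length.out=10000)) * 2,
                            rep(0, 290000),
                            dnorm(seq(-30, 30, length.out=10000)) * 10,
                            rep(0, 400000))

# Raster output keeps the file size independent of the number of points.
png("test.png", width=11, height=4.25, units="in", res=300)
# x is the index of each value; the nearly transparent black (alpha 03)
# makes dense areas darker, and a local-constant smooth is drawn on top.
scatter.smooth(fake_data, span=0.01, degree=0, family="gaussian",
               evaluation=6600, pch=".", col="#00000003")
dev.off()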

ADD COMMENT • link updated 5.1 years ago by Ram 44k • written 12.7 years ago by bdemarest ▴ 460

0

Very nice Brad! Glad to see you're on the forums!

ADD REPLY • link 12.7 years ago by Zev.Kronenberg 12k

score 3 · Answer 2 · 2012-04-02

Not sure why others are surprised by the desire to visualize full genome-wide data - we do it all the time with Manhattan plots.

I have had similar problems, and believe the PDF format is the bottleneck - there are just too many vector objects to draw. So, take a small hit in the crispness of the graph and do it as JPG or PNG, both of which are implemented in R with syntax similar to pdf(). Also, subsampling may be viable; I have done this at times when I refused to sacrifice the look of a vector graphic (e.g. excluding p > 0.1 from a Manhattan plot).
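
A rough sketch of the subsampling idea adapted to Fst: keep every value above a cutoff and only a random fraction of the background, so the PDF stays small while the peaks remain complete. The 99th-percentile cutoff, the 2% background fraction, and the simulated values are all assumptions for illustration.

```r
set.seed(42)
n <- 1e6
fst <- rbeta(n, 0.5, 8)              # simulated stand-in for per-site Fst

cutoff    <- quantile(fst, 0.99)     # assumption: top 1% counts as "interesting"
keep_frac <- 0.02                    # assumption: keep 2% of the background

is_high    <- fst >= cutoff
background <- which(!is_high)
n_bg       <- round(keep_frac * length(background))

# Indices to plot: all high values plus a random subset of the background.
keep <- sort(c(which(is_high), sample(background, n_bg)))

out <- file.path(tempdir(), "fst_subsampled.pdf")
pdf(out, width = 11, height = 4.25)
plot(keep, fst[keep], pch = ".",
     col = ifelse(is_high[keep], "red", "#00000060"),
     xlab = "Index", ylab = "Fst")
dev.off()
```

The background density in the resulting figure no longer reflects the true density of sites, which is worth stating in the figure legend.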

Ram · Answer 3 · 2012-04-02

Hi Zev,

My main concern with such an approach is: who could and would be motivated to look at the data as a whole? While it is possible to skim through a bacterial genome (in a PDF) broken down into several hundred pages, for such a large genome this approach doesn't scale well imo. Also, users of your visualization most likely need to relate it to a known coordinate system of genes, transcripts, chromosomes, etc. In conclusion, I recommend installing or using a genome browser that includes the gene annotations plus the (quantitative?) data to plot, and providing this to users, so that everyone can zoom in, scroll around, search and look at their regions of interest, e.g. use GBrowse. If you absolutely have to make a whole genome graphic

The R approach is better suited for making publication ready plots of smaller regions, once interesting regions have been discovered.

Also look here.

score 1 · Answer 4 · 2012-04-02

1

12.7 years ago

Damian Kao 16k

What's the intended audience and what medium? I assume this is read coverage data? If this is for a publication, a huge 20-million-point graph on 8-inch-wide paper probably would not be very informative. I am assuming there are probably large spans of regions with 0 reads.

I would pick out interesting regions and leave the rest in a supplemental data file or, as Michael suggests, a GBrowse database.

ADD COMMENT • link 12.7 years ago by Damian Kao 16k
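
One possible way to pick out interesting regions is to average Fst in fixed windows and keep the highest-scoring windows. The sketch below assumes 100 kb windows and a top-2% cutoff, and uses simulated positions with one planted peak; the names `win_size`, `win_mean`, and `regions` are illustrative only.

```r
set.seed(7)
n   <- 200000
pos <- sort(sample.int(5e7 - 1, n))      # simulated positions on one chromosome
fst <- rbeta(n, 0.5, 8)                  # simulated background Fst

# Plant a hypothetical elevated region between 20 Mb and 21 Mb.
in_peak <- pos > 2.0e7 & pos < 2.1e7
fst[in_peak] <- pmin(fst[in_peak] + 0.3, 1)

win_size <- 1e5                          # assumption: 100 kb windows
win_id   <- pos %/% win_size             # window index of every site
win_mean <- tapply(fst, win_id, mean)    # mean Fst per window
win_start <- as.numeric(names(win_mean)) * win_size

# Keep windows whose mean is in the top 2% of all windows (an assumption).
top <- win_mean >= quantile(win_mean, 0.98)
regions <- data.frame(start    = win_start[top],
                      end      = win_start[top] + win_size,
                      mean_fst = as.numeric(win_mean[top]))
print(regions)
```

The `regions` table can then drive zoomed-in plots, or be exported as a supplemental file, while `win_mean` itself gives a much smaller series to plot genome-wide.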

0

I feel the same way, but our collaborators want it :(. I used FST to identify several interesting regions across the genome.

ADD REPLY • link 12.7 years ago by Zev.Kronenberg 12k

0

Subsampling the data or skipping regions with 0 coverage would probably be your best bet then. I am actually trying to develop a tool to visualize this type of data in JavaScript/HTML right now. It's a standalone app that anyone can just open with a browser without needing a server backend. Seems like there is a demand for something like this.

ADD REPLY • link 12.7 years ago by Damian Kao 16k

score 0 · Answer 5 · 2012-04-02

I had a similar problem in the past. I had millions of whole-genome points and I wanted to (1) show all the points, (2) save to PDF, (3) keep all text and lines as vector objects. The PDF (or AI) file with all the points was so heavy that it was almost impossible to work with.

I ended up with the following procedure:

Temporarily hide all objects to be kept as vectors (such as axes, text labels, etc.)
Rasterize the rest - all the points - to an image
Put the vector objects back on top of the image.

This way I got a pretty small and quickly loading PDF file. I coded that in MATLAB (I believe R can do it too). It takes some time but is doable.
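
A base-R sketch of a similar result, under stated assumptions: instead of hiding and re-showing objects, the points are binned onto a pixel grid, the grid is drawn with `rasterImage()`, and the axes and labels are drawn by the PDF device as vectors. The grid size (1200 x 400), the square-root shading, and the simulated data are illustrative choices, not the MATLAB procedure described above.

```r
set.seed(3)
n <- 1e6
x <- seq_len(n)                      # index used as the x value
y <- rnorm(n)                        # simulated stand-in for Fst

# Bin every point into a cell of an nx-by-ny pixel grid.
nx <- 1200; ny <- 400
xr <- range(x); yr <- range(y)
ix <- pmin(floor((x - xr[1]) / diff(xr) * nx) + 1, nx)
iy <- pmin(floor((y - yr[1]) / diff(yr) * ny) + 1, ny)
counts <- matrix(tabulate((ix - 1) * ny + iy, nbins = nx * ny),
                 nrow = ny, ncol = nx)

# Convert counts to gray levels (more points = darker). Raster row 1 is
# drawn at the top, so flip the rows to put low y values at the bottom.
img <- as.raster(1 - sqrt(counts[ny:1, , drop = FALSE] / max(counts)))

out <- file.path(tempdir(), "raster_points_vector_axes.pdf")
pdf(out, width = 11, height = 4.25)
# Empty plot: axes, labels, and box are vector objects in the PDF.
plot(NA, xlim = xr, ylim = yr, xaxs = "i", yaxs = "i",
     xlab = "Index", ylab = "Value")
rasterImage(img, xr[1], yr[1], xr[2], yr[2], interpolate = FALSE)
box()
dev.off()
```

The PDF then contains one embedded image plus a handful of vector objects, so its size depends on the grid resolution rather than on the number of points.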