Genome Wide Plots In [R]
12.7 years ago

Greetings,

I am trying to plot Fst values against genomic position across a 1.3 Gb genome. Currently I have ~20,000,000 data points to plot. Our collaborators want a GWAS-style plot.

[R] will do this; however, the output PDF is too large to work with.

What approaches do you use to make these figures?

Here are some approaches I have been thinking about:

1) Subsample the data.

2) Average across regions - kernel smoothing / mean.

3) Draw it by hand!
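
For what it's worth, approach 2 could be sketched roughly as below. This is only a minimal sketch in base R: the `snps` table, its column names, the 10 kb window size and the output file name are all made up for illustration, not taken from the real data.

```r
# Hypothetical input: one Fst value per SNP with its genomic position.
set.seed(1)
n <- 1e5
snps <- data.frame(pos = sort(sample.int(1e7, n)), fst = runif(n))

# Assign each SNP to a fixed-width window (the window size is an arbitrary choice).
window_size <- 1e4
snps$window <- (snps$pos - 1) %/% window_size

# One mean Fst per window; tapply returns a vector named by window index.
win_mean <- tapply(snps$fst, snps$window, mean)
win_mid <- (as.numeric(names(win_mean)) + 0.5) * window_size

# Plot the window means as a line in a PNG file.
png("fst_windows.png", width = 11, height = 4.25, units = "in", res = 300)
plot(win_mid, win_mean, type = "l", xlab = "Position (bp)", ylab = "Mean Fst")
dev.off()
```

With 20 million points and 10 kb windows on a 1.3 Gb genome this would leave on the order of 130,000 values to draw, which is far lighter than the raw data.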


Can you break it down into one file per chromosome or scaffold? Did you try svg() instead of pdf(), with one file per scaffold?


Yeah, so that only works when you are looking at smaller datasets.

12.7 years ago
bdemarest ▴ 460

I suggest:

  1. PNG output.
  2. Alpha transparency for raw data.
  3. Smoothing/summary plotted on top of raw data.
  4. For x-value, use the index. Genomic position is not visible or relevant at this resolution.

Here is some R code.

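# Simulated data: Gaussian noise with two peaks of different width and height.
fake_data <- rnorm(1e6) + c(rep(0, 290000),
                            dnorm(seq(-10, 10, length.out = 10000)) * 2,
                            rep(0, 290000),
                            dnorm(seq(-30, 30, length.out = 10000)) * 10,
                            rep(0, 400000))

# PNG output; points are nearly transparent black, so density shows up as shading.
png("test.png", width = 11, height = 4.25, units = "in", res = 300)
# With only one vector supplied, the index is used as the x-value.
scatter.smooth(fake_data, span = 0.01, degree = 0, family = "gaussian",
               evaluation = 6600, pch = ".", col = "#00000003")
dev.off()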

Very nice Brad! Glad to see you're on the forums!

12.7 years ago
Occam ▴ 410

Not sure why others are surprised by the desire to visualize full genome-wide data - we do it all the time with Manhattan plots.

I have had similar problems, and believe the PDF format is the bottleneck - there are just too many vector objects to draw. So, take a small hit in the crispness of the graph and do it as JPG or PNG, both of which are implemented in R using syntax similar to pdf(). Also, subsampling may be viable; I have done this at times when I refused to sacrifice the look of a vector graphic (e.g. excluding p > 0.1 from a Manhattan plot).
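
A rough sketch of the thresholding idea, assuming a plain vector of p-values: the simulated `p`, the use of the index as x-value and the file name are placeholders for illustration only.

```r
# Hypothetical p-values, one per SNP; the index stands in for genomic position.
set.seed(1)
n <- 1e6
p <- runif(n)
idx <- seq_len(n)

# Drop everything above the threshold before plotting.
threshold <- 0.1
keep <- which(p <= threshold)

# Far fewer vector objects end up in the PDF, so the file stays manageable.
pdf("manhattan_subsampled.pdf", width = 11, height = 4.25)
plot(idx[keep], -log10(p[keep]), pch = ".",
     xlab = "Index", ylab = "-log10(p)")
dev.off()
```

With uniformly distributed p-values this keeps roughly 10% of the points. The empty band at the bottom of the plot is the price of the filter, so it is probably worth stating the cutoff in the figure legend.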

12.7 years ago
Michael 55k

Hi Zev,

My main concern with such an approach is: who could, and would be motivated to, look at the data as a whole? While it is possible to skim through a bacterial genome (in a PDF) broken down into several hundred pages, for such a large genome this approach doesn't scale well imo. Also, users of your visualization most likely need to relate to a known coordinate system of genes, transcripts, chromosomes, etc. In conclusion, I recommend installing or using a genome browser that includes the gene annotations plus the (quantitative?) data to plot, and providing this to users so that everyone can zoom in, scroll around, search and look at their regions of interest, e.g. using GBrowse. If you absolutely have to make a whole genome graphic

The R approach is better suited for making publication-ready plots of smaller regions, once interesting regions have been discovered.

Also look here.


Yes, completely agree. The problem is that our collaborators see GWAS-style plots and want the same for Fst. A SNP-chip GWAS has much less data than a WGS dataset.

12.7 years ago

What's the intended audience, and what medium? I assume this is read coverage data? If this is for a publication, a huge 20-million-point graph on 8-inch-wide paper probably would not be very informative. I am assuming there are probably large spans of regions with 0 reads.

I would pick out interesting regions and leave the rest in a supplemental data file or, as Michael suggests, a GBrowse database.


I feel the same way, but our collaborators want it :(. I used FST to identify several interesting regions across the genome.


Subsampling the data or skipping regions with 0 coverage would probably be your best bet then. I am actually trying to develop a tool to visualize this type of data in JavaScript/HTML right now. It's a standalone app that anyone can just open with a browser, without needing a server backend. Seems like there is a demand for something like this.

12.7 years ago
Yuri ★ 1.7k

I had a similar problem in the past. I had millions of whole-genome points and I wanted to (1) show all the points, (2) save to PDF, (3) keep all text and lines as vector objects. The PDF (or AI) file with all the points was so heavy that it was almost impossible to work with.

I ended up with the following procedure:

  1. Temporarily hide all objects to be kept as vectors (such as axes, text labels, etc.)
  2. Rasterize the rest - all the points - to an image
  3. Put the vector objects back on top of the image.

This way I got a fairly small, quick-loading PDF file. I coded that in MATLAB (I believe R can do it too). It takes some time but is doable.
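
In R, something similar might look like the sketch below. Note that this is not the MATLAB procedure itself but a swapped-in variant: instead of rendering and hiding objects, the points are binned into a pixel grid, the grid is drawn with `rasterImage()`, and the axes and labels are drawn by the PDF device as ordinary vector objects. The simulated data, the grid size and the file name are assumptions for illustration.

```r
# Hypothetical data: index as x-value, simulated values as y.
set.seed(1)
n <- 1e6
x <- seq_len(n)
y <- rnorm(n)

# Bin the points into an nx-by-ny pixel grid (the grid size is an arbitrary choice).
nx <- 1200
ny <- 400
xr <- range(x)
yr <- range(y)
ix <- pmin(nx, 1 + floor((x - xr[1]) / diff(xr) * nx))
iy <- pmin(ny, 1 + floor((y - yr[1]) / diff(yr) * ny))
counts <- matrix(tabulate((ix - 1) * ny + iy, nbins = nx * ny),
                 nrow = ny, ncol = nx)

# Grey level per pixel: white where empty, darker where more points fall.
shade <- 1 - log1p(counts) / log1p(max(counts))
# Row 1 of a raster is the top of the image, so flip the rows.
img <- as.raster(shade[ny:1, ])

pdf("points_raster.pdf", width = 11, height = 4.25)
# Empty frame first: axes, tick labels and titles stay vector objects.
plot(xr, yr, type = "n", xaxs = "i", yaxs = "i",
     xlab = "Index", ylab = "Value")
# All the points go in as a single embedded image.
rasterImage(img, xr[1], yr[1], xr[2], yr[2], interpolate = FALSE)
box()
dev.off()
```

The PDF then holds one image of 1200 x 400 pixels instead of a million point objects, so the file size no longer depends on the number of points.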
