Making a box/density plot for G/C content at millions of genomic locations?
7.5 years ago
mmmmcandrew ▴ 200

Hi all-

I have 4 populations, each containing millions of BED intervals and each with a different average G/C content. I would like to make something similar to a box or density plot. The x-axis would simply have 4 categories (my four populations), and the y-axis would show G/C content. For each population, I would like to show the G/C content of each of its millions of intervals as separate points or as a density cloud, along with some kind of marker showing the average G/C content (normalized to base pair content). Can anyone recommend a simple program that I could use to accomplish this? I would prefer not to use R if possible, as I'm very clumsy with it.

boxplot GC content • 1.7k views
7.5 years ago

Maybe there is one program that will do all of this, but the approach below should work, I'd think, and it might be a good way to learn a few useful skills:

  1. Convert each of your four BED files into its own FASTA file (e.g. with a bed2faidx script or similar that queries samtools faidx-indexed FASTA files).
  2. Run each FASTA file through a GC content script (e.g. the second awk script here), which gives you a GC content value for each sequence. You could pipe this to awk again to print the population name in one column and the fractional GC content value in the second.
  3. Use cat to merge all the population GC values into one file. Import this file into R. Make the population name column a factor so you can use the names as a categorical variable (or "category"). Use the ggplot2 library to make a box or violin plot against the population variable, perhaps labeling it with the median and the first and third quartiles.
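
If you'd rather avoid awk for step 2, here is a minimal Python sketch of the same idea, using only the standard library. It is not the awk script referred to above. The function names (`read_fasta`, `gc_counts`, `gc_table`, `summarize`), the demo sequences, and the choice to ignore N bases are my own assumptions for illustration. It computes a fractional GC value per interval, plus a per-population summary that includes a length-weighted mean, which is one way to read "normalized to base pair content".

```python
import io
import statistics


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_counts(seq):
    """Return (G+C count, A+C+G+T count); N and other symbols are ignored (an assumption)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc, acgt


def gc_table(population, handle):
    """Return one (population, gc_fraction, acgt_length) row per FASTA record."""
    rows = []
    for _header, seq in read_fasta(handle):
        gc, acgt = gc_counts(seq)
        if acgt:  # skip intervals with no A/C/G/T bases at all
            rows.append((population, gc / acgt, acgt))
    return rows


def summarize(rows):
    """Summarize one population; needs at least two rows for the quartiles."""
    fractions = [frac for _, frac, _ in rows]
    q1, median, q3 = statistics.quantiles(fractions, n=4)
    total_bases = sum(length for _, _, length in rows)
    weighted_mean = sum(frac * length for _, frac, length in rows) / total_bases
    return {
        "mean": statistics.mean(fractions),   # every interval counts equally
        "weighted_mean": weighted_mean,       # every base counts equally
        "q1": q1,
        "median": median,
        "q3": q3,
    }


if __name__ == "__main__":
    # Hypothetical in-memory FASTA standing in for one population's file.
    demo = io.StringIO(">chr1:0-4\nGGCC\n>chr1:10-18\nATATATGC\n>chr1:20-24\nNNNN\n")
    rows = gc_table("popA", demo)
    for population, fraction, length in rows:
        print(f"{population}\t{fraction:.4f}\t{length}")
    print(summarize(rows))
```

For real data you would pass `open("popA.fa")` (a placeholder filename) instead of the in-memory demo, write the tab-separated rows for all four populations to one file, and load that file into whatever plotting tool you end up using. With millions of intervals, a violin or box plot will be much more readable than individual points.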