I have the following problem. I need to compute and visualize cytosine methylation levels around splice junctions. I already used TopHat to determine splice junctions. Therefor I used RNA-Seq reads (obtained from spleen cell) downloaded at Roadmap Epigenomics Project and the hg19 as reference genome. I also found the methylation data obtained from the same spleen sample at this homepage. They are provided raw reads (SRR files) and as processed data in bed and wig format. I would prefer to extract the processed methylation data out of the bed/wig file instead of using the raw data, so I would like to know:
Is there any good tool to extract the methylation data out of the bed/wig file and map it to the regions +200 nt and -200 nt left and right of the splice junctions previously computed with TopHat?
Is there a good tool for visualization?
Which tool can I use for statistical analysis? Maybe R Package(s)?
The workflow actually turns out to be very simple, but depending on your hardware it might be very time-consuming. I did the following steps:
Download RNA-Seq data in SRA-format from GEO (i took data from spleen)
Extract reads with the SRA toolkit (take care of library type!!! single-end or paired-end)
Download reference genome from UCSC as 2bit version (in my case human genome hg19)
Convert reference genome to FASTA format with twoBitToFa (available at UCSC)
Index the reference genome with SAMtools
Map the RNA-Seq reads to the reference with TopHat (it uses the SAM index for that)
Download BS-Seq data from the very same sample (if possible) from GEO
If methylation data is given in WIG format, convert to BED first
Now you can use BEDOPS to map methylation data to TopHat's splice junctions like described by Alex Reynolds below
Use R or something else to extract data, do statistical analysis ...
In addition to step 9 and 10 I also implemented a python script (using BioPython) to extract additional data at the splice junctions, like GC and CpG ratio, FPKM (obtained with a tool called Cufflinks) and some more properties
If you only have Wiggle data, you could use convert2bed (wig2bed) to convert Wiggle to BED:
$ wig2bed < methylation.wig > methylation.bed
You could use bedops --range to expand the range of sorted, BED-formatted splice junctions, and then pipe this to bedmap to map methylation signal to the padded junctions:
Once you have signal data, you could read this into R with read.table() or similar to calculate mean, variance, etc. per padded junction. You could use R to make histograms of signal or summary statistics over all junctions, etc.
Thank you so much, that sounds like a very good plan. I will try BEDOPS immediately.
For the subsequent visualization with R, there is one more thing that im not 100% sure about. I need to digitize the methylation levels in 20 bp bins around the splice junctions (within range +/- 200 bp). Is the methylation level within a bin defined as [methylated C's] / [methylated C's + unmethylated C's] ?
Since you didn't get a reply the last time you posted more or less the same thing the likely answer is that you'll have to do a bit of coding.
1. You might be able to use deepTools for this by converting to bigWig. Feeding the splice sites as an appropriate BED file into computeMatrix should allow plotting the profile, which is what I presume you want.
2. If (1) doesn't suffice for you you'll have to code together the extraction part and plot things in R.
3. You need to define "statistical analysis" before anyone can help you. There are too many possibilities for what that could mean.
Yes that is absolutely true, question 3 was a little bit unclear in this context. What i meant was the "normal" statistic calculations like mean, variance etc...
Nevertheless i thank you very much for helping me. For now, i will try the solution from Alex Reynolds (the post above) since i already installed BEDOPS on my PC. At the moment it seems to be the easiest way to solve my task.
Hi, can you share your workflow for this problem if you have figure it out? I want to do a very similar thing.
Hi,
The workflow actually turns out to be very simple, but depending on your hardware it might be very time-consuming. I did the following steps:
In addition to step 9 and 10 I also implemented a python script (using BioPython) to extract additional data at the splice junctions, like GC and CpG ratio, FPKM (obtained with a tool called Cufflinks) and some more properties