Hello,
I have WGBS data from a plant species (2 treatments and 1 control, with 3 replicates each). I aligned the reads to the genome with Bismark and extracted the methylation calls with MethylDackel. Since I'm also interested in CHH and CHG sites, I used the --CHG and --CHH options.
But I find that the CHH files are insanely huge (>50 GB per sample). I analyzed the CpG sites (~8 GB per sample) with the R packages methylKit and DSS, which was already incredibly slow and RAM-consuming, and I can't imagine loading the CHH data (6 samples per analysis) into R at all. How should I handle data at this scale? I have never analyzed WGBS data before and no one in our lab has experience with it. Is it normal to get files this large?
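For reference, my extraction command was essentially the following (the genome and BAM file names are placeholders for my actual files):

    # one Bismark BAM per sample; --CHG/--CHH write separate per-context
    # bedGraph files alongside the CpG one
    MethylDackel extract --CHG --CHH -o sample1 genome.fa sample1.deduplicated.bam
    # -> sample1_CpG.bedGraph, sample1_CHG.bedGraph, sample1_CHH.bedGraph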
I'm also aware of the C-based tool CGmapTools, which might handle data of this size, but it doesn't seem able to use multiple cores, and it relies on Fisher's exact test, which doesn't use replicates. I don't think merging replicates is the better statistical approach when you have biological replicate information. Are there other tools you would recommend for analyses this large?
Thanks
Ziliang
If you're interested in finding differentially methylated regions, my experience is that metilene is fast and nowhere near as RAM-hungry as R. You can preprocess your Bismark/MethylDackel calls with awk or similar to produce metilene's input format, avoiding R entirely; a sketch follows below.
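A minimal sketch of that preprocessing, assuming MethylDackel's default bedGraph output (a track line, then chrom / start / end / %methylation / #methylated / #unmethylated). The sample names (ctrl_1..3, trt_1..3), coverage cutoff, group prefixes, and thread count are all assumptions to adapt:

    # 1. Per sample: skip the track line, require e.g. >=4x coverage (this also
    #    shrinks the huge CHH files), and convert percent methylation to a 0-1 ratio.
    for s in ctrl_1 ctrl_2 ctrl_3 trt_1 trt_2 trt_3; do
        awk -v OFS='\t' 'NR > 1 && ($5 + $6) >= 4 { print $1, $2, $3, $4/100 }' \
            "$s"_CHH.bedGraph > "$s"_CHH.ratio.bedGraph
    done

    # 2. Merge samples into one matrix (inputs must be sorted consistently, which
    #    MethylDackel output normally is), drop the interval-end column, and add
    #    the header metilene expects; positions missing in a sample become NA
    #    (the metilene_input.pl helper shipped with metilene does much the same
    #    via bedtools -- check your version's accepted missing-value marker).
    bedtools unionbedg -filler NA -i \
        ctrl_1_CHH.ratio.bedGraph ctrl_2_CHH.ratio.bedGraph ctrl_3_CHH.ratio.bedGraph \
        trt_1_CHH.ratio.bedGraph trt_2_CHH.ratio.bedGraph trt_3_CHH.ratio.bedGraph |
        awk -v OFS='\t' 'BEGIN { print "chr","pos","ctrl_1","ctrl_2","ctrl_3","trt_1","trt_2","trt_3" }
                         { print $1, $2, $4, $5, $6, $7, $8, $9 }' \
        > metilene_CHH_input.tsv

    # 3. Call DMRs; -a/-b must match the group prefixes used in the header.
    metilene -a ctrl -b trt -t 4 metilene_CHH_input.tsv > metilene_CHH_DMRs.tsv

Since everything here is streamed through awk and bedtools, you never hold the full CHH table in memory the way an R session would.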
Thanks, I'll give it a try.