Hello, does anybody have suggestions for how to simulate RRBS reads with a known methylation level? I have a reference for a non-model species and want to make sure reads map appropriately before beginning data generation.
There is a bisulfite sequence simulator that may help. What you also need is a BED file of regions that are well covered in the genome with RRBS. Then use bedtools getFasta to extract the sequence of all RRBS covered regions. Then use Sherman to generate simulated bisulfite seq data for these regions.
theoretically, for directional RRBS library, reads start with CGG or TGG. why in a real experiment we see reads start from 3 different bases (although very few of them)? Thanks.
Star activity (MspI or whatever else is used cutting outside its normal site) or the general degradation caused by bisulfite treatment are the likely culprits in my mind.
Ideally RRBS should select genome fragments starting and ending with CCGG and of size between, say, 50 and 1000bp, right? So you could first extract such fragments and produce a reference genome containing only these fragments. Then you could run Sherman as suggested by mark.ziemann on this restricted reference. By the way, if you just want to know how many theoretical fragments you can obtain from RRBS and whether these are mappable you don't need to assume a particular methylation level since BS-Seq mappers bioinformatically "demethylate" all the reads before mapping.
I wrote a simple program to extract sequences from a fasta file matching a regular expression (fastaRegexFinder.py). To obtain RRBS fragments from a reference you could use something like:
theoretically, for directional RRBS library, reads start with CGG or TGG. why in a real experiment we see reads start from 3 different bases (although very few of them)? Thanks.
Star activity (MspI or whatever else is used cutting outside its normal site) or the general degradation caused by bisulfite treatment are the likely culprits in my mind.