Help simulating RRBS reads
2
0
Entering edit mode
9.4 years ago
Jautis ▴ 580

Hello, does anybody have suggestions for how to simulate RRBS reads with a known methylation level? I have a reference for a non-model species and want to make sure reads map appropriately before beginning data generation.

Anything helps, thank you!

methylation RRBS • 2.8k views
ADD COMMENT
2
Entering edit mode
9.4 years ago
mark.ziemann ★ 1.9k

There is a bisulfite sequence simulator that may help. What you also need is a BED file of regions that are well covered in the genome with RRBS. Then use bedtools getFasta to extract the sequence of all RRBS covered regions. Then use Sherman to generate simulated bisulfite seq data for these regions.

ADD COMMENT
1
Entering edit mode

theoretically, for directional RRBS library, reads start with CGG or TGG. why in a real experiment we see reads start from 3 different bases (although very few of them)? Thanks.

ADD REPLY
2
Entering edit mode

Star activity (MspI or whatever else is used cutting outside its normal site) or the general degradation caused by bisulfite treatment are the likely culprits in my mind.

ADD REPLY
1
Entering edit mode
9.4 years ago

Ideally RRBS should select genome fragments starting and ending with CCGG and of size between, say, 50 and 1000bp, right? So you could first extract such fragments and produce a reference genome containing only these fragments. Then you could run Sherman as suggested by mark.ziemann on this restricted reference. By the way, if you just want to know how many theoretical fragments you can obtain from RRBS and whether these are mappable you don't need to assume a particular methylation level since BS-Seq mappers bioinformatically "demethylate" all the reads before mapping.

I wrote a simple program to extract sequences from a fasta file matching a regular expression (fastaRegexFinder.py). To obtain RRBS fragments from a reference you could use something like:

fastaRegexFinder.py -q -r 'CCGG(.(?<!CCGG)){50,1000}(?=CCGG)' -f genome.fa --noreverse

The extracted sequences are in the last column of output, you should append CCGG to these. Example:

cat genome.fa
>seq1
zzCCGGaaaCCGGbbbbCCGGxxxxxxCCGGdddCCGGeeeeCCGGCCGGfffCCGGhhhhCCGGiiiiiCCGGlllCCGGmmmmCCGGzz
>seq2
zzCCGGaaaaaaaCCGGbbbbbbbCCGGzzzzzzzzzzzzzzzzzzzzzzzzzzzzzCCGGaaa
>seq3
zzCCGGaaaaaaaCCGGCCGGbbbbbbbCCGGzzz

fastaRegexFinder.py -q -r 'CCGG(.(?<!CCGG)){5,10}(?=CCGG)' -f genome.fa --noreverse
seq1    17    27    seq1_17_27_for    10    +    CCGGxxxxxx
seq1    61    70    seq1_61_70_for    9    +    CCGGiiiii
seq2    2    13    seq2_2_13_for    11    +    CCGGaaaaaaa
seq2    13    24    seq2_13_24_for    11    +    CCGGbbbbbbb
seq3    2    13    seq3_2_13_for    11    +    CCGGaaaaaaa
seq3    17    28    seq3_17_28_for    11    +    CCGGbbbbbbb

Did I get it right? does it make sense?

ADD COMMENT

Login before adding your answer.

Traffic: 808 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6