Hey, just a quick question,
So I have some RRBS data that I downloaded from SRA ( matched normal tumor pairs ) which I want to analyze to gain some experience working with this type of data. I decided to use bismark for aligning followed by metilene for DMR identification. While aligning in bismark, I saw there was an option --non_directional
to specify if the reads came from a non-directional library. Here is the publication containing the data whose methods refer to this paper for the library prep. Now reading the Material and Methods section, I got no idea if it is a directional prep or not.
Is there a way to figure out what is the directionality of the data? (Although the bismark documentation says that most Illumina RRBS preps are directional and --non_directional
is set to OFF by default).
By running the same sample in both directional and non-directionalmode, can I figure out the directionality?
Thanks a lot for your help. Cheers.
Check this answer by the Bismark author.
Thanks for that. So I checked the same sample in both directional and non-directional alignment and found that the mapping efficiency stays nearly the same (~68%). So the library is probably directional. I was worried because the methylation percentage of C in CpG was only 36% ( CHH and CHG are lower ~ 5%). I read the CpG methylation in mammals is ~70%. Although this is a tumor sample, so do you have any idea whether this is normal? Thanks again,
Yes, it is normal! Remember that RRBS does not profile the whole genome in an unbiased manner. Instead, it generates enzimatically-ennriched libraries of mostly CpG-dense regions (such as CpG islands), and CpG-dense regions have lower methylation levels than CpG-sparse regions in general.
Oh, that makes perfect sense!! Just one last doubt that I had. Now, while running FASTQC on my data I saw that the adapters were trimmed from all the reads and the authors stated that trimming was done using Trim Galore! I also read that for RRBS data, two extra bases from the 3' end of R1 and 5' end of R2 should be trimmed to prevent bias arising due to the methylation state of the C used for end repair. I don't know whether this step was performed on the data or not (the authors just stated that Trim Galore! was used).
How significant is the bias which arises if we do not trim the two bases?
Is there a way to check if the two base trimming was done for my data? Can we look for a signal in the per base sequence content of FASTQC to infer this?
I am attaching the per base sequence content for one of my samples. The other samples have a similar profile.
Thanks a ton!
If they used Trim Galore for trimming, they probably did that "special" trimming, because Trim Galore has an specific mode for trimming RRBS data.
Nonetheless, in the plot they seem to be 100 bp reads. If this is the case, if for R2 you see that the maximum sequence length goes only up to 98 bp (not 100 bp), you would have more confidence that they have had hard-trimmed those 2 bp specific for RRBS.
But, furthermore, looking at the fastQC sequence content profiles you show, we can see that they have been trimmed. The "spikes" you see in the reads are due to the enzymatic MspI digestion, which cuts DNA at an specific motif. In the following image I show you data of how paired-end RRBS looks before and after trimming: you can see very clearly how the 2 bp from 5' of R2 are removed. For 3' R1 it is harder to see because some reads also have adapter trimming, etc.
Yes, the images make it very clear that 5' end of R2 has been hard trimmed. So I can safely assume that the data has been trimmed and then uploaded.
Thanks so much for your patience. Much appreciated. Cheers!
no problem! good luck with the analyses ;)
Hey, sorry to be bothering you so soon again.
So I went ahead with the methylation extraction using Bismark which produced an M bias plot looking like this ( other sample plots look similar ).
As seen, R1 has a severe CHG methylation bias at the 5' end while R2 has biases at both ends of the read. So to remove them, I did the extraction step again with the
--ignore
flag inbismark_methylation_extractor
. Here is the full command that I usedOT and OB strands were merged and 1 base from 5' end of R1 was ignored while 2 bases from 5' end and 1 base from 3' end of R2 was ignored. This gave a better M bias plot as shown
Thanks again.
RRBS M-bias plots in general look strange. Check out these answers by the author: one (where he does not recommend ignoring any more positions for RRBS data), two.
The "roughness" across the reads is probably due to the low complexity of the RRBS data (maybe duplicated seqs or targeting repetitive seqs or things like that). The spikes at the end are argued to be from adapter trimming, although I would also expect to see a 3' spike in your previous Per Base Sequence Content plots. Also, spikes near the 5' ends are usually due to very low numbers of counts/calls (you may see it in the .txt file from which the M-bias plot is plotted) so probably less relevant.
Can the Per base sequence content module in FastQC be used to visually differentiate if a data is from WGBS or RRBS? I usually observed that for WGBS, the lines are usually flat while in this case its more of zigzag.
The per base sequence content is indeed usually flatter in WGBS because the library is much more complex (i.e. contains many more different/varied sequences). Though I've seen some RRBS plots which were quite flat.
Nonetheless, particularly in the untrimmed data, you will especially distinguish RRBS by the sequence bias at the first 3 bases of the 5' end due to all reads starting at C/TGG sites.