Hi, I'm using bismark for bs-seq mapping and methylation calling. I take a glance at the coverage file generated by bismark2bedGraph. However, the result seems strange. Take CpG site of chrM for example (because chrM is the smallest chromosome). The coverage file contains sites that are not CpG sites. Here is a part of one CpG coverage file.
chrM 96 96 0 0 95
chrM 97 97 0 0 92
chrM 100 100 0 0 1
chrM 105 105 0 0 92
chrM 106 106 0 0 93
... ...
chrM 2003 2003 0 0 170
chrM 2004 2004 0 0 116
chrM 2107 2107 0 0 1
chrM 2204 2204 0 0 163
chrM 2205 2205 0 0 156
... ...
chrM 2826 2826 0 0 125
chrM 2827 2827 0 0 147
chrM 2833 2833 0 0 1
chrM 2845 2845 0 0 130
chrM 2846 2846 0 0 142
It's strange that site 100,2107,2833 are mixed in the CpG coverage file. And on the same sites of the reference, these three sites are not CpG sites. (here I call it false CpG.)
Then I count total CpG sites on chrM. There are 439 CpG sites on the plus strand, whick means 878 CpG on both strands. Here is the code.
input = open ("chrM",'r').read()
cg = 0
total = 0
print 'Num\tStart\tEnd'
for i in range(0, len(input)):
if i <= 16570:
substr=input[i:i+2]
if substr.lower() == 'cg' :
total += 1
print '%s\t%s\t%s'%(total,i+1,i+2)
So number of lines in chrM CpG coverage file should always be equal to or less than 878. But the truth is, for two bs-seq data, one is 885, another is 900. Both of them contains false CpG. And the sites are different between them. For example, site 100 in the first file doesn't appear in the second one. Another, all of the false CpG only have one unmethylated reads.
Here are my commands (%s represents parameters):
bismark --multicore %s --output_dir %s %s -1 %s -2 %s
bismark_methylation_extractor -p --no_overlap --comprehensive --multicore %s --ignore %s --ignore_r2 %s --ignore_3prime %s --ignore_3prime_r2 %s --output %s %s
bismark2bedGraph --dir %s --remove_spaces --CX --ample_memory --output CpG %s
First, I use --comprehensive to produce CpG,CHH and CHG context output files. Then I use bismark2bedGraph and --CX to process CpG,CHH and CHG context output files separately. In this way I can get CpG,CHH and CHG coverage files separately.
In addition, my version is v0.16.1. I haven't test it using the newest one.
Is this a bug or am I using the wrong commands ? Thank you very much for sharing your idea.
Which Genomic Version? hg19 or hg38?
I suspect this is an odd bug in bismark. What happens if you use a different too (e.g., PileOMeth)?