counting number of CpGs that are methylated and unmethylated in a read of whole genome bisulfite data
1
2
Entering edit mode
5.5 years ago

Hi all,

I need to get the number of CpGs that are methylated or, unmethylated per read of WGBS data. I am trying the calculate PDR values as described here. So far I made some progress but need some help from you guys. I can see this methylated or, unmethylated Cs on IGV. But I was wondering how can I get the information per reads.

paste <(samtools view file.bam |  awk '{print $3 "\t" $4 "\t" $4+length($10) "\t" $10}') <(bedtools bamtobed -i file.bam | awk '{print $6, $4}') | head -1000 | intersectBed -a CG.bed
-b stdin -wa -wb | awk '{print $0, length($7), $2-$5,  substr( $7, (($2-$5)), 2)}' | head

Note that CG.bed is the coordinates for CpGs in human genome. Last three columns are: read length, CpG location in read and 2 bases from that location. Output is:

chr1    10468   10470   chr1    10368   10469   CTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCC   + HWI-ST700660:309:C7T40ACXX:6:1202:9409:92454/2 101 100 CC
chr1    10468   10470   chr1    10368   10469   CTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCACCCTAACCCTAACCCTAACCCTA   + HWI-ST700660:309:C7T40ACXX:5:1107:12258:97118/2 101 100 TA
chr1    10468   10470   chr1    10369   10470   TAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTAACCCTAACCCTAACCCTAACCCTA   - HWI-ST699:265:C7RM3ACXX:6:2308:9444:13430/1 101 99 CT
chr1    10468   10470   chr1    10369   10470   TAATTTTAATTTTAATTTTAATTTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTATTTTAATTTTAATTTTAATTTTNATTTT   - HWI-ST539:254:C7T4WACXX:2:1204:2095:36010/1 101 99 TT
chr1    10468   10470   chr1    10370   10471   AACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCCTAACCCTAACCCTA   - HWI-ST699:265:C7RM3ACXX:7:1307:13501:63031/1 101 98 CC
chr1    10468   10470   chr1    10370   10471   AACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC   + HWI-ST700660:309:C7T40ACXX:6:2306:15611:23135/2 101 98 TA
chr1    10468   10470   chr1    10370   10471   AACCCTAACCCTAACCCGATCCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCCTAACCCTAACCCCT   - HWI-ST699:265:C7RM3ACXX:7:2313:19303:78256/1 101 98 CC
chr1    10468   10470   chr1    10371   10472   ATTTTAATTTTTAATTTTTAATTTTTAATTTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTTAATTCGAATTTTTAATTCGAATTTTATTCGA   - HWI-ST699:265:C7RM3ACXX:6:2101:2576:51036/1 101 97 TT
chr1    10468   10470   chr1    10371   10472   ACCCTAACCCTAACCCTAACCCCCTAACCCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCCTAGCCCTAACCCTAACCCTAACCCT   - HWI-ST700660:310:C7T58ACXX:4:2115:12614:9155/1 101 97 AC
chr1    10468   10470   chr1    10371   10472   ACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCCTAACCCTAACCCTAACCCTAACCCTAA   - HWI-ST539:254:C7T4WACXX:1:2211:11231:19289/1 101 97 CC

Two issues with this code are:

1) it is taking into consideration of soft clipped reads which I want to ignore. 2) indels may shift the position of CpGs in the reads which I am not considering. As a consequence, in the last column I can see the dyads of CC and TT which is not expected (ignoring SNVs right now) at a CpG location.

In summary, I want to get the number of CpGs methylated or, unmethylation per reads. Thanks!

PDR (Fig C,E) Landau DA et al, Locally disordered methylation forms the basis of intratumor methylome variation in chronic lymphocytic leukemia. Cancer Cell. 2014 Dec 8;26(6):813-825:

Fig 1: PDR (Fig C,E)

Fig2: IGV screen shot of my data:

Fig 1

Fig 2

next-gen sequencing genome • 2.6k views
ADD COMMENT
0
Entering edit mode

Interesting idea. I plan to analyze this, not sure what it might (potentially reflects heterogeneity) mean. Would you minding sharing your insight from your analysis or refer to your paper if it is already published.

ADD REPLY
5
Entering edit mode
5.5 years ago

Have a look at MethylDackel perRead, which will do what you want properly.

ADD COMMENT
0
Entering edit mode

Thanks a lot for pointing me such a nice tool. The following command worked for me is:

MethylDackel perRead hg38.fa wgbs.bam

The output example is below. Can I interpret the output like that- read name, chr, start position of the read. For last two columns- there are 3 CpGs in those reads but in first and second reads 2/3 of the CpGs are methylated, for 3rd and 4th reads all 3/3 CpGs are methylated.

^CHS16_68:4:2113:9438:23805 chr10   4412865 66.666667   3
HS16_68:4:2309:8702:70123   chr10   4412866 66.666667   3
HS17_48:2:1103:6100:15523   chr10   4412872 100.000000  3
HS16_68:3:2109:20578:17792  chr10   4412872 100.000000  3
ADD REPLY
0
Entering edit mode

Yes, that's correct.

ADD REPLY
0
Entering edit mode

Thanks again. That was very helpful.

ADD REPLY
0
Entering edit mode

Hi Devon Ryan, I've been using methylDackel extract for extracting methylation calls in our targeted bisulfite sequencing data. Thanks for such a great tool! We are now interested in looking at haplotypes with CpG islands and I would like to know if/how perRead output can be used for our purpose. Thanks.

ADD REPLY
0
Entering edit mode

Which BAM file should be ideally used? .bam, .sorted.bam, .marked.sorted.bam, .dedup.sorted.bam etc. ?

ADD REPLY
0
Entering edit mode

.sorted.bam or .dedup.sorted.bam if you already have it and want to ignore possible PCR duplicates.

ADD REPLY

Login before adding your answer.

Traffic: 1658 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6