Hi, I have bisulfite converted sequencing reads and, from my bam file, would like to select reads that end at a Msp1 digest site (C^CGG). The way I would think to go about doing that is making a bed file of CCGG sites in the genome (from the fasta file) and then selecting reads from the bam file that end at those sites. However, I'm not sure how to do either part of this.
Can anyone help me out or do you have better ideas? Thanks!
Yes, that would be correct- I mistyped in the initial question and have edited it.
How would I go about constructing that bed file? As for getting the reads that span that site, would I do that by reformatting the bam file into a bed-like format (adding the length of the read to the inital site so I check the end) or is there a tool I could use on the bam file itself?
For generating the BED file, you can use biopython, which can determine cut sites in DNA.
For finding the reads, just use samtools with the BED file and the BAM file. If you want anything more exact, then write a little script in pysam, though that'd be largely overkill.
I don't think samtools does what I want it to here. I want reads that end at an Msp1 digest site, not those that include a digest site anywhere within the read.
Using biopython works perfectly for the first part, thank you!
I know, but you'll probably find that 99% of the overlapping alignments end on a cut site, so you can save yourself the added time and ignore the 1% extra. If the extra precision is important then you can code something with pysam.