I am working with paired-end bisulphite-treated short-read data. I would like to extract only the portion of the reads where the two reads overlap. I have raw fastq files, and sorted bams after alignment using Bismark. Just clipping N bp from the 5 or 3 prime end isn't a robust solution because the length of the non-overlapping regions vary depending on insert size.
Can anyone suggest away to just extract the portion of read pairs that overlap with one another?
Anticipating that someone will ask: the reason I want to do this is that I'm looking at data from a new bisulphite protocol, and it looks like the methylation calls are only reliable in the area where the reads overlap. I have not worked out why.
Name sort the BAM file, and then using biopython (
Bio.AlignIO
) you can find the overlap region by taking the maximum start value, and minimum end value for the pairs. Obviously check each pair for overlap first before doing this.Thanks. When you say 'name sort' do you mean something like
samtools sort
?Yup, I'd say so. Name sorting is done with
samtools sort -n
.Also it seems that
Bio.AlignIO
can't handle bam files directly, so i'd have to convert to another format first. What would be a sensible format to convert to that doesn't destroy the alignment information? I am new to this and the array of formats is still confusingRather than Bio.AlignIO, I would recommend parsing the file with
pysam
and then usingget_overlap
(see here) find out if consecutive reads overlap