Filter only overlapping portion of read pairs?
0
0
Entering edit mode
23 months ago
Tom Ellis • 0

I am working with paired-end bisulphite-treated short-read data. I would like to extract only the portion of the reads where the two reads overlap. I have raw fastq files, and sorted bams after alignment using Bismark. Just clipping N bp from the 5 or 3 prime end isn't a robust solution because the length of the non-overlapping regions vary depending on insert size.

Can anyone suggest away to just extract the portion of read pairs that overlap with one another?

Anticipating that someone will ask: the reason I want to do this is that I'm looking at data from a new bisulphite protocol, and it looks like the methylation calls are only reliable in the area where the reads overlap. I have not worked out why.

bisulfite paired-end • 1.2k views
ADD COMMENT
0
Entering edit mode

Name sort the BAM file, and then using biopython (Bio.AlignIO) you can find the overlap region by taking the maximum start value, and minimum end value for the pairs. Obviously check each pair for overlap first before doing this.

ADD REPLY
0
Entering edit mode

Thanks. When you say 'name sort' do you mean something like samtools sort?

ADD REPLY
0
Entering edit mode

Yup, I'd say so. Name sorting is done with samtools sort -n.

ADD REPLY
0
Entering edit mode

Also it seems that Bio.AlignIO can't handle bam files directly, so i'd have to convert to another format first. What would be a sensible format to convert to that doesn't destroy the alignment information? I am new to this and the array of formats is still confusing

ADD REPLY
0
Entering edit mode

Rather than Bio.AlignIO, I would recommend parsing the file with pysam and then using get_overlap (see here) find out if consecutive reads overlap

ADD REPLY

Login before adding your answer.

Traffic: 2676 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6