Iterating through paired-end reads in samtools / pysam
3
1
Entering edit mode
6.8 years ago
olavur ▴ 150

I want to process both reads in a paired-end read simultaneously, but don't know how to do this efficiently.

I'm currently doing it the following way, using the pysam Python wrapper for for the htslib API (Samtools).

samfile = AlignmentFile(filename, 'rb')  # BAM file reader.
# Iterate through reads.
for read in samfile:
    # Check that the read has a pair that is mapped and not a duplicate.
    if read.is_paired and not read.mate_is_unmapped and not read.is_duplicate:
        # Get the other read in the pair.
        read_mate = samfile.mate(read)

This is really slow however.

From what I understand, the samfile.mate(read) call makes the BAM file reader jump to another place in the BAM. So it sounds like it would be a good idea to have two readers open and using one to iterate through the reads and the other to fetch the mate read, but that doesn't speed things up. I don't know whether paired-end reads are adjacent in a sorted and indexed BAM file, in that I case I would be able to take advantage of that, and do something along the lines of:

reads = samfile.fetch()
for i in range(number_of_pairs):
    read1 = next(reads)
    read2 = next(reads)

Anyone know how I can do this in a way that won't take forever to run?

BAM SAM Python • 14k views
ADD COMMENT
1
Entering edit mode

Anyone know how I can do this in a way that won't take forever to run?

sort your bam on query name

ADD REPLY
0
Entering edit mode

I tried this, and from what I understand you can't index a BAM that isn't sorted by position. This makes life quite difficult, as many tools need the BAM to be indexed. Thanks for the suggestion. I'm not sure how to make it work.

ADD REPLY
0
Entering edit mode

okay, why do you need to fetch the mate ? what is the aim of your project ?

ADD REPLY
0
Entering edit mode

I have linked-read data where the reads that are close together have the same barcode. I want to try to check that the paired reads have the same barcode, with the assumption that a barcode mismatch means there were spurious alignments. Among other things, I think this might help me get a better estimate of insert size, assuming that the spurious alignments give false predictions of insert size. I think I will put this project on hold as I feel I lack the understanding to do this properly, but nevertheless I think it would be useful to be able to obtain the mate pairs in an efficient and fast manner.

ADD REPLY
5
Entering edit mode
6.4 years ago
ggydush ▴ 50

My method puts reads into a dictionary and will output reads once a pair is found. Usually pairs are near each other so the dictionary shouldn't get that big.

from collections import defaultdict
import pysam


def read_pair_generator(bam, region_string=None):
    """
    Generate read pairs in a BAM file or within a region string.
    Reads are added to read_dict until a pair is found.
    """
    read_dict = defaultdict(lambda: [None, None])
    for read in bam.fetch(region=region_string):
        if not read.is_proper_pair or read.is_secondary or read.is_supplementary:
            continue
        qname = read.query_name
        if qname not in read_dict:
            if read.is_read1:
                read_dict[qname][0] = read
            else:
                read_dict[qname][1] = read
        else:
            if read.is_read1:
                yield read, read_dict[qname][1]
            else:
                yield read_dict[qname][0], read
            del read_dict[qname]

bam = pysam.AlignmentFile(filename, 'rb')
for read1, read2 in read_pair_generator(bam):
    # do stuff
ADD COMMENT
0
Entering edit mode

Thank you, an elegant solution. Have you used this method a lot?

ADD REPLY
0
Entering edit mode

I use this function anytime I need to iterate in pairs... I've used it on a number of BAMs.

Hope it helps!

ADD REPLY
0
Entering edit mode

@ ggydush your method is useful , but when i want to find all PE reads which cover 1 fixed coordinates or small target region (1-10bp),this method can't find all PE reads,because some PE reads have long insert size than the read length , so when one PE read just read1 or read2 cover the target region , this method don't work;when i expand the target region( sometime up 200-300bp ) to cover more PE reads, it cost more time especially in deep depth bam file,so i want to find all PE reads which cover small target region (1-10bp) and fast ,do you have some advice on this?

ADD REPLY
0
Entering edit mode

This will not work for a BAM file with multiple alignments for the same read pair, am I correct?

ADD REPLY
0
Entering edit mode

I modified a bit based on the @gizone1 's reply. If convenient, pls take a trial: def read_pair_generator(bam, region_string=None): """ Generate read pairs in a BAM file or within a region string. Reads are added to read_dict until a pair is found. """ read_dict = defaultdict(lambda: [None, None]) for align in bam.fetch(until_eof=True, region=region_string): if not align.is_paired: continue qname = align.query_name if qname not in read_dict: if align.is_read1: read_dict[qname][0] = [align] else: read_dict[qname][1] = [align] elif align not in read_dict[qname][0] and align not in read_dict[qname][1]: if align.is_read1: read_dict[qname][0].extend(align) else: read_dict[qname][1].extend(align) else: if align.is_read1: yield read_dict[qname][0], read_dict[qname][1] else: yield read_dict[qname][0], read_dict[qname][1] del read_dict[qname] I change the name from "read" to "align" since each row is an alignment record instead of a read record. Instead of return one pysam.AlignmentSegment object, I returned a list of pysam.AlignmentSegment objects.

And you can call it like this: for read1_aligns, read2_aligns in read_pair_generator(ex_sam):

In this case, read1_aligns will be a python list including all the pysam.AlignmentSegment objects corresponding to this read1. Pls try this out and see if this works.

ADD REPLY
1
Entering edit mode
6.5 years ago
Michi ▴ 990

I agree that it should be a bit simpler, given that paired end bam files are so common.

I used the following solution, which is only applicable if the .bam file has been previously sorted by read names samtools sort -n!

samfile = AlignmentFile(filename, 'rb')  # BAM file reader.
# Iterate through reads.
read1 = None
read2 = None

for read in samfile:

    if not read.is_paired or read.mate_is_unmapped or read.is_duplicate:
          continue

    if read.is_read2:
        read2 = read
    else:
        read1 = read
        read2 = None
        continue

     if not read1 is None and not read2 is None and read1.query_name == readf2.query_name:
          print("found a pair!")
          ## do your stuff
ADD COMMENT
0
Entering edit mode

Pierre Lindenbaum did suggest this in a comment above. Did you try it? Is it fast, and does it get all reads (assuming they're a proper pair, non-duplicate, etc.)? It may be worth the overhead of sorting by query name.

ADD REPLY
0
Entering edit mode
4.8 years ago
u3005579 • 0

This is based on @gizone1's answer.

def read_pair_generator(bam, region_string=None):
"""
Generate read pairs in a BAM file or within a region string.
Reads are added to read_dict until a pair is found.
"""
read_dict = defaultdict(lambda: [None, None])
for align in bam.fetch(until_eof=True, region=region_string):
    if not align.is_paired:
        continue
    qname = align.query_name
    if qname not in read_dict:
        if align.is_read1:
            read_dict[qname][0] = [align]
        else:
            read_dict[qname][1] = [align]
    elif align not in read_dict[qname][0] and align not in read_dict[qname][1]:
        if align.is_read1:
            read_dict[qname][0].extend(align)
        else:
            read_dict[qname][1].extend(align)
    else:
        if align.is_read1:
            yield read_dict[qname][0], read_dict[qname][1]
        else:
            yield read_dict[qname][0], read_dict[qname][1]
        del read_dict[qname]

I change the name from "read" to "align" since each row is an alignment record instead of a read record. Instead of return one pysam.AlignmentSegment object, I returned a list of pysam.AlignmentSegment objects.

And you can call it like this: for read1_aligns, read2_aligns in read_pair_generator(ex_sam):

In this case, read1_aligns will be a python list including all the pysam.AlignmentSegment objects corresponding to this read1. Pls try this out and see if this works.

ADD COMMENT
0
Entering edit mode
def read_pair_generator(bam, region_string=None):
"""
Generate read pairs in a BAM file or within a region string.
Reads are added to read_dict until a pair is found.
"""
read_dict = defaultdict(lambda: [[], []])
bam.reset()
for align in bam.fetch(until_eof=True, region=region_string):
    if not align.is_proper_pair:
        continue
    qname = align.query_name
    # print(align, type(align))
    # print("current read_dict is:", read_dict)
    print("read1_aligns: ", read_dict[qname][0], type(read_dict[qname][0]))
    print("read2_aligns: ", read_dict[qname][1], type(read_dict[qname][1]))
    if qname not in read_dict.keys():
        align_list = [align]
        # print(align_list, type(align))
        if align.is_read1:
            read_dict[qname][0] = align_list
            # print(read_dict[qname][0], type(read_dict[qname][0]))
        else:
            read_dict[qname][1] = align_list
            # print(read_dict[qname][1], type(read_dict[qname][1]))
    elif align in read_dict[qname][0] or align in read_dict[qname][1]:
        if align.is_read1:
            yield read_dict[qname][0], read_dict[qname][1]
        else:
            yield read_dict[qname][0], read_dict[qname][1]
        del read_dict[qname]
    else:
        if align.is_read1:
            read_dict[qname][0].append(align)
        else:
            read_dict[qname][1].append(align)

Sorry, fixed something. This is the working one.

ADD REPLY

Login before adding your answer.

Traffic: 1007 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6