I just came across this problem, and this is the solution I'm using right now. I'm sharing it here because this is one of the first and most recent results that comes up when googling the problem.
My use case is as follows:
- I have a list of positions I need to analyze
- I need to get all the short reads at each position, so I need a position-sorted and indexed .bam file
- For each short read in each position, I need the mate information (location, sequence, mapping quality)
- As OP said, the AlignmentFile.mate(r) method is too slow (see the baseline sketch right after this list).
- I also tried using a .bam file sorted by name, without success.
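For reference, this is roughly the baseline I started from: fetch the reads at each position and call AlignmentFile.mate() on each of them. A minimal sketch with hypothetical paths and positions; mate() does a fresh lookup in the file for every call, which is what makes it slow.

import pysam

# Hypothetical inputs: a coordinate-sorted, indexed .bam and a couple of breakpoints.
bam = pysam.AlignmentFile("sample.bam", "rb")
positions = [("chr1", 1_000_000), ("chr2", 2_500_000)]

for chrom, pos in positions:
    for read in bam.fetch(chrom, pos, pos + 1):
        if not read.is_paired or read.mate_is_unmapped:
            continue  # mate() raises ValueError for these reads
        mate = bam.mate(read)  # slow: one lookup in the .bam per read
        print(read.query_name, mate.reference_name, mate.reference_start,
              mate.mapping_quality, mate.query_sequence)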
My current solution, inspired by this post, is to use the IndexedReads class from the pysam library, which (quoting its documentation) will "Index a Sam/BAM-file by query name while keeping the original sort order intact."
After indexing, you can use it to access the mates much more quickly than with a position-sorted .bam file. On my system, getting the mates of 200 short reads (200 is the number of reads I get at a particular position) takes:
- Around 9 seconds to get the 200 mates using AlignmentFile.mate(r)
- Around 0.26 seconds to get the 200 mates using the index.
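To reproduce this kind of comparison on your own data, a rough timing sketch (hypothetical path and position; the one-off IndexedReads.build() cost is kept out of the timed section, since it is discussed separately below; secondary/supplementary alignments would need extra filtering in real code):

import time
import pysam

bam = pysam.AlignmentFile("sample.bam", "rb")          # hypothetical path
reads = [r for r in bam.fetch("chr1", 1_000_000, 1_000_001)
         if r.is_paired and not r.mate_is_unmapped]

t0 = time.perf_counter()
mates_slow = [bam.mate(r) for r in reads]              # one file lookup per read
t1 = time.perf_counter()

name_idx = pysam.IndexedReads(bam, multiple_iterators=True)
name_idx.build()                                       # one-off cost, see below
t2 = time.perf_counter()
mates_fast = []
for r in reads:
    for cand in name_idx.find(r.query_name):
        if cand.is_read1 != r.is_read1:                # keep the other read of the pair
            mates_fast.append(cand)
t3 = time.perf_counter()

print(f"mate():       {t1 - t0:.2f} s for {len(reads)} reads")
print(f"IndexedReads: {t3 - t2:.2f} s for {len(reads)} reads")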
This is definitely an improvement, but using the index has some overhead. For a .bam file of around 50 GB (whole genome), on my system it requires:
- 51 minutes to build the index
- 130 GB of RAM
To skip building the index every time you run your script, the obvious solution is to save it to disk. Unfortunately pysam does not offer this option (as far as I'm aware), and neither does samtools. I was not able to find a way to save the whole IndexedReads object from the outside, so we have to modify the pysam library, as suggested in PR 1037.
The PR adds load() and store() methods to the IndexedReads class. Some numbers (using pickle, not json as in the PR):
- Time to store the index: 13 minutes
- Time to load the index: 16 minutes
- RAM after loading: 150 GB (for some reason this is higher than after building; I guess some pickle object is not freed correctly).
Unfortunately the loading step still takes some time (keep in mind my system has a hard disk, so with an SSD it should improve).
From some quick math (roughly the index load time divided by the time saved per breakpoint), for this method to be beneficial I would need to process more than 114 breakpoints of 200 short reads each, which is my case.
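A sketch of that back-of-the-envelope calculation, using the rounded timings above (the exact break-even point depends on the unrounded measurements):

# Per-breakpoint timings from above (seconds), plus the index load time.
t_mate_per_bp  = 9.0        # AlignmentFile.mate() for ~200 reads at one breakpoint
t_index_per_bp = 0.26       # IndexedReads.find() for the same reads
t_load_index   = 16 * 60    # loading the pickled index from disk

# The index pays off once the time saved per breakpoint covers the load time.
break_even = t_load_index / (t_mate_per_bp - t_index_per_bp)
print(f"worth it above ~{break_even:.0f} breakpoints")  # ~110 with these rounded numbers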
The solution is not perfect, but it may help in the following conditions:
- You have enough RAM available.
- You need to get the mates of a lot of short reads.
- You plan on running the script multiple times.
Some snippets of code that may help:
- Build, load and store index:
from os import path
from pysam import AlignmentFile, IndexedReads

# Open the position-sorted .bam and wrap it in a query-name index
self.shorts: AlignmentFile = AlignmentFile(shorts_file, "rb")
self.shorts_name_idx: IndexedReads = IndexedReads(self.shorts, multiple_iterators=True)
shorts_name_idx_file: str = shorts_file.replace('.bam', '_name_idx.pickle')
if path.exists(shorts_name_idx_file):
    # Index already on disk: load it (load()/store() come from the PR 1037 patch below)
    self.log_inf.info(f"{shorts_name_idx_file} found, load")
    self.shorts_name_idx.load(shorts_name_idx_file)
    self.log_inf.info(f"loaded {shorts_name_idx_file}")
else:
    # First run: build the index and store it for the next runs
    self.log_inf.info(f"{shorts_name_idx_file} not found, build")
    self.shorts_name_idx.build()
    self.shorts_name_idx.store(shorts_name_idx_file)
    self.log_inf.info(f"built {shorts_name_idx_file}")
- Iterate over the reads at a position and get the mate using the index:
from typing import Optional
from pysam import AlignedSegment, IteratorRow

for read in self.bam.fetch(self.chrom, self.pos):
    current_pos: int = self.shorts.tell()
    try:
        found: IteratorRow = self.shorts_name_idx.find(read.query_name)
    except KeyError as e:
        print(f"Cannot get reads with name {read.query_name}: {e}")
    else:
        # The mate is the read that shares the name but has the other read1/read2 flag
        mate: Optional[AlignedSegment] = None
        for found_read in found:
            if (found_read.is_read1 and read.is_read2) or \
                    (found_read.is_read2 and read.is_read1):
                mate = found_read
        if mate is not None:
            print(mate.query_sequence)
    finally:
        # For some reason we need to reset the iterator, even if the
        # IndexedReads is supposed to use a different iterator.
        self.shorts.seek(current_pos)
- The patch that I'm using for pysam:
diff --git a/pysam/libcalignmentfile.pyi b/pysam/libcalignmentfile.pyi
index 74637f8..b3694d8 100644
--- a/pysam/libcalignmentfile.pyi
+++ b/pysam/libcalignmentfile.pyi
@@ -235,3 +235,5 @@ class IndexedReads:
     ) -> None: ...
     def build(self) -> None: ...
     def find(self, query_name: str) -> IteratorRow: ...
+    def store(self, filename: str) -> None: ...
+    def load(self, filename: str) -> None: ...
diff --git a/pysam/libcalignmentfile.pyx b/pysam/libcalignmentfile.pyx
index 1ee799b..7c75655 100644
--- a/pysam/libcalignmentfile.pyx
+++ b/pysam/libcalignmentfile.pyx
@@ -64,6 +64,8 @@ except ImportError:
 import re
 import warnings
 import array
+import pickle
+
 from libc.errno cimport errno, EPIPE
 from libc.string cimport strcmp, strpbrk, strerror
 from libc.stdint cimport INT32_MAX
@@ -2928,6 +2930,16 @@ cdef class IndexedReads:
         self.header = samfile.header
         self.owns_samfile = False
 
+    def store(self, filename):
+        '''Save the index to disk, in the specified filename, using pickle dump.'''
+        with open(filename, 'wb') as f:
+            pickle.dump(self.index, f)
+
+    def load(self, filename):
+        '''Load the index from disk, in the specified filename, using pickle load.'''
+        with open(filename, 'rb') as f:
+            self.index = pickle.load(f)
+
     def build(self):
         '''build the index.'''
In the patch I'm using pickle instead of json, as it gives a smaller index size. The patch is based on commit 8463ba474dfe.
To install the patched pysam in your env, you can:
- Clone the pysam repo and move to the folder
- Apply the patch
- Install cython with pip install cython
- Install pysam using the setup file with python setup.py install
Are the BAM/SAMs name sorted? Finding mates in a coordinate sorted file will be slow regardless of the method.
it is sorted, yes.
What is your goal in finding the mate of the read? It may guide further answers; e.g. if you just need to find out how far away it is, you can use the RNEXT/PNEXT fields instead of actually finding the read.
I need the read itself - i.e. the corresponding AlignedSegment
Name-sorting the BAM file is one option: the pairs share the same name, so name-sorting the file will put the mates next to each other. Another is https://github.com/ssadedin/bazam, which works on coordinate-sorted input, but it may or may not mesh with your Python workflow.
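For reference, a minimal sketch of the name-sorted approach described in the comment above: stream the file and group consecutive primary alignments by query_name (hypothetical path; note it streams the whole file rather than giving random access by position).

from itertools import groupby
import pysam

with pysam.AlignmentFile("sample.namesorted.bam", "rb") as bam:
    # A name-sorted .bam has no .bai index, so iterate the whole file in order.
    primary = (r for r in bam.fetch(until_eof=True)
               if not r.is_secondary and not r.is_supplementary)
    for name, group in groupby(primary, key=lambda r: r.query_name):
        pair = list(group)
        if len(pair) == 2:
            read1, read2 = pair
            print(name, read1.reference_name, read2.reference_name)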