I just came across this problem, and this is the solution I'm using right now. I'm sharing it here because this is one of the first and most recent results that comes up when googling the problem.
My use case is as follows:
- I have a list of positions I need to analyze
- I need to get all the short reads at each position, so I need a position-sorted and indexed .bam file
- For each short read in each position, I need the mate information (location, sequence, mapping quality)
- As OP said, the AlignmentFile.mate(r) method is too slow (see the baseline sketch right after this list).
- I also tried using a .bam file sorted by name, without success.
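For reference, this is roughly the baseline I started from: fetch the reads at each position and call AlignmentFile.mate() on each of them. A minimal sketch with hypothetical paths and positions; mate() does a fresh lookup in the file for every call, which is what makes it slow.

import pysam

# Hypothetical inputs: a coordinate-sorted, indexed .bam and a couple of breakpoints.
bam = pysam.AlignmentFile("sample.bam", "rb")
positions = [("chr1", 1_000_000), ("chr2", 2_500_000)]

for chrom, pos in positions:
    for read in bam.fetch(chrom, pos, pos + 1):
        if not read.is_paired or read.mate_is_unmapped:
            continue  # mate() raises ValueError for these reads
        mate = bam.mate(read)  # slow: one lookup in the .bam per read
        print(read.query_name, mate.reference_name, mate.reference_start,
              mate.mapping_quality, mate.query_sequence)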
My current solution, inspired by this post, is to use the IndexedReads class from the pysam library, which (quoting its documentation) will "Index a Sam/BAM-file by query name while keeping the original sort order intact."
After indexing, you can use it to access the mates much more quickly than with a position-sorted .bam file. On my system, getting the mates of 200 short reads (200 is the number of reads I get at a particular position) takes:
- Around 9 seconds to get the 200 mates using AlignmentFile.mate(r)
- Around 0.26 seconds to get the 200 mates using the index.
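To reproduce this kind of comparison on your own data, a rough timing sketch (hypothetical path and position; the one-off IndexedReads.build() cost is kept out of the timed section, since it is discussed separately below; secondary/supplementary alignments would need extra filtering in real code):

import time
import pysam

bam = pysam.AlignmentFile("sample.bam", "rb")          # hypothetical path
reads = [r for r in bam.fetch("chr1", 1_000_000, 1_000_001)
         if r.is_paired and not r.mate_is_unmapped]

t0 = time.perf_counter()
mates_slow = [bam.mate(r) for r in reads]              # one file lookup per read
t1 = time.perf_counter()

name_idx = pysam.IndexedReads(bam, multiple_iterators=True)
name_idx.build()                                       # one-off cost, see below
t2 = time.perf_counter()
mates_fast = []
for r in reads:
    for cand in name_idx.find(r.query_name):
        if cand.is_read1 != r.is_read1:                # keep the other read of the pair
            mates_fast.append(cand)
t3 = time.perf_counter()

print(f"mate():       {t1 - t0:.2f} s for {len(reads)} reads")
print(f"IndexedReads: {t3 - t2:.2f} s for {len(reads)} reads")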
This is definitely an improvement, but using the index has some overhead. For a .bam file of around 50 GB (whole genome), on my system it requires:
- 51 minutes to build the index
- 130 GB of RAM
To skip building the index every time you run your script, the obvious solution is to save it to disk. Unfortunately pysam does not offer this option (as far as I'm aware), and neither does samtools. I was not able to find a way to save the whole IndexedReads object from the outside, so we have to modify the pysam library, as suggested in PR 1037.
The PR adds load() and store() methods to the IndexedReads class. Some numbers (using pickle, not json as in the PR):
- Time to store the index: 13 minutes
- Time to load the index: 16 minutes
- RAM after loading: 150 GB (for some reason this is higher than after building; I guess some pickle object is not freed correctly).
Unfortunately the loading step still takes some time (keep in mind my system has a hard disk, so with an SSD it should improve).
From some quick math (roughly the index load time divided by the time saved per breakpoint), for this method to be beneficial I would need to process more than 114 breakpoints of 200 short reads each, which is my case.
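A sketch of that back-of-the-envelope calculation, using the rounded timings above (the exact break-even point depends on the unrounded measurements):

# Per-breakpoint timings from above (seconds), plus the index load time.
t_mate_per_bp  = 9.0        # AlignmentFile.mate() for ~200 reads at one breakpoint
t_index_per_bp = 0.26       # IndexedReads.find() for the same reads
t_load_index   = 16 * 60    # loading the pickled index from disk

# The index pays off once the time saved per breakpoint covers the load time.
break_even = t_load_index / (t_mate_per_bp - t_index_per_bp)
print(f"worth it above ~{break_even:.0f} breakpoints")  # ~110 with these rounded numbers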
The solution is not perfect, but it may help in the following conditions:
- You have enough RAM available.
- You need to get the mates of a lot of short reads.
- You plan on running the script multiple times.
Some snippets of code that may help:
- Build, load and store index:
from os import path
from pysam import AlignmentFile, IndexedReads

# Open the position-sorted .bam and wrap it in a query-name index
self.shorts: AlignmentFile = AlignmentFile(shorts_file, "rb")
self.shorts_name_idx: IndexedReads = IndexedReads(self.shorts, multiple_iterators=True)
shorts_name_idx_file: str = shorts_file.replace('.bam', '_name_idx.pickle')
if path.exists(shorts_name_idx_file):
    # Index already on disk: load it (load()/store() come from the PR 1037 patch below)
    self.log_inf.info(f"{shorts_name_idx_file} found, load")
    self.shorts_name_idx.load(shorts_name_idx_file)
    self.log_inf.info(f"loaded {shorts_name_idx_file}")
else:
    # First run: build the index and store it for the next runs
    self.log_inf.info(f"{shorts_name_idx_file} not found, build")
    self.shorts_name_idx.build()
    self.shorts_name_idx.store(shorts_name_idx_file)
    self.log_inf.info(f"built {shorts_name_idx_file}")
- Iterate over the reads at a position and get the mate using the index:
from typing import Optional
from pysam import AlignedSegment, IteratorRow

for read in self.bam.fetch(self.chrom, self.pos):
    current_pos: int = self.shorts.tell()
    try:
        found: IteratorRow = self.shorts_name_idx.find(read.query_name)
    except KeyError as e:
        print(f"Cannot get reads with name {read.query_name}: {e}")
    else:
        # The mate is the read that shares the name but has the other read1/read2 flag
        mate: Optional[AlignedSegment] = None
        for found_read in found:
            if (found_read.is_read1 and read.is_read2) or \
                    (found_read.is_read2 and read.is_read1):
                mate = found_read
        if mate is not None:
            print(mate.query_sequence)
    finally:
        # For some reason we need to reset the iterator, even if the
        # IndexedReads is supposed to use a different iterator.
        self.shorts.seek(current_pos)
- The patch that I'm using for pysam:
diff --git a/pysam/libcalignmentfile.pyi b/pysam/libcalignmentfile.pyi
index 74637f8..b3694d8 100644
--- a/pysam/libcalignmentfile.pyi
+++ b/pysam/libcalignmentfile.pyi
@@ -235,3 +235,5 @@ class IndexedReads:
     ) -> None: ...
     def build(self) -> None: ...
     def find(self, query_name: str) -> IteratorRow: ...
+    def store(self, filename: str) -> None: ...
+    def load(self, filename: str) -> None: ...
diff --git a/pysam/libcalignmentfile.pyx b/pysam/libcalignmentfile.pyx
index 1ee799b..7c75655 100644
--- a/pysam/libcalignmentfile.pyx
+++ b/pysam/libcalignmentfile.pyx
@@ -64,6 +64,8 @@ except ImportError:
 import re
 import warnings
 import array
+import pickle
+
 from libc.errno cimport errno, EPIPE
 from libc.string cimport strcmp, strpbrk, strerror
 from libc.stdint cimport INT32_MAX
@@ -2928,6 +2930,16 @@ cdef class IndexedReads:
         self.header = samfile.header
         self.owns_samfile = False
 
+    def store(self, filename):
+        '''Save the index to disk, in the specified filename, using pickle dump.'''
+        with open(filename, 'wb') as f:
+            pickle.dump(self.index, f)
+
+    def load(self, filename):
+        '''Load the index from disk, in the specified filename, using pickle load.'''
+        with open(filename, 'rb') as f:
+            self.index = pickle.load(f)
+
     def build(self):
         '''build the index.'''
In the patch I'm using pickle instead of json, as it gives a smaller index size. The patch is based on commit 8463ba474dfe.
To install the patched pysam in your env, you can:
- Clone the pysam repo and move to the folder
- Apply the patch
- Install cython with pip install cython
- Install pysam using the setup file with python setup.py install
Are the BAM/SAMs name sorted? Finding mates in a coordinate sorted file will be slow regardless of the method.
it is sorted, yes.
What is your goal in finding the mate of the read? It may guide further answers; e.g. if you just need to find out how far away it is, you can use the RNEXT/PNEXT fields instead of actually finding the read.
I need the read itself - i.e. the corresponding AlignedSegment
Name-sorting the BAM file is one option: the pairs share the same name, so name-sorting the file will put the mates next to each other. Another is https://github.com/ssadedin/bazam, which works on coordinate-sorted input, but it may or may not mesh with your Python workflow.
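For reference, a minimal sketch of the name-sorted approach described in the comment above: stream the file and group consecutive primary alignments by query_name (hypothetical path; note it streams the whole file rather than giving random access by position).

from itertools import groupby
import pysam

with pysam.AlignmentFile("sample.namesorted.bam", "rb") as bam:
    # A name-sorted .bam has no .bai index, so iterate the whole file in order.
    primary = (r for r in bam.fetch(until_eof=True)
               if not r.is_secondary and not r.is_supplementary)
    for name, group in groupby(primary, key=lambda r: r.query_name):
        pair = list(group)
        if len(pair) == 2:
            read1, read2 = pair
            print(name, read1.reference_name, read2.reference_name)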