I am creating a custom softclipping tool due to a limitation found in Ampliconclip Issue softclipping reads when they belong and don't belong to a common amplicon.
I am progressing ok in my development and I am developing tests.
I found some limitations when I try to mock a Pysam read.
I show how I am mocking Pysam reads so far
import pytest
from softclip import softclip_positive_read1
# Mock the pysam.AlignedSegment object
class MockRead:
def __init__(self, reference_start, cigartuples, cigarstring):
self.reference_start = reference_start
self.cigartuples = cigartuples
self.cigarstring = cigarstring
self.reference_id = None
self.mapping_quality = None
self.next_reference_id = None
self.next_reference_start = None
self.template_length = None
self.flag = 0
def infer_query_length(self):
# This function calculates the total length of the query sequence in the read,
# considering only certain types of CIGAR operations.
# It iterates over the CIGAR tuples of the read. Each tuple consists of an operation code (op) and a length.
# The CIGAR operations are represented by their codes, such as 0 for 'M', 4 for 'S', etc.
# The function sums the lengths of all operations except for soft clipping (S, op code 4) and hard clipping (H, op code 5).
# Soft and hard clipping operations indicate portions of the read that are not aligned to the reference.
# Therefore, these are excluded from the total length of the aligned part of the query sequence.
# The result is the total aligned length of the query sequence.
return sum(length for op, length in self.cigartuples if op != 4 and op != 5)
def update_cigarstring(self):
# This function updates the CIGAR string of the read based on its CIGAR tuples.
# A dictionary maps CIGAR operation codes to their string representations:
# M (code 0), I (code 1), D (code 2), N (code 3), S (code 4), H (code 5), P (code 6), '=' (code 7), X (code 8).
cigar_ops = {0: 'M', 1: 'I', 2: 'D', 3: 'N', 4: 'S', 5: 'H', 6: 'P', 7: '=', 8: 'X'}
# The CIGAR string is constructed by iterating over the CIGAR tuples.
# For each tuple, the length and corresponding string representation of the operation are concatenated.
# This results in the reconstructed CIGAR string that reflects the current state of the CIGAR tuples.
# The CIGAR string is then assigned to the cigarstring attribute of the MockRead object.
self.cigarstring = ''.join([f"{length}{cigar_ops[op]}" for op, length in self.cigartuples])
def get_new_counter():
return {
'Total N of reads':0,
'Number of unmapped reads':0,
'Reads in chr outside our ROI':0,
'TOTAL reverse reads' :0,
'Number of selected reverse reads':0,
'TOTAL forwards reads' :0,
'Number of selected forward reads':0,
'Anmapped due S, X or H': 0,
'I or D in F + primer': 0,
'I or D in F - primer': 0,
'Number of + Reads soft-clipped at first primer': 0,
'Number of + Reads soft-clipped at second primer': 0,
}
## Simples tests to understand and check simple scenarios ##
# read aligned to forward complement of the reference sequence
# testing function softclip_positive_read1
# Test function softclip_positive_read1 case by case
def test_softclip_positive_read1_simple_case():
# Initialize a counter dictionary
counter = get_new_counter()
# Create a mock read
read = MockRead(100, [(0, 100)], '100M') # Manually set CIGAR string to '100M'
primer_F_end = 110
primer_R_start = 300 # Becaouse I don't to test the second softclipping here
original_coordinate_start = read.reference_start
print(read.reference_start)
softclip_positive_read1(read, primer_F_end, primer_R_start, counter)
read.update_cigarstring()
print(read.reference_start)
updated_coordinate_start = read.reference_start
assert read.cigarstring == '10S90M', f"Unexpected CIGAR string: {read.cigarstring}"
assert read.reference_start == 110, f"Unexpected reference start: {read.reference_start}"
assert counter['Number of + Reads soft-clipped at first primer'] == 1, "Counter for soft-clipped reads at first primer is incorrect"
expected_length = read.infer_query_length()
actual_length = sum(length for op, length in read.cigartuples if op in [0, 1, 2, 7, 8]) # Summing lengths for M, I, D, =, X
assert expected_length == actual_length, f"Length mismatch: Expected {expected_length}, Actual {actual_length}"
assert updated_coordinate_start - original_coordinate_start == 10
At the end of the code, you can see the first test I have. The scenario is a 100-base long read fully mapped = 100M. The region I want to soft clip ends in position 110 so the first 10 bases should be sofclipped.
Everything is working fine apart from the read.reference_start. For a reason I don't know the test is not able to mimic good enough the read and the read.reference_star doesn't change. So I get an error in the assertion assert read.reference_start == 110, f"Unexpected reference start: {read.reference_start}"
However, in the function, I am testing, I have added a couple of print(read)
to see if the function is working and effectively, I am modifying the start_reference with my softclip_positive_read1() as I want. As shown below
BEFORE MN01972:51:000H5KYKL:1:11102:15603:12743 0 #1 208248339 60 126M * 0 0 AACCTTGG...array('B', [32, 37, 37, 37, 37, 37, 32, 37, 37, 37, 37, 37, 37, 37, 37, 32, 32, 37, 37, 37, 32, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 32, 14, 37, 37, 32, 37, 37, 37, 32, 32, 37, 37, 28, 37, 37, 37, 37, 37, 37, 14, 37, 37, 32, 37, 14, 37, 37, 37, 37, 37, 37, 37, 21, 14, 37, 37, 37, 37, 37, 37, 21, 14, 37, 37, 37, 32, 37, 32, 37, 37, 37, 37, 37, 32, 14, 32, 37, 14, 14, 14, 37, 32, 37, 37, 14, 37, 37, 37, 37, 37, 37, 32, 32, 37, 14, 37, 37, 37, 37, 37, 14, 28, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 32, 37]) [('MD', '126'), ('RG', '230824_MN01972_0051_A000H5KYKL_RACP1_4poolv7anddirect_A'), ('NM', 0), ('UQ', 0), ('AS', 126)]
25
102
AFTER MN01972:51:000H5KYKL:1:11102:15603:12743 0 #1 208248363 60 24S77M25S * 0 0 AACCTTGG... array('B', [32, 37, 37, 37, 37, 37, 32, 37, 37, 37, 37, 37, 37, 37, 37, 32, 32, 37, 37, 37, 32, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 32, 14, 37, 37, 32, 37, 37, 37, 32, 32, 37, 37, 28, 37, 37, 37, 37, 37, 37, 14, 37, 37, 32, 37, 14, 37, 37, 37, 37, 37, 37, 37, 21, 14, 37, 37, 37, 37, 37, 37, 21, 14, 37, 37, 37, 32, 37, 32, 37, 37, 37, 37, 37, 32, 14, 32, 37, 14, 14, 14, 37, 32, 37, 37, 14, 37, 37, 37, 37, 37, 37, 32, 32, 37, 14, 37, 37, 37, 37, 37, 14, 28, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 32, 37]) [('MD', '126'), ('RG', '230824_MN01972_0051_A000H5KYKL_RACP1_4poolv7anddirect_A'), ('NM', 0), ('UQ', 0), ('AS', 126)]
In conclusion, I assume the error is how the mock read is done. What I am doing wrong??
it is a little hard to follow, and I am not sure if this is the problem or not, when it comes to deletions and insertions I think one of them should be skipped when adding up the query length,
depending on what length is being computed, if it is an alignment length, then the insertion is not part of the aligned region length, if it is the query length then the deletion is not present in the query
so beyond the
S
andH
you have to properly account forD
andI