I have extracted from a bam file one read and unclipped the CIGAR string. I have done this with this code
import pysam
# Read IDs from file into a set
ids_to_extract = set()
with open("reads_names.txt", "r") as file:
for line in file:
ids_to_extract.add(line.strip())
# Open the original BAM file and a new file for output
infile = pysam.AlignmentFile("RACP1_4poolv7anddirect_A_FINAL_SORTED.bam", "rb") # Use "rb" for reading a BAM file
outfile = pysam.AlignmentFile("unclipped_simulated_reads.bam", "wb", template=infile)
for read in infile:
if read.query_name in ids_to_extract:
# Get the original CIGAR tuples (operation, length)
original_cigar_tuples = read.cigartuples
if original_cigar_tuples:
# Sum the lengths of match and soft-clip segments
total_match_length = sum(length for operation, length in original_cigar_tuples if operation in [0, 4]) # 0 for 'M', 4 for 'S'
# Other operations (insertions, deletions, etc.) are ignored in this sum
new_cigar_tuples = [(0, total_match_length)] # Replace with a single 'M' operation
# Set the modified CIGAR tuples to the read
read.cigartuples = new_cigar_tuples
# Write the modified read to the output file
outfile.write(read)
infile.close()
outfile.close()
These are the reads. First one is the original second is the extracted and unclipped read:
samtools view RACP1_4poolv7anddirect_A_FINAL_SORTED.bam | grep "MN01972:51:000H5KYKL:1:11104:19018:10292"
MN01972:51:000H5KYKL:1:11104:19018:10292 0 chr2 208248363 60 24S102M * 0 0 CAAAATCACATTATTGCCAACATGACTTACTTGATCCCCATAAGCATGACGACCTATGATGATAGGTTTTACCCATCCACTCACAAGCCGGGGGATATTTTTGCAGATAATGGCTTCTCTGAAGAC AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF/FFFFFFFFFFFFFFFF RG:Z:230824_MN01972_0051_A000H5KYKL_RACP1_4poolv7anddirect_A UQ:i:0 AS:i:126
(RACP) [mdb1c20@cyan52 simulation_data]$ samtools view unclipped_simulated_reads.bam | grep "MN01972:51:000H5KYKL:1:11104:19018:10292"
MN01972:51:000H5KYKL:1:11104:19018:10292 0 chr2 208248363 60 126M * 0 0 CAAAATCACATTATTGCCAACATGACTTACTTGATCCCCATAAGCATGACGACCTATGATGATAGGTTTTACCCATCCACTCACAAGCCGGGGGATATTTTTGCAGATAATGGCTTCTCTGAAGAC AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF/FFFFFFFFFFFFFFFF RG:Z:230824_MN01972_0051_A000H5KYKL_RACP1_4poolv7anddirect_A UQ:i:0 AS:i:126
What I was expecting is to extract the read as seen in the original bam but when I open both bam in IGV, they differ.
Why??