Hi, I'm editing the header, RNAME, POS, and PNEXT of a bam file so that I can transfer read mappings on individual CDS's directly to the complete genome containing those CDS's. To do that, I am using pysam to iterate through a bam file alignment-by-alignment and then building a new alignment to be written to a new bam file. The python script (below) runs fine, and produces an output bam file. However, when I try to view the bamfile with samtools, samtools gets to the 5351 line just fine, and then it says that there is a segmentation fault. I can't figure out why or what I've done wrong in my python script to make a faulty bam file. Any help would be greatly appreciated! Thank you in advance!
Here is the output bam file that causes the problem (trimmed to the 5352 line to reduce size and still produces the error): out.bam
Here is the output from a gdb backtrace (gdb samtools; run view out.bam; bt
). I have no idea how to use this information (would love to learn how!):
Program received signal SIGSEGV, Segmentation fault.
strlen () at ../sysdeps/x86_64/strlen.S:106
106 movdqu (%rax), %xmm4
(gdb) bt
0 strlen () at ../sysdeps/x86_64/strlen.S:106
1 0x0000000000422733 in kputs (s=0x7fffffffc420, p=0x0) at kstring.h:61
2 bam_format1_core (header=0x67a2e0, b=b@entry=0x67a430, of=<optimized out>) at bam.c:278
3 0x0000000000427fde in samwrite (fp=fp@entry=0x67a2c0, b=b@entry=0x67a430) at sam.c:137
4 0x0000000000408745 in main_samview (argc=2, argv=0x7fffffffcac0) at sam_view.c:208
5 0x00007ffff70c9731 in __libc_start_main (main=0x402660 <main>, argc=3, argv=0x7fffffffcab8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffcaa8) at ../csu/libc-start.c:289
6 0x0000000000402c2d in _start ()
Here is the python script that I use to produce the output bam file:
from Bio import SeqIO
import pysam, time
bamIn = pysam.AlignmentFile("test.bam", "rb")
gb = SeqIO.parse("test.gb", "genbank")
gbInfo = []
header = bamIn.header
header['SQ'] = []
recCount = 0
for rec in gb:
header['SQ'].append({'LN': len(rec), 'SN': rec.id})
for feat in rec.features:
if (feat.type == "CDS" and "pseudo" not in feat.qualifiers.keys()):
if feat.strand == 1:
featInfo = [recCount,int(feat.location.start) + 2]
elif feat.strand == -1:
featInfo = [recCount,int(feat.location.end) + 2]
gbInfo.append(featInfo)
recCount = recCount + 1
with pysam.AlignmentFile("out.sam", "w", header = header) as bamOut:
for aln in bamIn:
alnOut = aln
feat_num = aln.reference_id
alnOut.reference_id = gbInfo[feat_num][0]
alnOut.reference_start = aln.reference_start + gbInfo[feat_num][1]
alnOut.pnext = aln.pnext + gbInfo[feat_num][1]
bamOut.write(alnOut)
bamIn.close()
Thanks to John Marshall's help with interpreting the samtools segmentation fault, I was able to fix the error. Here is the code that I ended up using to put reads mapped to individual bacterial/phage genes from a multifasta file (a transcriptome bam file from kallisto) onto the bacterial/phage genome fastas from which the genes were taken (this is reinventing the wheel in a sense, since rsem-tbam2gbam and tophat - Transcriptomic Coordinate Bam To Genomic Coordinate Bam - already have utilities to do this, and is limited to bacterial CDS's):