Question

Need help interpreting gdb backtrace from a SAMtools segmentation fault (or help editing SAM/BAM files!)

0

Entering edit mode

8.0 years ago

russianconcussion ▴ 30

Hi, I'm editing the header, RNAME, POS, and PNEXT of a bam file so that I can transfer read mappings on individual CDS's directly to the complete genome containing those CDS's. To do that, I am using pysam to iterate through a bam file alignment-by-alignment and then building a new alignment to be written to a new bam file. The python script (below) runs fine, and produces an output bam file. However, when I try to view the bamfile with samtools, samtools gets to the 5351 line just fine, and then it says that there is a segmentation fault. I can't figure out why or what I've done wrong in my python script to make a faulty bam file. Any help would be greatly appreciated! Thank you in advance!

Here is the output bam file that causes the problem (trimmed to the 5352 line to reduce size and still produces the error): out.bam

Here is the output from a gdb backtrace (gdb samtools; run view out.bam; bt). I have no idea how to use this information (would love to learn how!):

Program received signal SIGSEGV, Segmentation fault.
strlen () at ../sysdeps/x86_64/strlen.S:106
106     movdqu  (%rax), %xmm4
(gdb) bt
0  strlen () at ../sysdeps/x86_64/strlen.S:106

1  0x0000000000422733 in kputs (s=0x7fffffffc420, p=0x0) at kstring.h:61

2  bam_format1_core (header=0x67a2e0, b=b@entry=0x67a430, of=<optimized out>) at bam.c:278

3  0x0000000000427fde in samwrite (fp=fp@entry=0x67a2c0, b=b@entry=0x67a430) at sam.c:137

4  0x0000000000408745 in main_samview (argc=2, argv=0x7fffffffcac0) at sam_view.c:208

5  0x00007ffff70c9731 in __libc_start_main (main=0x402660 <main>, argc=3, argv=0x7fffffffcab8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffcaa8) at ../csu/libc-start.c:289

6  0x0000000000402c2d in _start ()

Here is the python script that I use to produce the output bam file:

from Bio import SeqIO
import pysam, time

bamIn = pysam.AlignmentFile("test.bam", "rb")

gb = SeqIO.parse("test.gb", "genbank")

gbInfo = []

header = bamIn.header
header['SQ'] = []
recCount = 0

for rec in gb:
    header['SQ'].append({'LN': len(rec), 'SN': rec.id})
    for feat in rec.features:
        if (feat.type == "CDS" and "pseudo" not in feat.qualifiers.keys()):
            if feat.strand == 1:            
                featInfo = [recCount,int(feat.location.start) + 2]
            elif feat.strand == -1:
                featInfo = [recCount,int(feat.location.end) + 2]
            gbInfo.append(featInfo)
    recCount = recCount + 1

with pysam.AlignmentFile("out.sam", "w", header = header) as bamOut:
    for aln in bamIn:
        alnOut = aln
        feat_num = aln.reference_id
        alnOut.reference_id = gbInfo[feat_num][0]
        alnOut.reference_start = aln.reference_start + gbInfo[feat_num][1]
        alnOut.pnext = aln.pnext + gbInfo[feat_num][1]
        bamOut.write(alnOut)

bamIn.close()

bam samtools pysam • 2.5k views

ADD COMMENT • link 8.0 years ago by russianconcussion ▴ 30

0

Entering edit mode

Thanks to John Marshall's help with interpreting the samtools segmentation fault, I was able to fix the error. Here is the code that I ended up using to put reads mapped to individual bacterial/phage genes from a multifasta file (a transcriptome bam file from kallisto) onto the bacterial/phage genome fastas from which the genes were taken (this is reinventing the wheel in a sense, since rsem-tbam2gbam and tophat - Transcriptomic Coordinate Bam To Genomic Coordinate Bam - already have utilities to do this, and is limited to bacterial CDS's):

from sys import argv

from Bio import SeqIO
from Bio.Seq import Seq

import pysam

gbfile = argv[1]
infile = argv[2]
outfile = argv[3]

bamIn = pysam.AlignmentFile(infile, "rb")

gb = SeqIO.parse(gbfile, "genbank")

gbInfo = []

header = bamIn.header
header['SQ'] = []
recCount = 0

#save information from the genbank file that is needed for modifying the BAM file
for rec in gb:
    header['SQ'].append({'LN': len(rec), 'SN': rec.id})
    for feat in rec.features:
        if (feat.type == "CDS" and "pseudo" not in feat.qualifiers.keys()):
            if feat.strand == 1:
                featInfo = [recCount, int(feat.location.start), feat.strand]
            if feat.strand == -1:
                featInfo = [recCount, int(feat.location.end), feat.strand]
            gbInfo.append(featInfo)
    recCount = recCount + 1


with pysam.AlignmentFile(outfile, "wb", header = header) as bamOut:
    for aln in bamIn:
        if( not(aln.is_unmapped) ):
            #Use the existing alignment as a template and edit elements of the alignment as needed
            alnOut = aln
            #Reverse-complement the sequence, reposition the read, and reverse the CIGAR for reads mapping to features on the negative strand
            #Modify the read itself 1st
            if gbInfo[aln.reference_id][2] == -1:
                #if the read maps to a feature on the reverse strand, then reverse complement the read sequence
                sequence = Seq(aln.query_sequence)
                revcomp = str(sequence.reverse_complement())
                alnOut.query_sequence = revcomp
                #adjust the leftmost position of the read if the beginning of the read is deleted or hard/soft clipped
                if aln.cigartuples[0][0] in [2,4,5]:
                    alnOut.reference_start = gbInfo[aln.reference_id][1] - aln.reference_start - len(aln.query_sequence) + aln.cigartuples[0][1]
                else:
                    alnOut.reference_start = gbInfo[aln.reference_id][1] - aln.reference_start - len(aln.query_sequence)
                #if the read maps to a feature on the reverse strand, then reverse the CIGAR code
                alnOut.cigartuples = alnOut.cigartuples[::-1]
                #change the sign on the insert length to indicate that the reads have swapped order
                alnOut.template_length = -1*aln.template_length
                #reverse the order of the quality string
                if alnOut.query_alignment_qualities:
                    alnOut.query_alignment_qualities[::-1]
            #Modify the paired read 2nd
            if gbInfo[aln.next_reference_id][2] == -1:
                alnOut.next_reference_start = gbInfo[aln.reference_id][1] - aln.reference_start - aln.template_length
            #Reposition reads mapping to features on the positive strand
            if gbInfo[aln.reference_id][2] == 1:
                alnOut.reference_start = gbInfo[aln.reference_id][1] + aln.reference_start
            if gbInfo[aln.next_reference_id][2] == 1:
                alnOut.next_reference_start = gbInfo[aln.reference_id][1] + aln.next_reference_start
            #Reassign the reference sequences
            alnOut.reference_id = gbInfo[aln.reference_id][0]
            alnOut.next_reference_id = gbInfo[aln.next_reference_id][0]
            bamOut.write(alnOut)

bamIn.close()

ADD REPLY • link 8.0 years ago by russianconcussion ▴ 30

score 3 · Accepted Answer · 2016-12-08

I'm sure there are some GDB tutorials around that will help you learn to use this debugger…

The interesting thing here is to examine the record that samwrite is trying to print out. The GDB commands up / up / print *b show

$1 = {core = {tid = 0, pos = 15852, bin = 4681, qual = 255, l_qname = 40, 
    flag = 163, n_cigar = 2, l_qseq = 100, mtid = 2, mpos = 15942, 
    isize = 259}, l_aux = 4, data_len = 202, m_data = 256, 
  data = 0x683460 "K00114:371:HCGVNBBXX:6:1109:10927:10159"}

which gives us the QNAME of the read in question. Examination of the interesting crashing source code line (bam.c:278, in bam_format1_core) shows us that the problem occurs printing out the RNEXT (MATE_RNAME) field. This is because your BAM file has only two @SQ headers, but mtid for this record is 2. Valid values would be 0 or 1.

So I guess your script is not being careful enough in setting up this field when you edit the records. Or your output @SQ headers do not correctly reflect the intended chromosomes in your output file.

The particular source code line numbers indicate that you are using samtools 0.1.18, which is prehistoric at this point. Current samtools checks that BAM files' tid and mtid fields (i.e. RNAME and RNEXT) are valid, and in this scenario produces an error message (admittedly a somewhat obscure message) rather than crashing.