I am writing my first program that uses the SAMtools API, and I am running into an error. I have created a simple example program that isolates the error, which is shown below. Essentially, when I run the program on a .sam file generated by bwa aln and bwa samse, I get a warning message about a missing CIGAR sequence, and then an error message about the sequence and quality being inconsistent.
Is this a problem with my code or with the data?
Here is the stripped-down program....
#include <stdio.h>
#include "sam.h"
int main(int argc, char **argv)
{
samfile_t *sam = samopen(argv[1], "r", NULL);
if(sam == NULL)
{
fprintf(stderr, "file open error: '%s'\n", argv[1]);
return 1;
}
bam_header_t *header = sam_header_read(sam->x.tamr);
if(header == NULL)
{
fprintf(stderr, "SAM header parse error\n");
return 1;
}
bam1_t *alignment = bam_init1();
int bytesread;
while((bytesread = samread(sam, alignment)) > 0)
{
if(alignment->core.flag & 4)
{
// unmapped; ignore
continue;
}
char *seqid = header->target_name[alignment->core.tid];
unsigned long left = alignment->core.pos + 1;
unsigned long right = bam_calend(&alignment->core, bam1_cigar(alignment)) + 1;
char strand = (alignment->core.flag & 16) ? '+' : '-';
printf("%s[%lu, %lu]%c\n", seqid, left, right, strand);
}
bam_header_destroy(header);
bam_destroy1(alignment);
samclose(sam);
return 0;
}
....the first few lines of the .sam file...
@SQ SN:chr1 LN:249250621
HWUSI-EAS566_0006:1:1:1018:13259#0 0 chr1 162546618 37 26M * 0 0 GGCAGGGGACAAGACTCGGCGTTGCA e\cT\`bbT\]Y`Y^^a^^c]\L_]` XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:26
HWUSI-EAS566_0006:1:1:1019:11220#0 4 * 0 0 * * 0 0 AGAAAGAGCTCATGCAGCGTCTTAAG ab\a\b^ccacaaaL]]]]Ra^^^^a
HWUSI-EAS566_0006:1:1:1025:16447#0 16 chr1 183604878 25 26M * 0 0 ACTGATTCCCTTCCCTCTTCCGCTCT \_Y]\Y^^L^^Xa`a`^_JacL\_b^ XT:A:U NM:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:0C7A17
HWUSI-EAS566_0006:1:1:1026:12890#0 4 * 0 0 * * 0 0 TGGCTAAGAGGGAGTGGGTGTTGCGG cccc`T]`Y`\\O]Wb\b_b^caLc\
HWUSI-EAS566_0006:1:1:1027:12527#0 4 * 0 0 * * 0 0 CCTTCTCTCTTCCTCGGCGCTGCCTA dbd`dad^bda^^b\^acaaY__T\`
HWUSI-EAS566_0006:1:1:1028:10989#0 4 * 0 0 * * 0 0 AGTTCCGGCGCAGCCGGGATTTGGGT \ddddccLacddcdadd`[_Y]__]^
HWUSI-EAS566_0006:1:1:1031:18950#0 4 * 0 0 * * 0 0 GTAATGGCTTCAAGGACTATCAATGC cdYLccTbcb`^\bTHU]_Qbb\T`\
HWUSI-EAS566_0006:1:1:1037:2137#0 4 * 0 0 * * 0 0 GGCTTGCTAAGCAGAGGCCGGAAGCG eYeeea`^bb_\aY\_d_daY^^^^_
HWUSI-EAS566_0006:1:1:1037:19136#0 4 * 0 0 * * 0 0 ATATCTGCTTCCGACACAGCTGCAAT aT\YT_Yb_Y\aa``TY`]YT^\Y`a
...and the error that gets generated.
[standage@lappy ~] gcc -Wall -I /path/to/samtools/ -L /path/to/samtools/ -o test test.c -lbam -lz
[standage@lappy ~] ./test test.sam
[samopen] SAM header is present: 1 sequences.
[sam_read1] reference '162546618' is recognized as '*'.
Parse warning at line 2: mapped sequence without CIGAR
Parse error at line 2: sequence and quality are inconsistent
Abort trap: 6
[standage@lappy ~]
compile with '-g', and run the GNU debuger
The sam_read1 message seems to indicate that '162546618' is being interpreted as the reference sequence name, when it's clear that this is the position of the alignment.
Maybe using the BAMtools(c++) or Picard (Java) API is a bit more user-friendly (and documented).