C Api For Bam: Iterating Over A Cigar String.
2
4
Entering edit mode
14.0 years ago

Hi, this question is somehow related to this previous question. I'm playing with the C API for BAM and I wrote the following code: https://gist.github.com/736059

I'm now trying to find the indexes of the genomic bases covered by the CIGAR string (my final goal is to create a WIG file containing the coverage of the genome)

What would be the correct way to change my code to get the genomic indexes of the bases covered by each CIGAR element ?

(...)
for( k=0;k< b->core.n_cigar;++k)
{
int cop =cigar[k] & BAM_CIGAR_MASK; // operation
int cl = cigar[k] >> BAM_CIGAR_SHIFT; // length
switch(cop)
 {
 case BAM_CMATCH: printf("M");break;
 case BAM_CINS: printf("I");break;
 case BAM_CDEL: printf("D");break;
 case BAM_CREF_SKIP: printf("N"); break;
 case BAM_CSOFT_CLIP: printf("S");break;
 case BAM_CHARD_CLIP: printf("R");break;
 case BAM_CPAD: printf("P");break;
 default:printf("?");break;
 }
printf("%d",cl);
}
(...)

Thanks

cigar bam wiggle coverage • 6.2k views
ADD COMMENT
2
Entering edit mode

Have you looked at bamtools: https://github.com/pezmaster31/bamtools ? It provides a very clean C++ BAM API.

ADD REPLY
0
Entering edit mode

(That noted, it won't help make this any easier.)

ADD REPLY
0
Entering edit mode

Yes thanks, I saw it, but I want to learn the 'official' API and how the data should be handled

ADD REPLY
4
Entering edit mode
14.0 years ago
lh3 33k

For example

int depth[REF_LEN], x, j, k;
uint32_t *cigar = bam1_cigar(b);
for (k = 0, x = b->core.pos; k < b->core.n_cigar; ++k) {
  int op = cigar[k]&16;
  int l = cigar[k]>>4;
  if (op == BAM_CMATCH) {
    for (j = x; j < x + l; ++j) ++depth[j];
    x += l;
  } else if (op == BAM_CREF_SKIP || op == BAM_CDEL) x += l;
}

Something like this...

ADD COMMENT
0
Entering edit mode

While this does work for an unsorted file, on a sorted file I think a progressive approach (ie. one that does not build a genome size array) would be preferable.

ADD REPLY
0
Entering edit mode

That calDepth.c only aims to exemplify how to use pileup APIs. For depth only, the code block above is the fastest and easiest. Given a sorted file, you can allocate an array per chromosome and then free it. The peak memory is only about 1GB.

ADD REPLY
1
Entering edit mode
14.0 years ago

Short answer, see samtools/examples/calDepth.c for a pileup based solution. The CIGAR is resolved within the pileup, avoiding the need to walk the cigar string yourself.

Still, to answer the direct question - 'get the genomic indexes of the bases covered by each CIGAR element', I would implement that switch statement as:

/* NB: Closed ranges */
switch (cop) {
 case BAM_CMATCH:
      printf("M");
      printf("[%d-%d]", pos, pos + cl - 1);
      pos+=cl;
      break;

 case BAM_CHARD_CLIP:
      printf("H");
      /* printf("[%d]", pos);  // No coverage */
      /* pos is not advanced by this operation */
      break;

 case BAM_CSOFT_CLIP:
      printf("S");
      /* printf("[%d]", pos);  // No coverage */
      /* pos is not advanced by this operation */
      break;

 case BAM_CDEL:
      printf("D");
      /* printf("[%d-%d]", pos, pos + cl - 1);  // Spans positions, No Coverage */
      pos+=cl;
      break;

 case BAM_CPAD:
      printf("P");
      /* printf("[%d-%d]", pos, pos + cl - 1); // Spans positions, No Coverage */
      pos+=cl;
      break;

 case BAM_CINS:
      printf("I");
      /* printf("[%d]", pos); // Special case - adds <cl> bp "throughput", but not genomic position "coverage" */
      /* How you handle this is application dependent */
      /* pos is not advanced by this operation */
      break;

 case BAM_CREF_SKIP:
      printf("S");
      /* printf("[%d-%d]", pos, pos + cl - 1); /* Spans positions, No Coverage */
      pos+=cl;
      break;

 default:
      fprintf(stderr, "Unhandled cigar_op %d:%d\n", cop, cl);
      printf("?");
}
ADD COMMENT
0
Entering edit mode

That's what I was looking for ! Many thanks !

ADD REPLY
0
Entering edit mode

James Bonfield also posted an alternative pileup implementation (with a usable switch/case block), but with some subtly different edge cases, to the mailing list recently. It seems to be omitted from the archives, but was called sam_depth.c.gz, and was the first message in this thread: http://sourceforge.net/mailarchive/message.php?msg_id=26502507

ADD REPLY

Login before adding your answer.

Traffic: 1659 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6