Question

Position Of Mismatches Per Read From A Sam/Bam File

10

Entering edit mode

12.3 years ago

Arun 2.4k

Hello all,

From a bam (or sam) file, by looking at the MD:Z field, we could identify the position of mismatches. For example, if the cigar string is 45M30N35M with the mismatches occuring at positions 5(A->T) and 65(C->G), then the MD flag would be MD:Z:4A59C15 to reflect the mismatched bases at 5th and 65th position. Of course it gets a bit complicated if there are consecutive mismatches and/or deletions followed by mismatches. If you're interested, this post explains it very well.

What I am interested in is, given the cigar string and MD:Z tag, to obtain the position of mismatches in a vector. In this case it would be 5 and 65. I could implement it myself, but I am half-minded about it (due to time restrictions) and was wondering if any of the already existing R-packages (like GenomicRanges and such) have a way of obtaining this info directly. Are there any packages anyone is aware of?

Thank you very much in advance. And I wish the biostars forum a Merry Christmas and a very happy new year!

rna-seq sam bam • 26k views

ADD COMMENT • link 12.3 years ago by Arun 2.4k

score 2 · Answer 1 · 2012-12-21

UPDATED: moved to https://github.com/lindenb/jvarkit/wiki/Biostar59647

The following code should do the job:

https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/biostar/Biostar59647.java

Compilation:

 javac -cp path/to/sam.jar:/path/to/picard.jar Biostar59647.java

Execute:

 java -cp path/to/sam.jar:/path/to/picard.jar:. Biostar59647  \
           ~/samtools-0.1.18/examples/toy.fa \
           ~/samtools-0.1.18/examples/toy.bam \
            ref:9-20 | xmllint --format -

Result:

<?xml version="1.0" encoding="UTF-8"?>
<biostar59647 ref="/home/lindenb/samtools-0.1.18/examples/toy.fa" bam="/home/lindenb/samtools-0.1.18/examples/toy.bam">
  <interval chrom="ref" start="9" end="20">
    <read>
      <name>r001</name>
      <sequence>TTAGATAAAGAGGATACTG</sequence>
      <flags>163</flags>
      <chrom>ref</chrom>
      <pos>7</pos>
      <cigar>8M4I4M1D3M</cigar>
      <align>
        <M read-index="1" read-base="T" ref-index="7"/>
        <M read-index="2" read-base="T" ref-index="8"/>
        <M read-index="3" read-base="A" ref-index="9" ref-base="A"/>
        <M read-index="4" read-base="G" ref-index="10" ref-base="G"/>
        <M read-index="5" read-base="A" ref-index="11" ref-base="A"/>
        <M read-index="6" read-base="T" ref-index="12" ref-base="T"/>
        <M read-index="7" read-base="A" ref-index="13" ref-base="A"/>
        <M read-index="8" read-base="A" ref-index="14" ref-base="A"/>
        <I read-index="9" read-base="A"/>
        <I read-index="10" read-base="G"/>
        <I read-index="11" read-base="A"/>
        <I read-index="12" read-base="G"/>
        <M read-index="13" read-base="G" ref-index="15" ref-base="G"/>
        <M read-index="14" read-base="A" ref-index="16" ref-base="A"/>
        <M read-index="15" read-base="T" ref-index="17" ref-base="T"/>
        <M read-index="16" read-base="A" ref-index="18" ref-base="A"/>
        <D ref-index="19" ref-base="G"/>
        <M read-index="17" read-base="C" ref-index="20" ref-base="C"/>
        <M read-index="18" read-base="T" ref-index="21"/>
        <M read-index="19" read-base="G" ref-index="22"/>
      </align>
    </read>
    <read>
      <name>r002</name>
      <sequence>AAAAGATAAGGGATAAA</sequence>
      <flags>0</flags>
      <chrom>ref</chrom>
      <pos>9</pos>
      <cigar>1S2I6M1P1I1P1I4M2I</cigar>
      <align>
        <S read-index="1" read-base="A"/>
        <I read-index="2" read-base="A"/>
        <I read-index="3" read-base="A"/>
        <M read-index="4" read-base="A" ref-index="9" ref-base="A"/>
        <M read-index="5" read-base="G" ref-index="10" ref-base="G"/>
        <M read-index="6" read-base="A" ref-index="11" ref-base="A"/>
        <M read-index="7" read-base="T" ref-index="12" ref-base="T"/>
        <M read-index="8" read-base="A" ref-index="13" ref-base="A"/>
        <M read-index="9" read-base="A" ref-index="14" ref-base="A"/>
        <I read-index="10" read-base="G"/>
        <I read-index="11" read-base="G"/>
        <M read-index="12" read-base="G" ref-index="15" ref-base="G"/>
        <M read-index="13" read-base="A" ref-index="16" ref-base="A"/>
        <M read-index="14" read-base="T" ref-index="17" ref-base="T"/>
        <M read-index="15" read-base="A" ref-index="18" ref-base="A"/>
        <I read-index="16" read-base="A"/>
        <I read-index="17" read-base="A"/>
      </align>
    </read>
    <read>
      <name>r003</name>
      <sequence>AGCTAA</sequence>
      <flags>0</flags>
      <chrom>ref</chrom>
      <pos>9</pos>
      <cigar>5H6M</cigar>
      <align>
        <M read-index="1" read-base="A" ref-index="9" ref-base="A"/>
        <M read-index="2" read-base="G" ref-index="10" ref-base="G"/>
        <M read-index="3" read-base="C" ref-index="11" ref-base="A" mismatch="true"/>
        <M read-index="4" read-base="T" ref-index="12" ref-base="T"/>
        <M read-index="5" read-base="A" ref-index="13" ref-base="A"/>
        <M read-index="6" read-base="A" ref-index="14" ref-base="A"/>
      </align>
    </read>
    <read>
      <name>r004</name>
      <sequence>ATAGCTCTCAGC</sequence>
      <fl

score 0 · Answer 2 · 2012-12-23

Thanks to Pierre for the code. However, I wanted the implementation in R as it will be part of a pipeline. Assuming that you extracted the bam file with NM and MD:Z flags, let's assume for a read, NM = 4 and MD flag = 0G15^GAC0T60T4^AA0C0 (this MD flag has all the combinations I could think of), the code that accomplishes the task of finding the position of mismatches is,

nm      <- 4
md      <- "0G15^GAC0T60T4^AA0C0"
# First crucial step is to split the md string at each substitution/deletion operation
md.gsub <- gsub("([\\^]*[ACGT]+)[0]*", " \\1 ", md)
# split each operation using strsplit
md.spl  <- strsplit(md.gsub, "[ ]+")[[1]]
this    <- as.integer()

# since its a relatively time-consuming operation
# use NM flag to calculate position of mismatches
# only if there are mismatches
if (nm != 0) {
    this <- lapply(md.spl, function(y) { 
        if (!is.na(as.numeric(y))) {
            o <- rep("M", as.numeric(y))
        } else if( length(grep("\\^", y)) > 0) {
            o <- rep("D", nchar(y) - 1)
        } else if (nchar(y) == 1) {
            o <- rep("MM", 1)
        }
    })
    this <- do.call(c, this)
    # after this step, we have a vector of length = 
    # read length and each position telling if its a 
    # match(M), mismatch(MM) or deletion (D)
    this <- which(this == "MM")
}
this
# [1]  1 20 81 88

In case you want to use this on a set of reads for a bam file loaded in R, then you can wrap this around with sapply on each MD tag.

Note: Since we are coercing characters as well to numeric data type with as.numeric, there will be warnings when you run this code. Its not an issue. That's what we expect.