Samtools mpileup - I'm getting different number of base calls compared to number of quality scores
0
0
Entering edit mode
3.1 years ago

Edit: Summary: In the mpileup file, I have different number of quality scores than base calls. For example, I might have 342 reads covering a position, but there will be 340 base calls. This makes parsing the mpileup file difficult

Edit: I should note that the mpileup I'm using was generated with samtools mpileup (version 1.13) and not bcftools mpileup. samtools was giving me all the data from the reads, whereas bcftools was not. So I decided to go with samtools

Hi,

I'm trying to do some variant calling using the mpileup generated from samtools. After reading the documentation for samtools mpileup (http://www.htslib.org/doc/samtools-mpileup.html), I wrote the following R code (see bottom) to try to parse the file. This takes in a reference base, the base calls, and the quality scores and returns all possible mutations. For example, one of the lines is

chrM    108     A       373     ......$.............,.......,,,,,,...,..,,,,,,..................,....,,....,,,....,,,..................................................,,,,,,...........................................,,....................,...................,,,,,,.......,,.................,,,,,,.,.................,.....,,CC.C..,,,,,,,,,,,.,,,....................,,,..,,.,,...................,,,,,,,^].^].^].^].^].^],      AAJJF:JJssJs<JsJJJsJJaJJskjsJJJsJJJJossssJJJJJssJJJJJFoeFsAJsFsssJsJsJJsJJsssJsJssssJJJJJjsJJJJJJJJJJJ<JJJJJJFJJJJJJJJJJJFkJJsssJFJJJJsossssJJJssJJJeJJJJJsJJJJFJJJJJJJJJJJFJJjJJJJJJJssJJJJJJJJJJJJFJJJFJsJJsJJJFJsJJJJJJJJJJJJJsAssssJJJJJssssJJJJjAesJsJJJJJJJsssssoksskJJJsFJJJJJkJJJJsJJJsJsoJJJJJJoossssssssssssoJF<FJJJFssJJJJFJJJJJssssJssJkB;;;;;;;;;;;UU;;;;;;UUUUUUUYY===Y

And after running my function, I get that there are 3 reads that call a mutation at this position

Calls  Quality
C      74
C      74
C      74

They all call the same mutation. This works just fine. There are 373 base calls in the above string, and there are 373 quality scores.

However, here's an example that doesn't work in the mpileup file

chrM    105     T       340     .$.$.$.$.................,.......,,,,,,..,..,,,,,,..................,....,,.....,,,....,,,...-1A..............................................,,,,,,...................................-1A........,,....................,...................,,,,,,.......,,.................,,,,,,.,.................,.....,,......,,,,,,,,,,,.,,,.....................,,,..,,  EEEEJFFJFFssJsJJsJJJsJJFJJsJssjJJsJJJsssVsJJJJJsfJJJ7JFsfAs<JsJsssJsJsJJsFJ<sssFsJsossJJFJJJ7sJJJJJJJJJJ<JFJJ<FJ7JJJ7JJJJJF<JJsssJJJJJFssssssJJFssJJJoJJJJJsJJJJJFJJJJJJFJJJ7JFJeJJJJJJssJJFJJJ<FJJJFJJJJJFsJJsFJJJJsJJJJFJJJJJJJJoJssssJJJJJsssoJFJJJJVsJsJJJJJJJsssssoJssJJAJsFJJJJJJJJJJsJJJsJssJJJJJJossssssossossosJF7FAAJFsoFFFFFFFFFAFooooFoo

Here, there are 340 quality scores. But there are 342 base calls. You can count them. There are 350 characters in

.$.$.$.$.................,.......,,,,,,..,..,,,,,,..................,....,,.....,,,....,,,...-1A..............................................,,,,,,...................................-1A........,,....................,...................,,,,,,.......,,.................,,,,,,.,.................,.....,,......,,,,,,,,,,,.,,,.....................,,,..,,

If we combine them according to the documentation, here are 4 .$, 2 -1A, 69 , and 267 ., for a total of 342 base calls. So there's a discrepancy between the number of base calls and the number of quality scores. And this happens quite frequently in this mpileup file. I have no idea how to assign the quality scores to the base calls when they differ like this.

Basically, for each position, I'd like to get ALL POSSIBLE mutations. If there's a tool that can do this instead, I would appreciate some pointers. As far as I can tell, bcftools tools does its own internal filtering and doesn't give all mutations at a position. So I had to resort to writing my own R function to parse the file. Here's the function

interpret_mpileup <- function(refbase, s, q) {
  # Split the characters in the calls and recombine them according to the documentation
  calls <- strsplit(s, "")[[1]]
  vec <- c()
  curCall <- c()
  curChar <- c()
  prevChar <- c()
  i <- 1
  while(i <= length(calls)) {
    prevChar <- curChar
    curChar <- calls[i]
    if (curChar == "$") { # If it's a $, append to the previous call
      vecLen <- length(vec)
      vec[vecLen] <- paste0(vec[vecLen], "$")
      i <- i + 1
    } else if (curChar %in% c(".", ",", ">", "<", "*", "a", "c", "g", "t", "A", "C", "G", "T")) { # If it's one of these characters, create a new call
      vec <- c(vec, curChar)
      curCall <- curChar
      i <- i + 1
    } else if (curChar == "^") { # If it's a ^, this call will include the current and following two characters
      curCall <- paste0(curChar, calls[i+1], calls[i+2])
      vec <- c(vec, curCall)
      i <- i + 3
    } else if ((curChar == "+") || (curChar == "-")) { # If it's a + or -, take as many of the following characters as long as they fall into the list below. This handles variable-sized indel calls.
      curCall <- curChar
      i <- i + 1
      while(calls[i] %in% c("A","C","G","T","a","c","g","t","*","#",as.character(0:9))) {
        curCall <- paste0(curCall, calls[i])
        i <- i + 1
      }
      vec <- c(vec, curCall)
    } else { # For debugging, print this when some other case is encountered
      i <- i + 1
      print(paste(curChar, "undefined"))
    }
  }
  quals <- strsplit(q, "")[[1]] # Quality scores are single character. Split the score string by character
  retDF <- data.frame(Calls = vec, Quality = quals)

  # We're not interested in base calls that agree with the reference (i.e. . and ,). We're not
  # interested in introns (< and >)
  retDF <- retDF[-which(retDF$Calls %in% c(".", ",", "<", ">")),]
  # Remove any calls that indicate the beginning of a read
  inds_readBeginning <- grep("\\^.[\\.,]", retDF$Calls)
  if (length(inds_readBeginning) > 0) {
    retDF <- retDF[-inds_readBeginning,]
  }
  # Remove any calls that indicate the end of a read
  inds_readEndings <- grep("[\\.,]\\$", retDF$Calls)
  if (length(inds_readEndings) > 0) {
    retDF <- retDF[-inds_readEndings,]
  }
  # Convert the quality scores to numerical values
  retDF$Quality <- sapply(retDF$Quality, utf8ToInt)
  return(retDF)
}

And this function would be called with something like interpret_mpileup("T", "...$,,A+1CCG", "JJJJJJJ") to give a result of

Calls  Quality
A      74
+1CCG  74

So if someone can tell me where I'm going wrong in my code resulting in the discrepancy, I would really really appreciate it.

Thank you

samtools R RNA-Seq mpileup Variant-Calling • 1.4k views
ADD COMMENT
0
Entering edit mode

I think your maths is wrong.

4 "$" and 2 "-1A" is 4+2*3 = 10 characters to remove from the sequence line, reducing it from 350 to 340 chars.

Note that a modern samtools has --no-output-ends to remove ^char and $ sumbols, and --no-output-ins --no-output-ins --no-output-del --no-output-del to totally remove indel markups. That should get you exactly the same number of chars for both without needing complex parsing to remove these fields (unless of course you're trying to use them for your variant calling).

ADD REPLY

Login before adding your answer.

Traffic: 2888 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6