Edit: Summary: In the mpileup file, I have different number of quality scores than base calls. For example, I might have 342 reads covering a position, but there will be 340 base calls. This makes parsing the mpileup file difficult
Edit: I should note that the mpileup I'm using was generated with samtools mpileup
(version 1.13) and not bcftools mpileup
. samtools was giving me all the data from the reads, whereas bcftools was not. So I decided to go with samtools
Hi,
I'm trying to do some variant calling using the mpileup generated from samtools. After reading the documentation for samtools mpileup (http://www.htslib.org/doc/samtools-mpileup.html), I wrote the following R code (see bottom) to try to parse the file. This takes in a reference base, the base calls, and the quality scores and returns all possible mutations. For example, one of the lines is
chrM 108 A 373 ......$.............,.......,,,,,,...,..,,,,,,..................,....,,....,,,....,,,..................................................,,,,,,...........................................,,....................,...................,,,,,,.......,,.................,,,,,,.,.................,.....,,CC.C..,,,,,,,,,,,.,,,....................,,,..,,.,,...................,,,,,,,^].^].^].^].^].^], AAJJF:JJssJs<JsJJJsJJaJJskjsJJJsJJJJossssJJJJJssJJJJJFoeFsAJsFsssJsJsJJsJJsssJsJssssJJJJJjsJJJJJJJJJJJ<JJJJJJFJJJJJJJJJJJFkJJsssJFJJJJsossssJJJssJJJeJJJJJsJJJJFJJJJJJJJJJJFJJjJJJJJJJssJJJJJJJJJJJJFJJJFJsJJsJJJFJsJJJJJJJJJJJJJsAssssJJJJJssssJJJJjAesJsJJJJJJJsssssoksskJJJsFJJJJJkJJJJsJJJsJsoJJJJJJoossssssssssssoJF<FJJJFssJJJJFJJJJJssssJssJkB;;;;;;;;;;;UU;;;;;;UUUUUUUYY===Y
And after running my function, I get that there are 3 reads that call a mutation at this position
Calls Quality
C 74
C 74
C 74
They all call the same mutation. This works just fine. There are 373 base calls in the above string, and there are 373 quality scores.
However, here's an example that doesn't work in the mpileup file
chrM 105 T 340 .$.$.$.$.................,.......,,,,,,..,..,,,,,,..................,....,,.....,,,....,,,...-1A..............................................,,,,,,...................................-1A........,,....................,...................,,,,,,.......,,.................,,,,,,.,.................,.....,,......,,,,,,,,,,,.,,,.....................,,,..,, EEEEJFFJFFssJsJJsJJJsJJFJJsJssjJJsJJJsssVsJJJJJsfJJJ7JFsfAs<JsJsssJsJsJJsFJ<sssFsJsossJJFJJJ7sJJJJJJJJJJ<JFJJ<FJ7JJJ7JJJJJF<JJsssJJJJJFssssssJJFssJJJoJJJJJsJJJJJFJJJJJJFJJJ7JFJeJJJJJJssJJFJJJ<FJJJFJJJJJFsJJsFJJJJsJJJJFJJJJJJJJoJssssJJJJJsssoJFJJJJVsJsJJJJJJJsssssoJssJJAJsFJJJJJJJJJJsJJJsJssJJJJJJossssssossossosJF7FAAJFsoFFFFFFFFFAFooooFoo
Here, there are 340 quality scores. But there are 342 base calls. You can count them. There are 350 characters in
.$.$.$.$.................,.......,,,,,,..,..,,,,,,..................,....,,.....,,,....,,,...-1A..............................................,,,,,,...................................-1A........,,....................,...................,,,,,,.......,,.................,,,,,,.,.................,.....,,......,,,,,,,,,,,.,,,.....................,,,..,,
If we combine them according to the documentation, here are 4 .$
, 2 -1A
, 69 ,
and 267 .
, for a total of 342 base calls. So there's a discrepancy between the number of base calls and the number of quality scores. And this happens quite frequently in this mpileup file. I have no idea how to assign the quality scores to the base calls when they differ like this.
Basically, for each position, I'd like to get ALL POSSIBLE mutations. If there's a tool that can do this instead, I would appreciate some pointers. As far as I can tell, bcftools tools does its own internal filtering and doesn't give all mutations at a position. So I had to resort to writing my own R function to parse the file. Here's the function
interpret_mpileup <- function(refbase, s, q) {
# Split the characters in the calls and recombine them according to the documentation
calls <- strsplit(s, "")[[1]]
vec <- c()
curCall <- c()
curChar <- c()
prevChar <- c()
i <- 1
while(i <= length(calls)) {
prevChar <- curChar
curChar <- calls[i]
if (curChar == "$") { # If it's a $, append to the previous call
vecLen <- length(vec)
vec[vecLen] <- paste0(vec[vecLen], "$")
i <- i + 1
} else if (curChar %in% c(".", ",", ">", "<", "*", "a", "c", "g", "t", "A", "C", "G", "T")) { # If it's one of these characters, create a new call
vec <- c(vec, curChar)
curCall <- curChar
i <- i + 1
} else if (curChar == "^") { # If it's a ^, this call will include the current and following two characters
curCall <- paste0(curChar, calls[i+1], calls[i+2])
vec <- c(vec, curCall)
i <- i + 3
} else if ((curChar == "+") || (curChar == "-")) { # If it's a + or -, take as many of the following characters as long as they fall into the list below. This handles variable-sized indel calls.
curCall <- curChar
i <- i + 1
while(calls[i] %in% c("A","C","G","T","a","c","g","t","*","#",as.character(0:9))) {
curCall <- paste0(curCall, calls[i])
i <- i + 1
}
vec <- c(vec, curCall)
} else { # For debugging, print this when some other case is encountered
i <- i + 1
print(paste(curChar, "undefined"))
}
}
quals <- strsplit(q, "")[[1]] # Quality scores are single character. Split the score string by character
retDF <- data.frame(Calls = vec, Quality = quals)
# We're not interested in base calls that agree with the reference (i.e. . and ,). We're not
# interested in introns (< and >)
retDF <- retDF[-which(retDF$Calls %in% c(".", ",", "<", ">")),]
# Remove any calls that indicate the beginning of a read
inds_readBeginning <- grep("\\^.[\\.,]", retDF$Calls)
if (length(inds_readBeginning) > 0) {
retDF <- retDF[-inds_readBeginning,]
}
# Remove any calls that indicate the end of a read
inds_readEndings <- grep("[\\.,]\\$", retDF$Calls)
if (length(inds_readEndings) > 0) {
retDF <- retDF[-inds_readEndings,]
}
# Convert the quality scores to numerical values
retDF$Quality <- sapply(retDF$Quality, utf8ToInt)
return(retDF)
}
And this function would be called with something like interpret_mpileup("T", "...$,,A+1CCG", "JJJJJJJ")
to give a result of
Calls Quality
A 74
+1CCG 74
So if someone can tell me where I'm going wrong in my code resulting in the discrepancy, I would really really appreciate it.
Thank you
I think your maths is wrong.
4 "$" and 2 "-1A" is 4+2*3 = 10 characters to remove from the sequence line, reducing it from 350 to 340 chars.
Note that a modern samtools has
--no-output-ends
to remove ^char and $ sumbols, and--no-output-ins --no-output-ins --no-output-del --no-output-del
to totally remove indel markups. That should get you exactly the same number of chars for both without needing complex parsing to remove these fields (unless of course you're trying to use them for your variant calling).