Is There Any R Package To Parse Cigar-Element Of Sam Files?
2
3
Entering edit mode
13.5 years ago
Eva ▴ 30

Hey,

i need to parse the Cigar-Element or rather the MD field of SAM files to get the reference sequence, is there any script/tool or package (especially for R)?


For example:

Cigar: 4M

MD: 0C0T2

Sequence: ACGT

result what i want is: CTGT

so that i can just count the mismatches by comparing strings, or is there any other way?

cigar samtools r parsing • 9.5k views
ADD COMMENT
1
Entering edit mode

Not an answer really, but CIGAR parsing is best tackled with regular expressions. Thus, the existing Perl and Python BAM APIs are a better solution. IMHO, this sort of work is not in R's wheelhouse.

ADD REPLY
1
Entering edit mode

HTseq: has parse_cigar function to work with CIGAR strings:

http://www-huber.embl.de/users/anders/HTSeq/doc/alignments.html#cigar-strings

ADD REPLY
0
Entering edit mode

Why not retrieving the reference sequence just by matching the exact coordinate of the read in the reference file and then make your comparison ? (instead of trying to regenerate ref with CIGAR)

ADD REPLY
0
Entering edit mode

thank you all for your answers i will try it

ADD REPLY
4
Entering edit mode
12.8 years ago
Michael 55k

I have made some very simple R-functions that can parse the CIGAR string. I put it here for your reference.

matcher <- function(pattern, x) {

  ind = gregexpr(pattern, x)[[1]]
  start = as.numeric(ind)
  end = start + attr(ind, "match.length")- 2
  apply(cbind(start,end), 1, function(y) substr(x, start=y[1], stop=y[2]));
}
doone <- function(c, cigar) {
  pat <- paste("\\d+", c , sep="")
  sum(as.numeric(matcher(pat, cigar)), na.rm=T)
}


## takes a cigar string and parses it, not very fast but...
cigarsums <- function(cigar, chars=c("M","N","D","I","S","H", "P", "X", "=")) {
   sapply (chars, doone, cigar)
}

Now this can be used in the following way:

library(Rsamtools)
example(readBamGappedAlignments)
 sapply (aln1[140:150,]@cigar, cigarsums)
  40M 35M 36M 18M5I12M 40M 14M5I17M 11M5I19M 35M 35M 35M 35M
M  40  35  36       30  40       31       30  35  35  35  35
N   0   0   0        0   0        0        0   0   0   0   0
D   0   0   0        0   0        0        0   0   0   0   0
I   0   0   0        5   0        5        5   0   0   0   0
S   0   0   0        0   0        0        0   0   0   0   0
H   0   0   0        0   0        0        0   0   0   0   0
ADD COMMENT
1
Entering edit mode
13.1 years ago
User 2383 ▴ 10

I was compelled to zombify this thread, because I hacked my way through an analogous problem a few days ago. It's an ugly solution. There are probably more elegant ones in R, and surely more elegant ones using another tool.

## create example data
vcfInput=structure(list(V1 = c("chr37", "chr37"), V2 = c(27100083L, 27100196L
), V3 = c("G", "C"), V4 = c("C", "G"), V5 = c(999, 43.5), V6 = c("DP=70;VDB=0.0256;AF1=0.9549;AC1=25;DP4=1,3,37,24;MQ=48;FQ=72.8;PV4=0.3,2.1e-13,0.15,1", 
"DP=78;VDB=0.0305;AF1=0.04422;AC1=1;DP4=36,37,2,3;MQ=49;FQ=43.7;PV4=1,1e-06,0.45,0.4"
)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6"), row.names = 1:2, class = "data.frame")
colnames(vcfInput)=c("chr", "pos","ref","alt","qual","packed")

## define a trivial convenience function
getAllListSubItemsByIndex <-function (theList, theIndex){ paste(lapply(theList,function(x) x=x[[theIndex]]))}

## identify each kind of coded value packed in the string:
valNames=unique(unlist(lapply(1:nrow(vcfInput),function(i) getAllListSubItemsByIndex (strsplit(strsplit(vcfInput $packed[i], split=";")[[1]], split="="),1))))

## add on to the existing data frame
newColIndices=(1 + ncol(vcfInput)):(ncol(vcfInput)+length(valNames))
vcfInput[, newColIndices]=NA
colnames(vcfInput)[newColIndices]= valNames

## collect the values
for (i in 1:nrow(vcfInput)){
    pairs=strsplit(vcfInput$packed[i], split=";")
    vals=getAllListSubItemsByIndex (strsplit(pairs[[1]], split="="),2)
    currentValNames=getAllListSubItemsByIndex (strsplit(pairs[[1]], split="="),1)
    vcfInput [i, currentValNames]= vals
    }
ADD COMMENT
0
Entering edit mode

What is the code trying to do? Looks like it's parsing VCF in R - which may be useful to someone - but not SAM or CIGAR formats.

ADD REPLY

Login before adding your answer.

Traffic: 1694 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6