I was compelled to zombify this thread, because I hacked my way through an analogous problem a few days ago. It's an ugly solution. There are probably more elegant ones in R, and surely more elegant ones using another tool.
## create example data
vcfInput=structure(list(V1 = c("chr37", "chr37"), V2 = c(27100083L, 27100196L
), V3 = c("G", "C"), V4 = c("C", "G"), V5 = c(999, 43.5), V6 = c("DP=70;VDB=0.0256;AF1=0.9549;AC1=25;DP4=1,3,37,24;MQ=48;FQ=72.8;PV4=0.3,2.1e-13,0.15,1",
"DP=78;VDB=0.0305;AF1=0.04422;AC1=1;DP4=36,37,2,3;MQ=49;FQ=43.7;PV4=1,1e-06,0.45,0.4"
)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6"), row.names = 1:2, class = "data.frame")
colnames(vcfInput)=c("chr", "pos","ref","alt","qual","packed")
## define a trivial convenience function
getAllListSubItemsByIndex <-function (theList, theIndex){ paste(lapply(theList,function(x) x=x[[theIndex]]))}
## identify each kind of coded value packed in the string:
valNames=unique(unlist(lapply(1:nrow(vcfInput),function(i) getAllListSubItemsByIndex (strsplit(strsplit(vcfInput $packed[i], split=";")[[1]], split="="),1))))
## add on to the existing data frame
newColIndices=(1 + ncol(vcfInput)):(ncol(vcfInput)+length(valNames))
vcfInput[, newColIndices]=NA
colnames(vcfInput)[newColIndices]= valNames
## collect the values
for (i in 1:nrow(vcfInput)){
pairs=strsplit(vcfInput$packed[i], split=";")
vals=getAllListSubItemsByIndex (strsplit(pairs[[1]], split="="),2)
currentValNames=getAllListSubItemsByIndex (strsplit(pairs[[1]], split="="),1)
vcfInput [i, currentValNames]= vals
}
Not an answer really, but CIGAR parsing is best tackled with regular expressions. Thus, the existing Perl and Python BAM APIs are a better solution. IMHO, this sort of work is not in R's wheelhouse.
HTseq: has parse_cigar function to work with CIGAR strings:
http://www-huber.embl.de/users/anders/HTSeq/doc/alignments.html#cigar-strings
Why not retrieving the reference sequence just by matching the exact coordinate of the read in the reference file and then make your comparison ? (instead of trying to regenerate ref with CIGAR)
thank you all for your answers i will try it