I'm currently trying to use samtools calmd to calculates MD and NM tags for my bam files, I noticed that the typical usage mentioned in the documentation (http://www.htslib.org/doc/samtools-calmd.html) is
samtools calmd -bAr aln.bam > aln.baq.bam
and the params -b -A -r means:
OPTIONS
-A
When used jointly with -r this option overwrites the original base quality.
-e
Convert a the read base to = if it is identical to the aligned reference base. Indel caller does not support the = bases at the moment.
-u
Output uncompressed BAM
-b
Output compressed BAM
-C INT
Coefficient to cap mapping quality of poorly mapped reads. See the mpileup command for details. [0]
-r
Compute the BQ tag (without -A) or cap base quality by BAQ (with -A).
-E
Extended BAQ calculation. This option trades specificity for sensitivity, though the effect is minor.
--no-PG
Do not add a @PG line to the header of the output file.
-@, --threads INT
Number of input/output compression threads to use in addition to main thread [0].
I noticed that rewriting base quality is mentioned here, what does base quality refer to here? Is it mapping Quality? If so, why does the calculation of the NM value affect the judgment of mapping Quality?
That quoted usage is an example and has a clear description about applying BAQ. It's not a recommandation for how it should be used in most scenarios. I don't really know anyone who uses BAQ in this way, so the example in the man page is probably very out-dated.
Also Calmd is frankly a basket-case of a subcommand. It appears to have been the catch-all for many features, mostly completely unrelated to calculating the MD tag. It even has an undocumented -N option which can be used to turn off the MD/NM calculation bit, so it doesn't do what the command claims to do! All things considered, I find it a bit of a baffling set of decisions, but we can't really change CLI without breaking pipelines.
The latter one is optimal for when it is part of a pipeline, where we want to be inputting and outputting uncompressed BAM between pipeline stages for efficiency. (I mean "pipeline" in the traditional UNIX sense, not the inefficient batch-scheduling task-running modern bioinformatics sense.)
Thanks for your answer, may I take it that samtools calmd [-b] [-u] are the typical usage to calculate MD/ND, in this application, most of the other parameters do not need too much relationship?
Thanks for your answer, may I take it that
samtools calmd [-b] [-u]
are the typical usage to calculate MD/ND, in this application, most of the other parameters do not need too much relationship?