I have checked the documenation here: http://samtools.sourceforge.net/pileup.shtml
Here is an output of column 5 (read base column):
samtools view -h bamfile.bam | samtools mpileup -B -f reference.fasta - | awk '{split($5,chars,""); for(i=1;i<=length($5);i++) print chars[i]}' | sort | uniq -c
12864 $
527 *
590 +
236 -
59664 .
634 1
130 2
53 3
6 4
3 5
3 =
6205 A
4793 C
12863 F
4787 G
7 M
15 N
6692 T
12863 ^
The F
occurs because the mapping quality is always F
(thus I always see ^F
where a read starts). I find it a little strange that the mapping quality is F
for every read but that's actually not my question
[maybe it's because I chose -B
... need to continue reading what that BAQ computation is about ... I noticed that I lose information in the mpileup output if I don't use -B
, that's why I use it]
I wonder about these symbols as they are not mentioned in the docs (at least with respect to column 5): *
=
The =
symbol occurs in only 1 line whereas the *
symbols occur in multiple lines. The *
symbols also occur in that line where =
occurs, so here's an example showing both:
gi_identifier 378981 T 9 .$*+1=$*+1A.*+1=$*+1=$*+1M$*+1M$G+1T ]]]JJJ]]J
What do these symbols mean?
Make sure to refer to the correct documentation: http://www.htslib.org/doc/samtools.html
Oh and I am also wondering about that "+1M" ... ?
It does look weird to me as well. What aligner produced this bam file? Maybe post the sam header.
The bam file in the command is the result of a Geneious function that lets you extract the SNP columns of an alignment/bam:
https://support.geneious.com/entries/22583802-Exporting-snp-data-for-phylogenetic-analysis
When I use the same command on the original bam file, I only get expected/documented symbols.
I guess I'll just exclude Geneious from the pipeline and identify the SNP columns myself.