I will enumerate these mostly for future reference, since it is not easy to find this information in a readable form.
Plus, as it turns out, it is not so easy to write these out in simpler language. I might actually be wrong, so I welcome corrections.
The fault is with the flag concept as it is confusing, non-intuitive and needlessly complicated:
- paired means that at the time of the alignment both reads of the pair were present and the aligner presumably assessed both when finding the most likely location (called mate rescue). The flag does not necessarily mean that the mate is also present in the BAM file, since the BAM file may have been post-processed.
- proper pair is a flag with no precise requirement. The aligner decides when to set the flag based on the discretion of the designer. Usually, it means that the read pairs have a certain orientation, both reads align, and the read pairs are within a certain distance.
- duplicate means that the read or template sequence has been identified as non-unique. It states that the alignment file contains at least one more read or template with an identical sequence. Typically, separate software needs to be run to detect and mark duplicates, and the process may detect identities of reads or of read pairs (templates). The duplication may be decided by sequence identity or by alignment identity.
- secondary alignments represent multiple alignments of a read. Usually, only those secondary alignments are reported that are not overlapping and do not cover the entire read. A read that fully matches with identical scores in multiple locations may not have all secondary alignments listed. Instead, the mapping quality will be zero and the alternative locations may be indicated in an aligner-specific tag (BWA, for example, uses XA; the SA tag in the specification describes the other parts of a chimeric, that is supplementary, alignment). Secondary alignments usually represent partial alignments of a read in different locations of the genome. There is a lot of leeway in how aligners report alternative alignments.
- supplementary alignments are what are called "chimeric" alignments. These are alignments that cover the entire read but do not follow consecutively in a linear fashion. Only a subset of aligners can detect chimeric alignments.
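To make the list above concrete, here is a minimal sketch that decodes a FLAG value into the bits it contains. The bit values are the ones defined in the SAM specification; the short names and the function name decode_flag are just my own labels for illustration, not part of any library.

```python
# Bit values of the SAM FLAG field, as defined in the SAM specification.
# The names on the right are informal labels chosen for this sketch.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


print(decode_flag(99))    # ['paired', 'proper_pair', 'mate_reverse', 'read1']
print(decode_flag(2064))  # ['reverse', 'supplementary']
```

Note that the flag is just a sum of independent bits, so any combination can occur, including ones that make little biological sense.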
Take for example the sequence AAATTTGGGCCC that produces two alignments, at 10000 and 20000.
When the alignments are non-overlapping one alignment will be marked as secondary:
10000 AATT primary
20000 GCCC secondary
If the alignments are non-linear and the aligned regions can be joined to cover the entire read then the alignment would be represented as supplementary like so:
10000 GGGCCC primary
20000 AAATTT supplementary
Annoyingly, there is no "primary" flag; I am just listing it like so for clarity. An alignment is primary if it is neither secondary nor supplementary...
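Since "primary" has to be derived, a small sketch of that test may help. The function name is_primary is hypothetical; the two bit values (0x100 for secondary, 0x800 for supplementary) come from the SAM specification.

```python
SECONDARY = 0x100      # secondary alignment bit
SUPPLEMENTARY = 0x800  # supplementary alignment bit


def is_primary(flag):
    """True when neither the secondary nor the supplementary bit is set.

    This says nothing about whether the read is mapped; an unmapped
    record (0x4) without these two bits also passes this test.
    """
    return not (flag & (SECONDARY | SUPPLEMENTARY))


print(is_primary(99))    # True
print(is_primary(256))   # False, secondary
print(is_primary(2064))  # False, supplementary
```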
Now what happens if we have this:
10000 AAATTT
20000 GGGCCC
The reported alignment may mark the second as secondary, or only a single alignment may be reported, as a spliced alignment:
AAATTTNNN...NNNGGGCCC
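In the spliced case the gap is normally written into the CIGAR string with the N operation (skipped region on the reference). Below is a rough sketch of what that CIGAR would look like, assuming 1-based start positions and exact matches; the helper spliced_cigar is made up for this illustration and is not how any particular aligner builds its output.

```python
def spliced_cigar(blocks):
    """Build a CIGAR string from (start, length) aligned blocks.

    blocks must be sorted by 1-based reference start and must not overlap.
    Aligned blocks become M operations and the gaps between them become N.
    """
    parts = []
    next_free = None  # first reference position after the previous block
    for start, length in blocks:
        if next_free is not None:
            parts.append(f"{start - next_free}N")
        parts.append(f"{length}M")
        next_free = start + length
    return "".join(parts)


# AAATTT at 10000 and GGGCCC at 20000, reported as one spliced alignment
print(spliced_cigar([(10000, 6), (20000, 6)]))  # 6M9994N6M
```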
The SAM specification can be read at:
https://samtools.github.io/hts-specs/SAMv1.pdf
SAM tags:
https://samtools.github.io/hts-specs/SAMtags.pdf
see SAM flags meaning
Note that samtools flagstat only reports which flags are set. If the software you used to align never applies the duplicate flag, then it will never be set, even if your sample has duplicates.
See also the flagstat man page, which describes each of these in terms of the FLAG bits that categorise it.
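The point about flagstat only counting bits can be shown with a toy tally. This is a simplified sketch and not the actual flagstat implementation; the function tally_flags and the example FLAG values are made up for illustration.

```python
from collections import Counter

SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def tally_flags(flags):
    """Count a few categories purely from the FLAG bits of each record."""
    counts = Counter(total=0, secondary=0, supplementary=0, duplicates=0)
    for flag in flags:
        counts["total"] += 1
        if flag & SECONDARY:
            counts["secondary"] += 1
        if flag & SUPPLEMENTARY:
            counts["supplementary"] += 1
        if flag & DUPLICATE:
            counts["duplicates"] += 1
    return counts


# The same two read pairs, before and after a duplicate-marking step.
before_marking = [99, 147, 99, 147]
after_marking = [99, 147, 99 | DUPLICATE, 147 | DUPLICATE]

print(tally_flags(before_marking)["duplicates"])  # 0
print(tally_flags(after_marking)["duplicates"])   # 2
```

The reads are identical in both lists; the only thing that changes the count is whether some tool set the 0x400 bit.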