SAM/BAM files are the kind of files with hidden treasures that need to be revealed to the inexperienced user. That is me...
My case: after analyzing a BAM file, I asked myself how many reads' mates remain unmapped.
One way to answer this is by analyzing the FLAG values with samtools.
I understand that each FLAG is a unique combination of individual flag bits. So all of these FLAG values (73, 89, 121, 153, 185, 137, 77, 141 and so on) contain the "8", which in turn should indicate that the mate read is unmapped. I got this information from this web page, which gave me an idea of what information the FLAGs can provide.
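To make the "contains the 8" idea concrete, here is a minimal sketch that splits a FLAG into its individual bits. The bit values follow the FLAG table in the SAM specification; the short labels and the helper name `decode_flag` are my own, chosen only for illustration.

```python
# Bit values from the FLAG table of the SAM specification.
# The short labels are illustrative names, not official identifiers.
FLAG_BITS = {
    1: "paired",
    2: "proper_pair",
    4: "unmapped",
    8: "mate_unmapped",
    16: "reverse",
    32: "mate_reverse",
    64: "read1",
    128: "read2",
    256: "secondary",
    512: "qcfail",
    1024: "duplicate",
    2048: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits that are set in a FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# The FLAG values listed in the question.
for flag in (73, 89, 121, 153, 185, 137, 77, 141):
    print(flag, decode_flag(flag))
```

Every value in that list decodes to a set of labels that includes `mate_unmapped`, which is what "contains the 8" means in terms of bits.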
Now a summary. To answer this question I have analyzed a single BAM file in two ways:
- One is by counting the number of `*` present in column 7 (the RNEXT value), because according to the official SAM file specification this could mean that the mate is unmapped ("This field is set as `*` when the information is unavailable"). In this case, I got over 65000 sequences whose mates could be unmapped.
- However, if I run `samtools view file.bam -f 8 | wc -l`, I end up with only 2903 sequences.
One possibility is that, when using the `-f` option, the program looks for a lonely "8" in the FLAG field. But if I look at column 2 of the BAM file, I don't find any FLAG consisting of only that lonely 8. That convinced me that `samtools view -f FLAG` tries to find any combination of FLAG values that intrinsically contains that 8, and thus it should tell me how many mates are unmapped.
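The behaviour described above matches how `-f` is documented: a record is kept only if all bits of the given value are set in its FLAG, whatever other bits are also set. A small sketch of that test follows; `passes_f` is a hypothetical helper and the list of FLAG values is made up for the example.

```python
def passes_f(flag, required):
    """Mimic the test behind `samtools view -f required`:
    keep a record only if every bit of `required` is set in its FLAG."""
    return (flag & required) == required


# Made-up FLAG values: some contain bit 8, some do not.
flags = [8, 73, 99, 147, 141, 0]
kept = [f for f in flags if passes_f(f, 8)]
print(kept)
```

So the filter is a bitwise test and not an equality test, which is why records with FLAG 73 or 141 are counted even though no record has a FLAG of exactly 8.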
With all this information, I am still not fully confident that I know the right answer to this question. Either I have serious doubts about the usefulness of the `-f` qualifier in the `samtools view` command, or maybe many other "lonely" flags should be included in the search, because I am missing some important information and/or the orientation does not seem to matter when the BAM file is generated.
So what exactly does the `*` value in the 7th column indicate?

Since all you said is "more robust", I take it you believe that 2903 is the correct answer.
The asterisk comes from the aligner, and each aligner has its own quirks in its output. For example, I recall that some aligners won't set mate alignment information if they align mates as singletons, though I don't recall exactly which ones do this.
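One way to see where the two counts diverge is to cross-tabulate RNEXT against the relevant FLAG bits. The sketch below works on SAM text lines; the helper name `tally` and the four example records are invented for illustration and do not come from the file in the question.

```python
from collections import Counter


def tally(sam_lines):
    """Count records by (RNEXT is '*', paired bit 1 set, mate-unmapped bit 8 set)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # column 2: FLAG
        rnext = fields[6]       # column 7: RNEXT
        counts[(rnext == "*", bool(flag & 1), bool(flag & 8))] += 1
    return counts


# Invented records, only to exercise the function.
example = [
    "r1\t73\tchr1\t100\t60\t4M\t=\t100\t0\tACGT\tIIII",
    "r2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t99\tchr1\t300\t60\t4M\t=\t350\t54\tACGT\tIIII",
    "r4\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]

for key, n in sorted(tally(example).items()):
    print(key, n)
```

On a real file the same function could be fed the output of `samtools view file.bam`, for example by passing `sys.stdin` as `sam_lines`. Records that have `*` in RNEXT but no bit 8 in the FLAG would then show up as their own category, which is the group that separates the two counts.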