Hello, I have 40,637 short sequences (probes) in a fastq file named "seq.fq".
- First, I mapped them against a reference genome (hg19) using BWA ("bwa aln ...").
- Then, I converted the alignments from suffix-array coordinates into chromosomal coordinates ("bwa samse ..."), and obtained the results into a SAM file named "seq_aln.sam".
Finally, I counted the number of occurrences for each flag:
$ grep -v "@" seq_aln.sam | awk -F"\t" 'BEGIN{print "flag\toccurrences"} {a[$2]++} END{for(i in a)print i"\t"a[i]}'
flag......occurrences
4.........3083
0.........19039
16.......18515
According to this page, the "4" flag means that the short sequence doesn't map onto the reference genome, and the "16" flag means that the short sequence does map on the reverse strand of the reference genome.
But, what does the "0" flag mean? According to this forum page, it means "the read is not paired and mapped, forward strand", which is unclear to me... Does it mean "it is not paired but it maps on forward strand"? Or "it is neither paired nor maps on forward strand"? Or "it is neither paired nor maps on any strand"?
At the end, does all this mean that I can work with only 18,515 short sequences out of 40,637?
Thanks for your help!
Thanks a lot, your answer should be added to the description of the SAM man page...
I agree with tflutre, this should be added to the SAM documentation. I work with influenza virus, a virus with an RNA genome in the negative-sense orientation. I expected my cDNA library would contain reads mapping to the reference sequence in both sense and antisense orientation, and I was getting flags of 0 and 16. The reads flagged as 0 certainly mapped places, but it would be nice to have some good documentation explaining this. Thanks!
@Jts, can you comment on Devon Ryan's comment in the answer to my question? I actually realize I agree with him on it that FLAG 0 only mean it's not reverse complemented, not necessarily mapped to forward/+ strand.
A flag field of 0 may or may not mean that a read is actually mapped to the + strand. All it means is that it wasn't reverse complemented. Whether that means it should be assigned to one strand or the other or none at all is dependent on other factors and isn't encoded in BAM files. As an example, most common single-end RNAseq experiments use a stranded protocol wherein alignments with a flag of 0 come from the "-" strand.
Link is here: SAM flag and select reads that map uniquely
@Jts has not been seen on Biostars for 4+ years, just so you know ....
@ genomax2, thanks for the note...