Hi!
I am getting confused with couple samtools outputs. I have run bwa-mem with -M flag to retrieve chimeric reads with 256 flag, and converted the sam file to bam file. Problem is, in the original bam file, if I want to count the lines with 256 flag, and if I want to retrieve "SA" tags, the numbers do not add up. Should not they be equal if they are both to show chimeric reads?
Example:
samtools view -h myfile.bam | grep 'SA:Z' > myfileSA_filtered.bam
samtools view -f 256 -c myfile.bam
1987783
and
samtools view -c myfileSA_filtered.bam
300281
I expect these numbers to be the same but they are quite far away from each other. I would appreciate your help!
I think the 256 flag only means that it is not a primary alignment. This can then be a chimeric one that has been tagged as 256 due to the -M option, but also a multimapping read. If reads multimap, then one is randomly chosen to be primary, all others get flagged as secondary. That means SA:Z is only a subset of 256, and therefore, the number for the grep is smaller than for counting 256.
Yes, followed the same thought, but then I cut the QNAME part, sorted and counted the unique ones in 256flagged bam file. Then I counted the unique QNAMEs in myfileSA_filtered.bam, and the numbers were still different
first: how do know there is no 'SA:Z' substring in the quality string ?
Because I did not only use SA:Z, it also includes the accession number of the genome, therefore making it specific to its chimerics
second: 256 if for 'secondary alignment' = an alternative alignment for the current read. I think you want 2048 'supplementary alignement'
well, I wanted to retrieve the chimeric reads, so I thought to grab the ones which have another place to be. I have used bwa-mem -M -a to get all alignments, but since I have used -M, as a flag, I used 256 instead of 2048. Would it be a wrong choice?
256 would be used for a read mapping a duplicated gene: a read could have a two possible position (primary + not-primary=256 alignement)
a first part of the read is mapped on a region while the other part is clipped. There is a second alignment (2048) for this second part, mapping another region of the genome.