I have created a bowtie aligned .sam file with this command:
bowtie -S -p 18 /media/owner/ref.fa test1.fastq test1.sam
which I then converted to .bam file like this:
samtools view -bS -o test1.bam test1.sam
From this test1.bam, I was able to get the total number of plus and minus strand of reads aligned to the reference loci using bedtools coverage maping (for positives reads and negative reads) like this:
bedtools coverage -a my_region.bed -b test1.bam -bed -d -s | gzip > test1_region_pos.tsv.gz
Since I used default bowtie option which allows for three mismatches, I want to know how many reads are aligned with terminal mismatches (i.e,last three, last two and last one base mismatched at 3' region of the aligned reads at their loci position). How can I get this information? Thank for your help in advance.
Could you please elaborate a bit? Do you mean MD tag within samtools? Thanks for your help.
The MD tag is one of the predefined standards in SAM/BAM file specification, used to represent mismatches. More info can be found here.
EDIT for more information: The MD tag contains the REFERENCE nucleotide at a position if there is a mismatch, so any read that contains a terminal mismatch with contain G/A/T/C in the last position of this tag.
I need positional mismatch. For example, I need total number of reads of length 18 bases where with total number of mismatch at 18th position. Same for 19bases long sequences position and so on.
For that you could check the CIGAR flag ans check for reads with the last base as mismatch. For 18 bp reads it should be 17M1S
I believe 'S' represents soft clipping in the CIGAR. 'X' represents mismatch, but was not used in earlier versions of SAM.
As it's the last bp I guess it will be associated to soft clipping
Bowtie didn't support clipping, only end-to-end alignment (caveat: relying on memory of old software I haven't used in a long time).