If a read overlaps a region perfectly except for a single masked base pair position, does bowtie consider that position a mismatch in the read? Or does bowtie ignore that position, and consider it a perfect alignment?
I'm thinking about aligning to a reference genome that is hard-masked at SNP positions, and I'm wondering if this will cause fewer reads to align due to more mismatches.
Alignments involving one or more ambiguous reference characters (N, -, R, Y, etc.) are considered invalid by Bowtie. This is true only for ambiguous characters in the reference; alignments involving ambiguous characters in the read are legal, subject to the alignment policy. Ambiguous characters in the read mismatch all other characters. Alignments that “fall off” the reference sequence are not considered valid.
So, from what I understand, any ambiguous character in the reference that overlaps a potential alignment location will leave the read unmapped.
You can easily explore this by simply making a dummy reference genome, e.g.:
#/ Dummy genomes:
echo ">dummy TAGCTGCGCGCTACGATCGATCGACTGATCAGCGGCTNTAGCTGTACATGCA" | tr " " "\n" > dummy_N.fa
echo ">dummy TAGCTGCGCGCTACGATCGATCGACTGATCAGCGGCTCTAGCTGTACATGCA" | tr " " "\n" > dummy_no_N.fa
#/ Index:
for i in dummy_N.fa dummy_no_N.fa; do bowtie-build "$i" "$i"; done
#/ Dummy read (same as the dummy_N.fa sequence)
echo ">dummy_read TAGCTGCGCGCTACGATCGATCGACTGATCAGCGGCTNTAGCTGTACATGCA" | tr " " "\n" > dummy_read.fa
bowtie -f dummy_N.fa dummy_read.fa
# reads processed: 1
# reads with at least one alignment: 0 (0.00%)
# reads that failed to align: 1 (100.00%)
No alignments
bowtie -f dummy_no_N.fa dummy_read.fa
# reads processed: 1
# reads with at least one alignment: 1 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 1 alignments
I'm asking because I'm interested in allele-specific expression, and I was wondering whether I could mask SNPs in my reference so that read alignment isn't biased toward the reference allele. But I see that if I replace SNPs with N in the reference, no reads overlapping those SNPs will align.
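For anyone who wants to reproduce the N-masking step on a toy example, here is a minimal sketch. It assumes a single-record FASTA with the whole sequence on one line, and a hypothetical file snps.txt listing 1-based SNP positions, one per line; the file names ref.fa, snps.txt and ref_masked.fa are made up for illustration. For a real multi-line, multi-chromosome genome a dedicated tool such as bedtools maskfasta is probably the better route.

```shell
# Toy reference: same sequence as the unmasked dummy genome in the test above
printf '>dummy\nTAGCTGCGCGCTACGATCGATCGACTGATCAGCGGCTCTAGCTGTACATGCA\n' > ref.fa

# Hypothetical SNP list: 1-based positions, one per line
printf '38\n' > snps.txt

# Replace every listed position with N (single-line sequence assumed)
awk '
  NR == FNR { snp[$1] = 1; next }   # first file: remember the SNP positions
  /^>/      { print; next }         # FASTA header: pass through unchanged
  {
    out = ""
    for (i = 1; i <= length($0); i++)
      out = out ((i in snp) ? "N" : substr($0, i, 1))
    print out
  }
' snps.txt ref.fa > ref_masked.fa

cat ref_masked.fa
```

The masked sequence has an N at position 38, which is exactly the situation in which Bowtie rejects any alignment spanning that base.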
When bowtie is aligning a read overlapping a SNP site, is there a way to make bowtie align a read with A, C, G, or T at that SNP site without considering any nucleotide a mismatch?
How long are your reads?
The reads are 50 base pairs.