I've come up with a homework exercise for my class on SNP mapping and benchmarking. The comical part is I can't figure out why I am not mapping the SNP myself. I'm not a novice at this anymore, but I'm banging my head against the wall to get this to work.
For the exercise, I have taken a 13,000 bp reference genomic region for an existing genome. I then changed one base in the reference in a well known gene and synthesized fake reads using wgsim
such that the reads have only one single nucleotide difference against the reference.
(I used this command to generate the reads: wgsim -N 200000 -S 0 -e 0 -s 0 -R 0 -X 0 -1 70 -2 70 reference_with_SNP.fasta test_forward.fastq test_reverse.fastq
)
Next, using the reference, I generated an index (I'm using bwa
here for the example) and then mapped the reads with the SNP to the reference without the SNP: bwa mem reference.fasta test_forward.fastq test_reverse.fastq > test.sam
. I'm getting a SAM
file with all perfect matches (i.e. 70M
). I've modified a lot of the bwa mem
flags, but I can't seem to record any SNPs.
If I take a fasta
file of my mapping fastq
reads I can align the fasta
to the reference and see where the bases don't match.
Any ideas if my error is in the mapping of my SNPs (wrong bwa mem
flag) or in the read synthesis with wgsim
?
CIGAR M is an alignment match, not necessarily a nucleotide match. I'm not sure if bwa uses extended cigar.