I use bwa-mem and samtools mpileup to align my species to a reference sequence for a few genes. My problem is that, first, the reference species is somewhat distantly related to the rest of the order I am studying. However, my real problem is that after running bwa-mem and gathering consensus sequences for each species after generating mpileup files and aligning all my sequences into one multiple sequence alignment - there are regions along the gene that contain gaps where maybe 8-15 out of 50 species had some reads align to that region. When I look at that region and for what species did get reads to align, it's very variable (in relation to the reference and to each other). I think this is why I cannot get anything to align, because flanking these regions are fairly conserved regions that I can get about 40-50 species to have mapped reads.
So my main question is - is there a way to set up the reference file so that I can somehow improve these variable regions to map. For example I tried inserting N's into these regions in the reference file hoping that it would allow more bases to align - I even tried leaving these areas as gaps hoping that would make it less stringent (a little bit de novo maybe?).
Just wondering if anyone knew of a way to improve the mapping for variable regions or when the reference species is distantly related. Thank you!
Short read aligners expect the reads to have very low variability, therefore using a reference genome which is not the same as the source will cause the problem you describe.
I can suggest trying to build your reference using de novo assembly if you have enough coverage and computer resources.
Also, if the reads are not so many, you can use Blast+, which has a mode for short reads alignment.
How would I use Blast+ if my organisms don't have references or have the genes sequenced previously?? The reference species exists on ensembl so would that be an option? I do have a lot of reads, each species has over 1 million single-end reads.
Instead of aligning to the closed-related reference with BWA, use Blast+, blastn has a "blastn-short" mode, also there is Magic-blast https://ncbi.github.io/magicblast/