Hi all,
I have been trying to create a same-length consensus fasta by mapping reads to a reference genome for input into angsd as the ancestral sequence. However, the scaffolds of the resulting consensus always seem to be a little short, likely due to indels etc.. I have iterated through several versions of code (bcftools call -c or bcftools call -m, not using any data filters, etc.). Is there a specific path or set of filters I should use to be doing this? I found some former issues documented of this in previous versions of samtools/bcftools that had to do with an older version of the program but that doesn't seem to be the issue here. I also tried to use LAST for mapping, but that didn't seem to sort it out.
Apologies for the juvenile question and thanks for any help.
It is not clear what the desired course should be if there are indels in the genome. For example, if the genome is missing 10 bases from the middle how would you "make" that full size again? Would you want to insert 10 bases?
Thanks for your reply! Yes--I agree this is problematic. I think that for indels it would be best to have N's where they are in the consensus sequence. This is discussed here (http://seqanswers.com/forums/showthread.php?p=129352#post129352 the thread is a bit long, sorry, comment #14-16 is relevant to this issue). There is some reference to making a "dummy sam", but I am not exactly sure what they mean by that.
This is such an unusual problem that there might not be a tool do exactly what you need.
One perhaps the simplest option would be to align your best consensus to the old with
bwa mem
then run sam2pairwisehttps://github.com/mlafave/sam2pairwise
Now you have the two sequences lined up. At that point the "hard" part is done. But you'll still need to write a simple but custom script in say Python to navigate the two in parallel and produce your final consensus.