Hello biostars,
I am trying to get into popgen analysis in ANGSD and am currently working on some summary statistics. This made me think a bit more about allele polarization for the D-statistic, f3 and f4 statistics, the SFS and other analyses.
To estimate the D-statistic, ANGSD asks for an ancestral FASTA file. However, I am not sure what kind of FASTA it should be. If all my BAMs are aligned to, let's say, hg19, but the outgroup is chimp, should I provide a reference PanTro genome? In that case the coordinates would not match, because the BAMs are aligned to hg19.
Or should I convert a PanTro BAM file aligned to hg19 into some kind of consensus FASTA? Or, finally, I could realign all BAMs to the chimp genome and then use these realigned BAMs together with PanTro for the analysis. What is the best way to do this?
I guess it would be better to use a real outgroup to polarize alleles (especially for the SFS), but some papers (such as this one) use a non-outgroup reference and work with the folded SFS instead, with no problems.
In general, is there an optimal strategy for this kind of popgen decision making?
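To make the folded versus unfolded distinction concrete, here is a minimal sketch of how an unfolded SFS can be collapsed into a folded one. It is plain Python for illustration only, not ANGSD's implementation; the function name `fold_sfs` and the example counts are made up.

```python
def fold_sfs(unfolded):
    """Fold an unfolded SFS.

    `unfolded[i]` is the number of sites with i derived alleles among
    n sampled chromosomes, so the list has n + 1 entries.
    The result is indexed by minor allele count (0 .. n // 2).
    """
    n = len(unfolded) - 1  # number of sampled chromosomes
    folded = []
    for i in range(n // 2 + 1):
        j = n - i
        # Without polarization, classes i and n - i cannot be told apart,
        # so their counts are summed (the middle class is counted once).
        folded.append(unfolded[i] if i == j else unfolded[i] + unfolded[j])
    return folded


# Hypothetical counts for n = 4 chromosomes
example = [100, 20, 10, 5, 2]
print(fold_sfs(example))
```

Folding throws away the ancestral/derived information, which is why it does not need an outgroup at all.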
Apologies that no one else has responded. The analysis you are aiming to do is very specific, but very interesting, I must admit.
From what I can see, ANGSD could accept one BAM aligned to hg19 and another aligned to Pan troglodytes; however, this may not necessarily be the correct way to run the program.
I noticed this recent study, which appears to have run ANGSD separately on 3 different species: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4788117/
Thanks! Yeah, it is not an easy question. I ended up aligning chimp to hg19. The other part of my question is very theoretical. I looked through the literature to see what people do, and they do whatever the data allow. Some, for example, do not have a sequenced outgroup, so they just use the reference.
Hi Alice, I am working with the unfolded SFS, but I don't know how to use a real outgroup to polarize alleles. Can you give me some suggestions?
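As a rough illustration of what polarizing with an outgroup means, here is a minimal sketch. It assumes the outgroup (for example chimp) is already in the same coordinates as the samples, as with the chimp-on-hg19 alignment mentioned above, so that each site has one outgroup base. The function `derived_count` and its arguments are hypothetical. It works on hard allele counts for clarity, whereas ANGSD works with genotype likelihoods.

```python
def derived_count(ref, alt, alt_count, n_chrom, ancestral):
    """Return the derived allele count at a biallelic site, or None.

    ref, alt   : the two alleles observed in the sample
    alt_count  : number of chromosomes carrying `alt`
    n_chrom    : total number of sampled chromosomes
    ancestral  : outgroup base at this site (assumed to be the ancestral state)
    """
    ref, alt, ancestral = ref.upper(), alt.upper(), ancestral.upper()
    if ancestral == ref:
        return alt_count            # alt is the derived allele
    if ancestral == alt:
        return n_chrom - alt_count  # ref is the derived allele
    # Outgroup base is missing (e.g. N) or matches neither allele:
    # the site cannot be polarized and is skipped.
    return None


print(derived_count("A", "G", 3, 10, "A"))
print(derived_count("A", "G", 3, 10, "G"))
print(derived_count("A", "G", 3, 10, "N"))
```

Sites that return None are dropped from an unfolded SFS. If too many sites are lost this way, or no outgroup is available, the folded SFS is the usual fallback.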