I have a few questions about a project I am conceptualizing, since I have no experience yet dealing with nuclear DNA.
I have SNP data on 64 samples from my population of interest (~330,000 SNPs per sample using the HumanCNV370-Quad).
I will likely be SNP typing some more in the near future, but I wanted to see what I can do with the existing SNP data with regard to estimating archaic introgression. I know Sánchez-Quinto et al. (2012) and Reich et al. (2011) used f4 statistics (described in depth by Patterson et al. (2012)) to estimate Neanderthal and Denisovan ancestry, respectively, using SNP data.
Basically, (f4(A,O;X,C))/(f4(A,O;B,C)) equals the estimator of Neanderthal ancestry when A=Denisovan, B=Neanderthal, C=YRI, O=Pan troglodytes or paniscus, and X=My data and other comparative populations.
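To make the ratio above concrete, here is a minimal sketch of how it could be computed from per-SNP allele frequencies. It assumes f4(A,B;C,D) is taken as the average over SNPs of (a - b)(c - d), with all frequencies referring to the same allele at each site and no missing data. The function names f4 and f4_ratio are just illustrative, and this does not produce standard errors (published analyses typically get those from a block jackknife).

```python
from typing import Sequence


def f4(a: Sequence[float], b: Sequence[float],
       c: Sequence[float], d: Sequence[float]) -> float:
    """f4(A,B;C,D) as the mean over SNPs of (a - b) * (c - d).

    Each argument is a list of allele frequencies (0..1), one per SNP,
    all for the same allele and in the same SNP order.
    """
    n = len(a)
    if n == 0 or not (len(b) == len(c) == len(d) == n):
        raise ValueError("need four non-empty frequency lists of equal length")
    total = sum((pa - pb) * (pc - pd) for pa, pb, pc, pd in zip(a, b, c, d))
    return total / n


def f4_ratio(a: Sequence[float], o: Sequence[float], x: Sequence[float],
             b: Sequence[float], c: Sequence[float]) -> float:
    """f4(A,O;X,C) / f4(A,O;B,C), the ratio written in the post above."""
    denominator = f4(a, o, b, c)
    if denominator == 0:
        raise ZeroDivisionError("f4(A,O;B,C) is zero; ratio is undefined")
    return f4(a, o, x, c) / denominator
```

With A=Denisovan, O=chimp, B=Neanderthal, C=YRI and X=the test population, a population X whose frequencies sit exactly halfway between B and C at every SNP gives a ratio of 0.5 under this definition.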
I need to be able to align a Pan genome to the high-coverage Altai Neanderthal and Denisovan genomes and the YRI genomes to extract polymorphism data for the ~330,000 rs numbers the array typed, and then filter out C-T/G-A (Modern-Archaic) sites. I have no idea how to approach this. Does anyone here have an idea of where I should start? Thanks!
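For the last filtering step, a minimal sketch might look like the following. It assumes each site is available as a (snp_id, allele1, allele2) tuple, which is a made-up layout for illustration, and it takes the conservative route of dropping every C/T and G/A site regardless of which genome carries which allele, rather than checking the modern-to-archaic direction.

```python
# The two allele pairs that make up transitions; ancient DNA damage
# (cytosine deamination) shows up as C->T and G->A changes.
TRANSITIONS = {frozenset("CT"), frozenset("GA")}


def is_transition(allele1: str, allele2: str) -> bool:
    """True if the two alleles form a C/T or G/A pair (in either order)."""
    return frozenset((allele1.upper(), allele2.upper())) in TRANSITIONS


def drop_transitions(sites):
    """Keep only the sites whose two alleles are not a transition pair.

    sites: iterable of (snp_id, allele1, allele2) tuples (hypothetical layout).
    """
    return [site for site in sites if not is_transition(site[1], site[2])]
```

Dropping all transitions costs roughly two thirds of the SNPs on a typical array, so whether to filter by direction instead is a judgment call.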
Just guessing here, but how about extracting a short sequence around each of your SNPs, say 150 bp that covers the SNP at a random position within the 150 bp, aligning those sequences to the other genomes, and calling SNPs on those?
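A rough sketch of the window extraction, assuming the reference chromosome is already loaded as a plain Python string and the SNP position has been converted to a 0-based index. The function name window_around is only illustrative, and alignment and SNP calling would still need separate tools.

```python
import random


def window_around(seq: str, snp_index: int, width: int = 150, rng=None):
    """Return (start, subsequence) for a window that contains the SNP.

    The SNP lands at a random offset inside the window; windows are
    clipped so they never run off either end of the sequence.
    """
    if not 0 <= snp_index < len(seq):
        raise IndexError("snp_index is outside the sequence")
    rng = rng or random.Random()
    width = min(width, len(seq))
    # Earliest and latest starts that still cover the SNP and fit in seq.
    lowest_start = max(0, snp_index - width + 1)
    highest_start = min(snp_index, len(seq) - width)
    start = rng.randint(lowest_start, highest_start)
    return start, seq[start:start + width]
```

Passing a seeded random.Random instance as rng makes the chosen offsets reproducible between runs.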
I unfortunately do not know how to do that. Beyond assembling mitogenomes or getting BEAST to run, I'm still very new to computational stuff.
I have managed to find the Altai Neanderthal VCF files and have started downloading them to the cluster. I am trying to filter the SNPs by these rs numbers, but I keep getting error messages as shown before.
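Without the error text it is hard to say what is going wrong, but for reference, one way to filter VCF records by rs number in plain Python could look like this. It assumes the third (ID) column of the VCF actually holds rs numbers; if the archaic VCFs have "." there instead, the SNPs would have to be matched by chromosome and position on the same reference build. The function names are just illustrative.

```python
def load_ids(lines):
    """Build a set of rs numbers from an iterable of lines, one ID per line."""
    return {line.strip() for line in lines if line.strip()}


def filter_vcf_by_id(vcf_lines, wanted_ids):
    """Yield header lines plus the records whose ID column matches wanted_ids.

    The VCF ID column may hold several IDs separated by semicolons,
    or "." when no ID is known.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep meta-information and the column header
            continue
        fields = line.split("\t", 3)
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        record_ids = fields[2].split(";")
        if any(record_id in wanted_ids for record_id in record_ids):
            yield line
```

For compressed files, the lines can come from gzip.open(path, "rt") in the standard library, so the whole VCF never has to sit in memory at once.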