Hi!
I had SNPs for 39 WES samples, some of them were from related individuals.
I wanted to check the kinship to see if there any mislabelling during the sample processing.
Finally I've build a nice tree with the code below.
It was validated with independent observations (family diagrams, ancestry).
All the unrelated individuals were connected above the FC (first cousins) line,
all sibs, half-sibs, and other relatives were where they should be.
The crucial steps were using IBS function to calculate distances and taking LD into account.
Without these two I got just misleading trees.
The default LD threshold (0.2) removed too many SNPs, I increased it to 0.5 to achieve higher sensitivity.
LD filtation reduced 500K SNPs to 16K.
#install SNPRelate as described here:
#http://www.bioconductor.org/packages/release/bioc/vignettes/SNPRelate/inst/doc/SNPRelateTutorial.html#installation-of-the-package-snprelate
#prepare multisample vcf with bcftools merge
library(gdsfmt)
library(SNPRelate)
setwd([your dir here])
#biallelic by default
snpgdsVCF2GDS("dataset1.vcf", "dataset1.gds")
snpgdsSummary("dataset1.gds")
genofile = snpgdsOpen("dataset1.gds")
#LD based SNP pruning
set.seed(1000)
snpset = snpgdsLDpruning(genofile,ld.threshold = 0.5)
snp.id=unlist(snpset)
# distance matrix - use IBS
dissMatrix = snpgdsIBS(genofile , sample.id=NULL, snp.id=snp.id, autosome.only=TRUE,
remove.monosnp=TRUE, maf=NaN, missing.rate=NaN, num.thread=2, verbose=TRUE)
snpgdsClose(genofile)
snpHCluster = snpgdsHCluster(dissMatrix, sample.id=NULL, need.mat=TRUE, hang=0.01)
cutTree = snpgdsCutTree(snpHCluster, z.threshold=15, outlier.n=5, n.perm = 5000, samp.group=NULL,
col.outlier="red", col.list=NULL, pch.outlier=4, pch.list=NULL,label.H=FALSE, label.Z=TRUE,
verbose=TRUE)
snpgdsDrawTree(cutTree, main = "Dataset 1",edgePar=list(col=rgb(0.5,0.5,0.5,0.75),t.col="black"),
y.label.kinship=T,leaflab="perpendicular")
I hope this will be helpful for somebody.
Sergey
You can use SNPhylo --> http://chibba.pgml.uga.edu/snphylo/
The only suggestion in order to get it work is that the chromosomes ids in your vcf file have to be numbers (1,2,3,4...) and not like Chr1 or Gm01.
Dear all,
Wanted to revive this topic a little bit, since I am working with exact same data. I have a VCF file and would like to construct a tree of the samples in the VCF.
What would be the way to go today? Are there any new tools?
Thanks,
Gregor
SNPhylo web page not there, a friend passed me the repo link: https://github.com/thlee/SNPhylo
Hope this helps, Gregor
Hi all,
My question is related to this one, so that's the reason I'm writing here and not in new post. There is a way to make a phylogenetic tree, just as in Williams answer, but using maximum likelihood method? There is a function inside
SNPRelate
package calledsnpgdsIBDMLE
which performs this task but I'm not able to get a phylogenetic tree image.Any suggestion?
Hi William, I read your post and saw no mention of recombination or rooting. I am trying to do the same thing as you but am worried about these issues. Given recombination a phylogenetic tree does not accurately reflect the underlying processes. All I want my tree to do is to summarise the similarities between the strains and give an alternative to the PCAs. What did you end up doing and what do you think the tree means? I used unlinked snps for the PCAs and was thinking about using all of the snps for the trees. Is this sensible?