Multi-Sample VCF to Phylogenetic Tree
6
16
11.2 years ago
William ★ 5.3k

How can I construct a phylogenetic tree based on the SNPs shared between strains? I have whole-genome SNP calls for 10 different strains in a multi-sample VCF.

Are there any tools that can take the VCF as input for creating phylogenetic trees? Or do I need to convert the multi-sample VCF to another kind of matrix? If so, which kind of matrix would that be, and how can I create it from the VCF?

Is there a list somewhere of popular packages that can be used for creating phylogenetic trees? Or guides on how to go from a multi-sample VCF to a phylogenetic tree?

vcf • 36k views
ADD COMMENT
2

You can use SNPhylo --> http://chibba.pgml.uga.edu/snphylo/

The only suggestion I have for getting it to work is that the chromosome IDs in your VCF file have to be numbers (1, 2, 3, 4, ...), not names like Chr1 or Gm01.

ADD REPLY
1

Dear all,

I wanted to revive this topic a little, since I am working with exactly the same kind of data. I have a VCF file and would like to construct a tree of the samples in the VCF.

What would be the way to go today? Are there any new tools?

Thanks,
Gregor

ADD REPLY
1

The SNPhylo web page is no longer there; a friend passed me the repo link: https://github.com/thlee/SNPhylo

Hope this helps, Gregor

ADD REPLY
0

Hi all,

My question is related to this one, which is why I'm writing here and not in a new post. Is there a way to make a phylogenetic tree, just as in William's answer, but using a maximum likelihood method? There is a function in the SNPRelate package called snpgdsIBDMLE which performs this task, but I'm not able to get a phylogenetic tree image from it.

Any suggestion?

ADD REPLY
0

Hi William, I read your post and saw no mention of recombination or rooting. I am trying to do the same thing as you but am worried about these issues. Given recombination, a phylogenetic tree does not accurately reflect the underlying processes. All I want my tree to do is summarise the similarities between the strains and give an alternative to the PCAs. What did you end up doing, and what do you think the tree means? I used unlinked SNPs for the PCAs and was thinking about using all of the SNPs for the trees. Is this sensible?

ADD REPLY
16
11.1 years ago
William ★ 5.3k

Here is what I did with the SNPRelate package to get a dendrogram and a PCA from my multi-sample VCF file:

library(gdsfmt)
library(SNPRelate)

# VCF to GDS
snpgdsVCF2GDS("my.vcf", "my.gds")
snpgdsSummary("my.gds")
genofile <- openfn.gds("my.gds")

# dendrogram
dissMatrix <- snpgdsDiss(genofile, sample.id=NULL, snp.id=NULL, autosome.only=TRUE, remove.monosnp=TRUE, maf=NaN, missing.rate=NaN, num.thread=10, verbose=TRUE)
# cluster on the dissimilarity matrix computed above
snpHCluster <- snpgdsHCluster(dissMatrix, sample.id=NULL, need.mat=TRUE, hang=0.25)
cutTree <- snpgdsCutTree(snpHCluster, z.threshold=15, outlier.n=5, n.perm=5000, samp.group=NULL, col.outlier="red", col.list=NULL, pch.outlier=4, pch.list=NULL, label.H=FALSE, label.Z=TRUE, verbose=TRUE)
plot(cutTree$dendrogram, leaflab="perpendicular")

# PCA
sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))
# no population annotation here, so the sample IDs double as group labels
pop_code <- read.gdsn(index.gdsn(genofile, "sample.id"))
pca <- snpgdsPCA(genofile)
tab <- data.frame(sample.id = pca$sample.id, pop = factor(pop_code)[match(pca$sample.id, sample.id)], EV1 = pca$eigenvect[,1], EV2 = pca$eigenvect[,2], stringsAsFactors = FALSE)
plot(tab$EV2, tab$EV1, col=as.integer(tab$pop), xlab="eigenvector 2", ylab="eigenvector 1")
legend("topleft", legend=levels(tab$pop), pch="o", col=1:nlevels(tab$pop))
ADD COMMENT
1

Where are you defining the dist object in the code above? It's throwing an error. Am I supposed to replace dist with dissMatrix?

Also, this line needs another closing parenthesis:

pop_code <- read.gdsn(index.gdsn(genofile, "sample.id")

Oh, and this line is messed up too; it needs an opening parenthesis after data.frame:

tab <- data.framesample.id = pca$sample.id,pop = factor(pop_code)[match(pca$sample.id, sample.id)],EV1 = pca$eigenvect[,1],EV2 = pca$eigenvect[,2],stringsAsFactors = FALSE)
ADD REPLY
0

Hi, it seems like a good tool, but how should I proceed if I want to construct a phylogenetic tree from separate .vcf files for different samples? Do I have to concatenate them into a multi-sample VCF, or can I handle them independently to create the tree?

Any advice is helpful.

Best Celso

ADD REPLY
0

You could use either approach.

  1. Use a tool to create a multi-sample VCF (e.g. VCFtools)

    or

  2. Use snpgdsVCF2GDS() to read in each VCF, then merge in R using snpgdsCombineGeno().
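
A minimal sketch of option 2, assuming one single-sample VCF per strain. The file names and the helper `merge_vcfs_to_gds()` are hypothetical, and the block only touches SNPRelate when the package and the input files are actually present. As far as I know, snpgdsCombineGeno() keeps only the SNPs found in every input file, so per-sample VCFs that list variant sites only may lose a lot of positions; treat that as something to check on your own data.

```r
# Hypothetical file names: one single-sample VCF per strain.
vcf_files <- c("strain1.vcf", "strain2.vcf", "strain3.vcf")

# Derive a GDS file name from a VCF file name (.vcf or .vcf.gz).
gds_name <- function(vcf) sub("\\.vcf(\\.gz)?$", ".gds", vcf)

# Hypothetical helper: convert each VCF to GDS, then combine the GDS files.
merge_vcfs_to_gds <- function(vcf_files, out_gds = "merged.gds") {
  gds_files <- gds_name(vcf_files)
  for (i in seq_along(vcf_files)) {
    SNPRelate::snpgdsVCF2GDS(vcf_files[i], gds_files[i], verbose = FALSE)
  }
  # Merge the per-sample GDS files into one multi-sample GDS file
  SNPRelate::snpgdsCombineGeno(gds_files, out_gds)
  out_gds
}

# Run only if SNPRelate is installed and the (hypothetical) inputs exist.
if (requireNamespace("SNPRelate", quietly = TRUE) && all(file.exists(vcf_files))) {
  merged <- merge_vcfs_to_gds(vcf_files)
  SNPRelate::snpgdsSummary(merged)
}
```

The merged GDS file can then be opened with snpgdsOpen() and used exactly like one produced from a multi-sample VCF.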

ADD REPLY
0

Thanks Neil, I'll try your recommendations.

ADD REPLY
0

When I execute the R script I get the following error: "Removing 181 non-autosomal SNPs Error in snpgdsDiss(genofile) : There is no SNP!" Apparently all of my SNPs are being removed as non-autosomal, so none are left. Which parameter do I need to change to avoid this problem? Thanks!
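
One likely cause, assuming the chromosome names in the VCF are not plain numbers (for example Chr1 or a contig name): with autosome.only=TRUE such SNPs are treated as non-autosomal and dropped. A hedged sketch of the usual workaround is to pass autosome.only=FALSE. The toy VCF below is made up purely so the example is self-contained, and the block only runs when SNPRelate is installed.

```r
# Toy VCF with a non-numeric chromosome name ("Chr1"); all values are made up.
hdr <- c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
         "FORMAT", "s1", "s2", "s3")
vcf_row <- function(pos, gts) {
  paste(c("Chr1", pos, ".", "A", "G", ".", "PASS", ".", "GT", gts), collapse = "\t")
}
vcf <- c(
  "##fileformat=VCFv4.2",
  "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">",
  paste(hdr, collapse = "\t"),
  vcf_row(100, c("0/0", "0/1", "1/1")),
  vcf_row(200, c("0/1", "1/1", "0/0")),
  vcf_row(300, c("1/1", "0/0", "0/1")),
  vcf_row(400, c("0/0", "1/1", "0/1"))
)

diss_dim <- NULL
if (requireNamespace("SNPRelate", quietly = TRUE)) {
  vcf_file <- tempfile(fileext = ".vcf")
  gds_file <- tempfile(fileext = ".gds")
  writeLines(vcf, vcf_file)
  SNPRelate::snpgdsVCF2GDS(vcf_file, gds_file, verbose = FALSE)
  genofile <- SNPRelate::snpgdsOpen(gds_file)

  # autosome.only = FALSE keeps SNPs on every chromosome or contig
  diss <- SNPRelate::snpgdsDiss(genofile, autosome.only = FALSE, verbose = FALSE)
  diss_dim <- dim(diss$diss)

  SNPRelate::snpgdsClose(genofile)
}
```

The same argument exists on the other analysis functions used in this thread (snpgdsIBS, snpgdsPCA, snpgdsLDpruning), so it needs to be set consistently in each call.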

ADD REPLY
0

As I understand it, this creates a PCA from all of the SNPs in the VCF. How can one filter a VCF so as to get only unlinked variants?

ADD REPLY
3
11.2 years ago
Neilfws 49k

I'd look at the R package SNPRelate. It will read VCF files, create various matrices and plot dendrograms using e.g. an identity-by-state matrix. See examples in the vignette PDF.
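
To illustrate what an identity-by-state dendrogram does conceptually, here is a small base-R sketch on a made-up genotype matrix. It is not SNPRelate's implementation; `ibs_matrix()` is a hypothetical helper, and the average-linkage clustering is an assumption chosen for the illustration.

```r
# Toy genotype matrix: rows are samples, columns are SNPs,
# values are alternate-allele counts (0, 1, 2); NA marks a missing call.
geno <- rbind(
  strainA = c(0, 0, 2, 2, 1, 0),
  strainB = c(0, 0, 2, 2, 1, 1),
  strainC = c(2, 2, 0, 0, 1, 0),
  strainD = c(2, 2, 0, 1, NA, 0)
)

# Pairwise identity-by-state: mean fraction of alleles shared per SNP,
# ignoring SNPs where either sample is missing.
ibs_matrix <- function(g) {
  n <- nrow(g)
  m <- matrix(1, n, n, dimnames = list(rownames(g), rownames(g)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      m[i, j] <- mean(1 - abs(g[i, ] - g[j, ]) / 2, na.rm = TRUE)
    }
  }
  m
}

ibs <- ibs_matrix(geno)

# Turn similarity into distance and cluster hierarchically
hc <- hclust(as.dist(1 - ibs), method = "average")
plot(as.dendrogram(hc), main = "Toy IBS dendrogram")
```

Samples that share more alleles have a smaller 1 - IBS distance and are joined lower in the dendrogram.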

ADD COMMENT
0

Thanks, the program worked really nicely for creating both a phylogenetic tree and a principal component analysis plot almost directly from the VCF file.

ADD REPLY
3
8.2 years ago

Hi!

I had SNPs for 39 WES samples, some of them from related individuals. I wanted to check kinship to see if there had been any mislabelling during sample processing.

In the end I built a nice tree with the code below.

It was validated with independent observations (family diagrams, ancestry). All the unrelated individuals were connected above the FC (first cousins) line; all sibs, half-sibs, and other relatives were where they should be.

The crucial steps were using the IBS function to calculate distances and taking LD into account.
Without these two I got only misleading trees. The default LD threshold (0.2) removed too many SNPs, so I increased it to 0.5 to achieve higher sensitivity. LD filtering reduced 500K SNPs to 16K.

#install SNPRelate as described here:
#http://www.bioconductor.org/packages/release/bioc/vignettes/SNPRelate/inst/doc/SNPRelateTutorial.html#installation-of-the-package-snprelate
#prepare multisample vcf with bcftools merge 

library(gdsfmt)
library(SNPRelate)
setwd("[your dir here]")  # replace with the path to your working directory

#biallelic by default
snpgdsVCF2GDS("dataset1.vcf", "dataset1.gds")
snpgdsSummary("dataset1.gds")
genofile = snpgdsOpen("dataset1.gds")

#LD based SNP pruning
set.seed(1000)
snpset = snpgdsLDpruning(genofile,ld.threshold = 0.5)
snp.id=unlist(snpset)

# distance matrix - use IBS
dissMatrix  =  snpgdsIBS(genofile , sample.id=NULL, snp.id=snp.id, autosome.only=TRUE, 
    remove.monosnp=TRUE,  maf=NaN, missing.rate=NaN, num.thread=2, verbose=TRUE)
snpgdsClose(genofile)

snpHCluster =  snpgdsHCluster(dissMatrix, sample.id=NULL, need.mat=TRUE, hang=0.01)

cutTree = snpgdsCutTree(snpHCluster, z.threshold=15, outlier.n=5, n.perm = 5000, samp.group=NULL, 
    col.outlier="red", col.list=NULL, pch.outlier=4, pch.list=NULL,label.H=FALSE, label.Z=TRUE, 
    verbose=TRUE)

snpgdsDrawTree(cutTree, main = "Dataset 1",edgePar=list(col=rgb(0.5,0.5,0.5,0.75),t.col="black"),
    y.label.kinship=T,leaflab="perpendicular")
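
If the tree is needed in other phylogenetics software, one option is to export it in Newick format. The sketch below is an assumption-laden illustration, not part of the workflow above: it uses the ape package (only if installed) on a made-up distance matrix. With real data, the hclust object stored in the snpgdsHCluster() result (assumed here to be its `hclust` component) could be passed to ape::as.phylo() in the same way.

```r
# Made-up pairwise distances between four strains (e.g. 1 - IBS).
toy_dist <- matrix(
  c(0.00, 0.08, 0.67, 0.70,
    0.08, 0.00, 0.65, 0.72,
    0.67, 0.65, 0.00, 0.10,
    0.70, 0.72, 0.10, 0.00),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("strain", LETTERS[1:4]), paste0("strain", LETTERS[1:4]))
)

# Hierarchical clustering, analogous to the dendrogram step above
hc <- hclust(as.dist(toy_dist), method = "average")

newick <- NULL
nj_tree <- NULL
if (requireNamespace("ape", quietly = TRUE)) {
  # Convert the clustering to a phylo object and get a Newick string
  newick <- ape::write.tree(ape::as.phylo(hc))
  cat(newick, "\n")

  # Alternative: a neighbor-joining tree built from the same distances
  nj_tree <- ape::nj(as.dist(toy_dist))
}
```

To write the Newick string to disk instead of printing it, pass a file name to ape::write.tree() via its `file` argument.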

I hope this will be helpful for somebody.

Sergey

ADD COMMENT
2
10.4 years ago

Thanks for the tip! I want to draw a dendrogram based on RAD-tag SNPs from a non-model organism, using the VCF file output by Stacks. I managed to do this (only) with IBS, using these commands:

library(gdsfmt)
library(SNPRelate)
snpgdsVCF2GDS("batch_511.vcf", "batch_511.gds")
snpgdsSummary("batch_511.gds")
genofile <- openfn.gds("batch_511.gds")
set.seed(100)
ibs.hc <- snpgdsHCluster(snpgdsIBS(genofile, num.thread=2))
rv <- snpgdsCutTree(ibs.hc)
plot(rv$dendrogram, leaflab="perpendicular", main="Batch 511")

I suppose the dendrogram is based on distance clustering, but that's not clear from the documentation. Does anyone know? And what are the units of the scalebar in the resulting graph?

Finally, I haven't yet succeeded in getting any results with the ML alternative in SNPRelate. Should it be possible at all without phased data?

Louis Boumans

ADD COMMENT
1
9.8 years ago
danrdanny ▴ 70

I just finished installing SNPRelate on R 3.1.1, OS X Yosemite. To save yourself some time, do the following.

First, make sure you have gfortran installed. Follow the instructions for installing gfortran here: http://www.thecoatlessprofessor.com/programming/rcpp-rcpparmadillo-and-os-x-mavericks-lgfortran-and-lquadmath-error

Second, download and install both gdsfmt and SNPRelate from source: https://github.com/zhengxwen/SNPRelate

ADD COMMENT
1
5.6 years ago
rm.umayal24 ▴ 10

Hi all,

To create a phylogenetic tree from a whole-genome SNP file, including human data, the following software may be helpful.

It is called VCF2PopTree and is available at http://sankarsubramanian.net/dat/index.html. It does not need any dependencies: a single HTML file is enough to get the phylogenetic tree.

It is very simple and straightforward. Highly recommended for evolutionary biologists and population geneticists.

ADD COMMENT
0

I used it on SNPs from a single chromosome. Is that okay?

ADD REPLY
0

Yes. It's okay. It doesn't matter whether you use SNPs from one chromosome or more.

ADD REPLY
0

How about a bacterial genome? I have a pairwise SNP distance matrix, and I want to create a cluster-based, core-SNP radial phylogenetic tree and see the differences in SNPs between my samples, and also between each sample and the reference.

ADD REPLY
