VCF + Phylogenetic tree : how to calculate the distance between two set of samples
3
0
Entering edit mode
10.0 years ago

I'm trying to build a simple tool that would build a phylogenetic tree from a VCF.

My aim is to verify that the related + sequenced individuals are close in the phylogenetic tree.

My current code works with a small set of related individuals but it fails when some extra individuals are added. Current algorithm:

  • a genotype is an enumeration Gtype: HOM_REF, HOM_ALT, HET . 'N/A' is converted to HOM_REF.
  • I'm looking iteratively for the 'nodes' having the smallest distance. A node is a set of samples. The algorithm starts with all the possible pairs of samples.
  • For a pair of samples
    • AA vs AA: = distance=0
    • AA vs AB: = distance=5
    • AA vs BB: = distance=15
  • This is where I'm looking for the right algorithm : when merging two samples to create a 'merged node' how should I calculate the distance ?

My current algorithm for a position in the VCF is:

dist=0;
for(Gtype g1: allGenotypes(node1))
  for(Gtype g2: allGenotypes(node2))
      d += distance(g1,g2)
d /= ( countGenotypes(node1)+countGenotypes(node2))

What would be the best way to 'score' the distance between two sets of samples for a given position in the VCF?

vcf phylogenetic-tree distance • 4.6k views
ADD COMMENT
2
Entering edit mode
ADD COMMENT
2
Entering edit mode
10.0 years ago
David W 4.9k

It's not quite clear to me what makes a "merged" node, do you mean a group that is already "joined" as each other's closest relatives?

If so this seems like a classic hierarchical clustering problem. You could use UPGMA or neighbor-joining to create your tree from a distance matrix.

ADD COMMENT
1
Entering edit mode
10.0 years ago

I would use pairwise identity by state for the distance matrix.

ADD COMMENT

Login before adding your answer.

Traffic: 2559 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6