Question

How to score/validate a phylogenetic tree built upon whole-exon mutation data of different cancer samples?

9.8 years ago

ronald.findling ▴ 10

As described in the title, I would like to score or validate some phylogenetic trees we created. To clarify the current state, here is some background information:

We are trying to analyse heterogeneity in cancer samples of non-metastasised colon cancer. We have 5 patients with 6 samples per patient (5 cancer samples, 1 blood sample). Sequencing, variant calling and annotation were done by an external institute using GATK and snpEff. We received .bam, .vcf and .tsv files for every sample; the .tsv files contain pretty much the same information as the .vcf files.

We decided to go with the .tsv files and have carried out the following steps so far:

Filter
1. Filter based on read quality: FILTER="PASS" (bash script)
2. Remove common mutations: db_snp.COMMON !=1 (bash script)
Create a table showing which file has which mutation (bash scripts)
1. Combine the following columns into an ID string for each mutation: CHROM POS ALT
2. Create the table/csv: rows=fileIDs; columns=mutationIDs. The entries in this table are binary (the file either has the mutation or not). The table looks like this:
```
               mutationID1     mutationID2     mutationID3
patient1file1     1               0               1
patient1file2     1               0               1
```
Create phylogenetic trees
1. We loaded this .csv file into R and created several phylogenetic trees: some with all samples of all patients, some with all samples of one patient. In short, the R code is something like hclust(dist(dataframe, method="euclidean"), method="average")
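
A minimal Python sketch of the table-building and distance steps above, using made-up variant calls. The sample names, the `calls` dictionary and the helper names are hypothetical; the original work used bash scripts and R, so this only illustrates the logic.

```python
import math

# Hypothetical input: per-file lists of (CHROM, POS, ALT) tuples that already
# passed the FILTER="PASS" and db_snp.COMMON != 1 steps.
calls = {
    "patient1file1": [("chr1", 1000, "A"), ("chr2", 500, "T")],
    "patient1file2": [("chr1", 1000, "A"), ("chr3", 42, "G")],
    "patient1file3": [("chr1", 1000, "A"), ("chr2", 500, "T"), ("chr3", 42, "G")],
}


def mutation_id(chrom, pos, alt):
    """Combine CHROM, POS and ALT into one ID string."""
    return f"{chrom}:{pos}:{alt}"


def presence_matrix(calls):
    """Return (sorted mutation IDs, {file: [0/1 per mutation ID]})."""
    ids = sorted({mutation_id(*c) for muts in calls.values() for c in muts})
    rows = {}
    for name, muts in calls.items():
        present = {mutation_id(*c) for c in muts}
        rows[name] = [1 if m in present else 0 for m in ids]
    return ids, rows


def euclidean(a, b):
    """Euclidean distance between two equally long 0/1 vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


ids, rows = presence_matrix(calls)
d12 = euclidean(rows["patient1file1"], rows["patient1file2"])
```

For binary entries, the squared Euclidean distance equals the number of mutations present in exactly one of the two files, so the clustering here is driven purely by counts of differing mutations.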

Now we would like to score these trees so that we can experiment with our filter steps and tree-building methods. Do you have any ideas on how to score such trees?

If any further information is needed, I'm happy to provide it. I am a student and this is my first post here as well as my first time working with NGS data, so please bear with me.

SNP whole-exon next-gen Phylogen-tree R • 3.4k views

updated 2.6 years ago by Ram 44k • written 9.8 years ago by ronald.findling ▴ 10

Ram · Answer 1 · 2015-01-30

9.8 years ago

Brice Sarver ★ 3.8k

Your tree is not phylogenetic; it's a dendrogram that merely represents clustering.
What do you mean by 'scoring' trees? In true phylogenetics, you often compare trees estimated under different models of nucleotide or amino acid sequence evolution or different statistical approaches. By using different distance methods passed to hclust(), you'll get different groupings with the caveat that distances are estimated in different ways.

If you want to truly estimate a phylogenetic tree (i.e., a tree that describes the evolutionary pattern of ancestry), you would be best off correcting genetic distances or estimating the tree under an explicit model. If you have variants in a VCF, a good first-pass method to look at would be RAxML: an approximate likelihood approach that handles large datasets well and runs in parallel.
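
As a rough illustration of getting such data into a tree-estimation program, the sketch below writes a presence/absence matrix as relaxed PHYLIP text, an alignment format RAxML reads. The function name `to_phylip` and the sample rows are hypothetical. Coding somatic mutations as binary characters is an assumption rather than something recommended above; RAxML's binary models (for example `-m BINGAMMA`) would be one way to analyse such a file.

```python
def to_phylip(rows):
    """Format {sample_name: [0/1, ...]} as relaxed PHYLIP text.

    First line: number of taxa and number of sites.
    Following lines: name, whitespace, then the 0/1 character string.
    """
    site_counts = {len(values) for values in rows.values()}
    if len(site_counts) != 1:
        raise ValueError("all samples need the same number of sites")
    width = max(len(name) for name in rows) + 2
    lines = [f"{len(rows)} {site_counts.pop()}"]
    for name, values in rows.items():
        if any(ch.isspace() for ch in name):
            raise ValueError(f"sample name contains whitespace: {name!r}")
        lines.append(name.ljust(width) + "".join(str(v) for v in values))
    return "\n".join(lines) + "\n"


# Hypothetical presence/absence rows for three samples of one patient.
example_rows = {
    "patient1file1": [1, 1, 0],
    "patient1file2": [1, 0, 1],
    "patient1file3": [1, 1, 1],
}
phylip_text = to_phylip(example_rows)
```

Writing `phylip_text` to a file gives an input that can be passed to a tree-estimation program; which model suits cancer samples is a separate question that this sketch does not settle.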

Hope this helps.

updated 2.6 years ago by Ram 44k • written 9.8 years ago by Brice Sarver ★ 3.8k

To my understanding, if the molecular clock assumption is fulfilled, a dendrogram is a simple phylogenetic tree. Is this incorrect? I'm aware that the molecular clock assumption might not always be fulfilled in cancer; this was just the best I could come up with at the moment.
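
One way to make the clock question concrete: under a strict molecular clock, the true distances between samples are ultrametric, meaning that in every triple of samples the two largest pairwise distances are equal. The sketch below checks this three-point condition on a distance matrix. The function name, the tolerance and the two toy matrices are made up for illustration; real distances are noisy, so an exact pass is unlikely.

```python
from itertools import combinations


def is_ultrametric(dist, tol=1e-9):
    """Three-point condition: in every triple of samples, the two largest
    pairwise distances must be equal (within tol)."""
    n = len(dist)
    for i, j, k in combinations(range(n), 3):
        _, middle, largest = sorted((dist[i][j], dist[i][k], dist[j][k]))
        if largest - middle > tol:
            return False
    return True


# Toy symmetric distance matrices (hypothetical values).
clock_like = [
    [0, 2, 6],
    [2, 0, 6],
    [6, 6, 0],
]
not_clock_like = [
    [0, 2, 6],
    [2, 0, 4],
    [6, 4, 0],
]

clock_ok = is_ultrametric(clock_like)
clock_violated = not is_ultrametric(not_clock_like)
```

Average-linkage clustering always returns an ultrametric dendrogram, so how far the input distances are from ultrametric gives a rough sense of how much the clustering has to distort them.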

I want to try different filter methods and parameters, as well as different hierarchical clustering algorithms. To compare the results, a way to estimate how likely each tree is to be correct, or how well it represents the data set, would come in handy.
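
One commonly used score for comparing dendrograms built from the same data is the cophenetic correlation: the Pearson correlation between the input distances and the heights at which each pair of samples is first joined in the tree. The sketch below is a small pure-Python illustration with average linkage; all function names and the toy matrix are hypothetical. In R, the same quantity is usually obtained by correlating `dist(...)` with `cophenetic()` applied to the `hclust` result. Note that it measures how faithfully the dendrogram preserves the distances, not whether the tree is evolutionarily correct.

```python
import math
from itertools import combinations


def upgma_cophenetic(dist):
    """Average-linkage clustering; returns the cophenetic distance matrix."""
    n = len(dist)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member leaves
    d = {(i, j): dist[i][j] for i, j in combinations(range(n), 2)}
    coph = [[0.0] * n for _ in range(n)]
    next_id = n
    while len(clusters) > 1:
        (a, b), height = min(d.items(), key=lambda kv: kv[1])
        # Every leaf pair split across a and b is first joined at this height.
        for i in clusters[a]:
            for j in clusters[b]:
                coph[i][j] = coph[j][i] = height
        merged = clusters[a] + clusters[b]
        # Average linkage: mean of all original leaf-to-leaf distances.
        for c, members in clusters.items():
            if c in (a, b):
                continue
            total = sum(dist[i][j] for i in merged for j in members)
            d[(c, next_id)] = total / (len(merged) * len(members))
        d = {p: v for p, v in d.items() if a not in p and b not in p}
        del clusters[a], clusters[b]
        clusters[next_id] = merged
        next_id += 1
    return coph


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def cophenetic_correlation(dist):
    """Correlation between input distances and dendrogram (cophenetic) distances."""
    coph = upgma_cophenetic(dist)
    pairs = list(combinations(range(len(dist)), 2))
    return pearson([dist[i][j] for i, j in pairs],
                   [coph[i][j] for i, j in pairs])


# Toy ultrametric matrix (hypothetical): the dendrogram reproduces it exactly.
toy = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]
score = cophenetic_correlation(toy)
```

Values near 1 indicate that the dendrogram distorts the distances little, which allows a like-for-like comparison of linkage methods or filter settings on the same samples.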

Thanks for the tip with RAxML, I will try it out and let you know.

9.8 years ago by ronald.findling ▴ 10

Clustering based on overall similarity says nothing about the evolutionary relationships among the taxa of interest, only how similar they are. Such similarity-based trees were popular in one subfield of systematics that started from the assumption that we can't know anything about the 'actual' evolutionary history of organisms; they were called phenograms. A dendrogram simply represents a tree-like branching structure sensu lato.

updated 2.6 years ago by Ram 44k • written 9.8 years ago by Brice Sarver ★ 3.8k