As described in the title, I would like to score or validate some phylogenetic trees we created. To clarify the current state, here is some background information:
We are trying to analyse heterogeneity in cancer samples of non-metastasised colon cancer. We have 5 patients with 6 samples per patient (5 cancer samples, 1 blood sample). Sequencing, variant calling and annotation were done by an external institute using GATK and snpEff. We received .bam, .vcf and .tsv files for every sample, where the .tsv files contain pretty much the same information as the .vcf files.
We decided to go with the .tsv files and have carried out the following steps so far:
Filter
- Filter based on read quality: FILTER="PASS" (bash script)
- Remove common mutations: db_snp.COMMON !=1 (bash script)
Create a table which file has which mutation (bash scripts)
- Combine the following columns into an ID string for each mutation: CHROM, POS, ALT
- Create the table/csv: rows = file IDs; columns = mutation IDs. The entries in this table are binary (the file has the mutation or not). The table looks like this:
                   mutationID1  mutationID2  mutationID3
    patient1file1  1            0            1
    patient1file2  1            0            1
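The filter and table steps above could be sketched roughly as follows. This is a Python illustration, not the bash scripts actually used. The column names FILTER, db_snp.COMMON, CHROM, POS and ALT are the ones mentioned above; the `_` separator in the ID string, the assumption that the COMMON flag is stored as the text `1`, and the function names `kept_mutation_ids` and `presence_table` are made up for this example.

```python
import csv
import io


def kept_mutation_ids(tsv_text):
    """Return the set of mutation IDs in one sample's .tsv that survive both filters."""
    ids = set()
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if row["FILTER"] != "PASS":       # keep only calls that passed the caller's filters
            continue
        if row["db_snp.COMMON"] == "1":   # drop mutations flagged as common (assumed encoding)
            continue
        # ID string built from CHROM, POS and ALT; the "_" separator is an assumption
        ids.add("_".join((row["CHROM"], row["POS"], row["ALT"])))
    return ids


def presence_table(ids_by_file):
    """Build the binary table: one row per file, one column per mutation ID."""
    columns = sorted(set().union(*ids_by_file.values()))
    rows = {
        file_id: [int(mutation in ids) for mutation in columns]
        for file_id, ids in ids_by_file.items()
    }
    return columns, rows


# Tiny made-up example with two files
header = "CHROM\tPOS\tALT\tFILTER\tdb_snp.COMMON\n"
file1 = header + "1\t100\tA\tPASS\t0\n1\t200\tT\tPASS\t1\n2\t300\tG\tLowQual\t0\n"
file2 = header + "1\t100\tA\tPASS\t0\n3\t400\tC\tPASS\t\n"

columns, rows = presence_table({
    "patient1file1": kept_mutation_ids(file1),
    "patient1file2": kept_mutation_ids(file2),
})
```

Each row of `rows` then corresponds to one line of the table shown above.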
Create phylogenetic trees
- We loaded this .csv file into R and created several phylogenetic trees, some with all samples of all patients, some with all samples of one patient. In short, the R code is something like:
hclust(dist(dataframe, method="euclidean"),method="average")
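To make explicit what this call does with a binary table, here is a minimal Python sketch of the same idea. For 0/1 rows, the squared Euclidean distance between two files is simply the number of mutations in which they differ, and average linkage repeatedly merges the two clusters with the smallest mean pairwise distance. The function names and the nested-tuple tree representation are assumptions for illustration. R's `hclust` also records merge heights and handles ties in its own way, so this is not a drop-in replacement.

```python
import math
from itertools import combinations


def euclidean(row_a, row_b):
    """Euclidean distance between two rows of the binary table."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row_a, row_b)))


def hamming(row_a, row_b):
    """Number of mutations present in exactly one of the two files."""
    return sum(a != b for a, b in zip(row_a, row_b))


def average_linkage(rows):
    """Naive average-linkage clustering; returns the tree as nested tuples of file IDs."""
    # Each cluster is (tuple of leaf names, nested-tuple subtree)
    clusters = [((name,), name) for name in rows]

    def mean_distance(c1, c2):
        pairs = [(a, b) for a in c1[0] for b in c2[0]]
        return sum(euclidean(rows[a], rows[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest mean pairwise distance
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: mean_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (clusters[i][0] + clusters[j][0], (clusters[i][1], clusters[j][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][1]


# Made-up example rows in the same layout as the table above
example_rows = {
    "patient1file1": [1, 0, 1],
    "patient1file2": [1, 0, 1],
    "patient2file1": [0, 1, 0],
}
example_tree = average_linkage(example_rows)
```

Since the distance only counts differing entries, every filter step that adds or removes columns changes the tree directly.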
Now we would like to score these trees to experiment a bit with our filter steps and tree creation methods. Do you have any ideas how to score such trees?
If any further information is needed, I'm happy to provide it. I am a student and this is my first post here as well as my first time working with NGS data, so please bear with me.
To my understanding, if the molecular clock assumption is fulfilled, a dendrogram is a simple phylogenetic tree. Is this incorrect? I'm aware that the molecular clock assumption might not always be fulfilled in cancer; this was just the best I could come up with at the moment.
I want to try different filter methods and parameters, as well as different hierarchical clustering algorithms. To compare the results, a way to determine the chance of each individual tree being correct, or representing the data set best, would come in handy.
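Short of a real correctness score, one thing that can at least be quantified is how much two candidate trees differ, for example trees built from the same samples with different filter settings. Below is a minimal Python sketch of a rooted Robinson-Foulds style distance: it counts the clades (groups of samples under one internal node) that appear in one tree but not the other. The nested-tuple tree representation and the function names are assumptions for illustration. A distance of 0 only means the two trees have the same branching order, not that either one is correct.

```python
def leaf_set(tree):
    """All sample names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """Set of clades (as frozensets of sample names), excluding the root clade."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(visit(child) for child in node))
        found.add(members)
        return members

    # The root clade contains every sample and carries no information
    found.discard(visit(tree))
    return found


def rooted_rf(tree_a, tree_b):
    """Number of clades found in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must contain the same samples")
    return len(clades(tree_a) ^ clades(tree_b))


# Two made-up trees over the same five samples
tree_a = ((("s1", "s2"), "s3"), ("s4", "s5"))
tree_b = ((("s1", "s3"), "s2"), ("s4", "s5"))
difference = rooted_rf(tree_a, tree_b)
```

Here the two trees disagree only on which of s1, s2 and s3 group together first, so only those two small clades contribute to the distance.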
Thanks for the tip with RAxML, I will try it out and let you know.
Clustering based on overall similarity says nothing about the evolutionary relationships among the taxa of interest, only how similar they are. Such methods were popular in one subfield of systematics where there was an initial assumption that we can't know anything about the 'actual' evolutionary history of organisms; the resulting trees were called phenograms. A dendrogram simply represents a tree-like branching structure sensu lato.