Question

Phylogenetic Analyses using non-sequential sequences

0


9.3 years ago

sviatoslav.kendall ▴ 970

I have a data set that lists single-base mutations for a set of samples, but I do not have access to the sequencing data. Each mutation in each sample is listed on a separate line of the file, along with its genomic coordinates, reference allele, affected gene and a number of other variables. All samples are from the same species. I want to run some phylogenetic analyses, so I've constructed a pseudo-sequence for each sample.

The length of each pseudo-sequence equals the number of distinct genomic coordinates in my single-base mutation list. If a sample lacks a particular mutation, the reference base occupies that position of its pseudo-sequence. My pseudo-sequences file also contains a reference sequence, which is composed of the reference base at each genomic coordinate seen in the single-base mutations file.
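To make the construction concrete, here is a minimal Python sketch of it. The tuple layout `(sample, chrom, pos, ref, alt)`, the function name `build_pseudo_sequences` and the toy records are assumptions for illustration only; the real file has more columns and may be organised differently.

```python
from collections import defaultdict

def build_pseudo_sequences(mutations, samples=None):
    """Build one pseudo-sequence per sample from single-base mutation records.

    mutations: iterable of (sample, chrom, pos, ref, alt) tuples
               (a hypothetical layout, adapt it to the real file).
    samples:   optional list of sample names, so that samples with no
               listed mutations still get an all-reference sequence.
    """
    ref_at = {}                 # (chrom, pos) -> reference base
    alt_at = defaultdict(dict)  # sample -> {(chrom, pos): mutant base}
    for sample, chrom, pos, ref, alt in mutations:
        site = (chrom, pos)
        ref_at[site] = ref
        alt_at[sample][site] = alt

    # One fixed column order shared by every sequence.
    sites = sorted(ref_at)
    names = list(samples) if samples is not None else sorted(alt_at)

    seqs = {"reference": "".join(ref_at[s] for s in sites)}
    for name in names:
        # Mutant base where the sample carries the mutation, reference base otherwise.
        seqs[name] = "".join(alt_at[name].get(s, ref_at[s]) for s in sites)
    return sites, seqs

# Toy records, invented for the example.
mutations = [
    ("S1", "chr1", 100, "A", "G"),
    ("S2", "chr1", 100, "A", "G"),
    ("S2", "chr1", 250, "C", "T"),
    ("S3", "chr2", 40, "G", "A"),
]
sites, seqs = build_pseudo_sequences(mutations)
for name, seq in seqs.items():
    print(name, seq)
```

The sketch keeps the columns in sorted coordinate order, but any order works as long as every sample uses the same one.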

I want to better understand my options for phylogenetic analysis given that I haven't got biologically real sequences. I understand that some phylogenetic tree-building methods compare sequence motifs of varying lengths, while others treat each position as completely independent of all other positions. Furthermore, I know that some methods, such as PHYLIP's DNA Maximum Likelihood program (dnaml), require complete sequences despite the fact that they treat each base change as an independent event (in dnaml's case this has to do with weighing the number of changes against the number of bases that haven't changed).

It seems to me that my best options for comparing these pseudo-sequences are distance-based methods like neighbor-joining or Fitch-Margoliash, but as I am relatively new to phylogenetic analysis, I would very much appreciate any input on how I can compare these pseudo-sequences.
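For reference, distance-based methods such as neighbor-joining and Fitch-Margoliash start from a pairwise distance matrix. The sketch below computes the simplest such matrix, raw counts of differing positions (Hamming distances), for a few toy pseudo-sequences. The sequences and the function names are made up for illustration, and the counts are over variable sites only, so they are not per-site rates across the genome.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def distance_matrix(seqs):
    """Symmetric matrix of pairwise difference counts, as a dict of dicts."""
    names = list(seqs)
    dist = {n: {m: 0 for m in names} for n in names}
    for a, b in combinations(names, 2):
        dist[a][b] = dist[b][a] = hamming(seqs[a], seqs[b])
    return names, dist

# Toy pseudo-sequences, invented for the example.
seqs = {"reference": "ACG", "S1": "GCG", "S2": "GTG", "S3": "ACA"}
names, dist = distance_matrix(seqs)
for n in names:
    print(n.ljust(10), " ".join(str(dist[n][m]) for m in names))
```

A matrix like this is the kind of input a tree-building program would take. Any correction for multiple substitutions would need extra care here, because the invariant sites have been left out.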

sequence • 2.2k views

updated 2.8 years ago by Ram 45k • written 9.3 years ago by sviatoslav.kendall ▴ 970

2


I wouldn't call them pseudo-sequences; these are simply informative sites where variation is present. I did this in one of my papers: the genome assembly was a draft, so I gathered all SNPs on different contigs from all samples and generated a synthetic haplotype for each sample, placing at each site the corresponding base depending on whether the sample was reference or mutant there. I constructed the alignment in R and then inferred the phylogenetic relationships using conventional methods: ML, Bayesian, NJ.
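Once every sample has a string of the same length, the set is already an alignment and only needs to be written in a format that tree-building software accepts. As a rough sketch (in Python rather than the R used above, and with invented toy sequences and a hypothetical function name), this writes the strict sequential PHYLIP layout: a header with the number of sequences and sites, then each name padded to 10 characters followed by its sequence.

```python
def to_phylip(seqs):
    """Format a dict of equal-length sequences as strict sequential PHYLIP text."""
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    n_sites = lengths.pop()

    lines = [f" {len(seqs)} {n_sites}"]
    for name, seq in seqs.items():
        if len(name) > 10:
            raise ValueError(f"name longer than 10 characters: {name}")
        # Strict PHYLIP reserves exactly 10 characters for the name.
        lines.append(name.ljust(10) + seq)
    return "\n".join(lines) + "\n"

# Toy alignment, invented for the example.
text = to_phylip({"reference": "ACG", "S1": "GCG"})
print(text, end="")
```

Other programs prefer FASTA or NEXUS; the point is only that no real "alignment" step is needed, because the columns are homologous by construction.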

9.3 years ago by apelin20 ▴ 490

score 2 · Answer 1 · 2016-01-09

As all of your samples are from the same species, you would be safe using maximum parsimony (MP), as it needs nothing more than parsimony-informative sites (which you have, and it doesn't matter what order they are in).
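To illustrate what "parsimony-informative" means here: a site counts as informative when at least two different bases each occur in at least two sequences, so a mutation seen in only one sample does not help MP choose between trees. The sketch below (toy sequences and a hypothetical function name, both assumptions for illustration) picks out such columns and checks that reordering the columns leaves their number unchanged.

```python
from collections import Counter

def parsimony_informative_columns(seqs):
    """Indices of columns where at least two states each occur in >= 2 sequences."""
    rows = list(seqs.values())
    informative = []
    for i, column in enumerate(zip(*rows)):
        counts = Counter(column)
        states_seen_twice = sum(1 for c in counts.values() if c >= 2)
        if states_seen_twice >= 2:
            informative.append(i)
    return informative

# Toy alignment, invented for the example; the last column is a singleton.
seqs = {
    "reference": "ACGT",
    "S1": "GCGT",
    "S2": "GTGT",
    "S3": "ACAT",
    "S4": "GTAA",
}
informative = parsimony_informative_columns(seqs)
print(informative)

# Reversing the column order moves the informative columns but does not change their count.
reordered = {name: seq[::-1] for name, seq in seqs.items()}
print(len(parsimony_informative_columns(reordered)) == len(informative))
```

If most of the listed mutations turn out to be private to single samples, the number of informative columns may be much smaller than the length of the sequences.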

While ML and BI etc. are more nuanced, in that model selection can account for different rates of change and so on, if all samples are from within the one species I doubt that the time depth will generate enough homoplasy for explicit models to be necessary (you might also have difficulty selecting an optimal model). Your main worry in MP is homoplasy causing things like long-branch attraction; within a species this is unlikely to be a problem.