The setup: I have a large number of sequences from a viral pathogen and the associated epidemiological data, collected during a major disease outbreak. I deeply suspect that the varied epidemiology seen across the outbreak (case severity and outcomes, transmission rate, etc.) is the result of changes in the viral sequence.
The question: So, how do I best correlate these epidemiological data with sequence data? In the crudest sense, how can I point at a SNP and say "this is associated with more severe cases"?
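For concreteness, the crude version of this question can be sketched as a per-site 2x2 test of allele against outcome. The following is only an illustration, not a recommended analysis: the function names, the toy alignment, and the binary `severe` labels are all assumptions made up for the example, and the test ignores shared ancestry entirely.

```python
from collections import Counter
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables no more probable than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def site_association(sequences, severe, site):
    """Cross-tabulate minor/major allele at one alignment column against severity."""
    alleles = [seq[site] for seq in sequences]
    major = Counter(alleles).most_common(1)[0][0]
    pairs = list(zip(alleles, severe))
    a = sum(1 for x, y in pairs if x != major and y)       # minor allele, severe
    b = sum(1 for x, y in pairs if x != major and not y)   # minor allele, mild
    c = sum(1 for x, y in pairs if x == major and y)       # major allele, severe
    d = sum(1 for x, y in pairs if x == major and not y)   # major allele, mild
    return (a, b, c, d), fisher_exact_two_sided(a, b, c, d)

# Hypothetical toy data: seven aligned sequences, one outcome flag per case.
sequences = ["ACGT"] * 4 + ["ACTT"] * 3
severe = [False] * 4 + [True] * 3

table, p_value = site_association(sequences, severe, site=2)
print(table, p_value)
```

In practice this would be run over every variable column with a multiple-testing correction, and it is exactly this kind of test that the complications below undermine.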
Complications:
I'm concerned about phylogenetic inertia, i.e. spurious correlations caused by shared ancestry. A given sequence change may correlate with increased fatality merely because it happened to be fixed in the lineage that infected a weakened group of hosts.
Some characteristics that are technically non-heritable, e.g. location, will behave as if they were heritable.
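The inertia problem can be illustrated with a small numerical sketch. The counts below are invented, the lineage labels are assumed to be known, and stratifying by lineage with a Mantel-Haenszel odds ratio is only a coarse stand-in for a proper phylogenetic correction, not a method taken from this post.

```python
def crude_odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]] (no zero cells assumed)."""
    return (a * d) / (b * c)

def mantel_haenszel_or(tables):
    """Mantel-Haenszel odds ratio pooled across strata, one 2x2 table per lineage."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

# Hypothetical counts per lineage, ordered as
# (variant & severe, variant & mild, reference & severe, reference & mild).
lineage_a = (40, 10, 8, 2)    # lineage that infected weakened hosts: mostly severe
lineage_b = (2, 8, 10, 40)    # lineage in healthier hosts: mostly mild

# Ignoring lineage: add the tables cell by cell.
pooled = tuple(x + y for x, y in zip(lineage_a, lineage_b))

pooled_or = crude_odds_ratio(*pooled)
stratified_or = mantel_haenszel_or([lineage_a, lineage_b])
print(pooled_or, stratified_or)
```

Within each lineage the variant makes no difference to severity, yet the pooled table suggests a strong association, because the variant is concentrated in the lineage with the worse outcomes.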
Solutions I've considered:
Tools from GWAS or similar: apart from the possible overkill of using these on such a short genome, I don't know of any GWAS tools that deal with the inertia problem.
Comparative analysis with independent contrasts: would be the obvious choice if I were dealing solely with character data. I could hack a suitable dataset together, say by treating each SNP locus as a character, but it seems ugly. Also, the state of useful software here is not good.
Selection: will tell me which sites are under selection, but not what might be correlated with that selection.
Comparing against controls: is something I've done before, but in this case it seems that deciding what to control for amounts to deciding in advance what won't correlate.
Exactly what kind of epidemiological data do you have, e.g. is it already aggregated by viral sequence, or do you have individual case data at your disposal?
Individual case data, dates, outcomes, the whole paella.