Question

Phylogeny from mixed GBS and WGS

17 months ago

maxrwjones ▴ 60

Hi all,

I'm a PhD student with some experience working with transcriptomic and epigenomic data, but I'm new to phylogeny reconstruction. I've been tasked with making a phylogeny for a set of accessions of a crop species, but from what I understand the data types may not be compatible.

I have 233 accessions with ~14x whole-genome short-read sequencing data and, from a collaborator, 297 accessions with GBS data. More details:

The short-read data is ~60M Illumina 150 bp paired-end (PE) reads per accession.
The GBS data comes from a ddRAD-based library preparation using the enzymes PstI and NlaII. The digested DNA was adaptor-ligated and then PCR amplified.
The size of the reference genome is approximately 600 Mb.

I can easily derive a Variant Call Format (VCF) file for both data types, and I think these can be merged using "bcftools merge" - though of course the GBS accessions will then have missing data for the majority of SNPs detected from the WGS data.
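To make the missing-data point concrete, here is a minimal sketch of one way to keep only sites that are well genotyped on both platforms after merging. It assumes the merged genotypes have already been parsed into alt-allele counts (0/1/2, with None for a missing call); the function names, the toy site IDs and the 0.6 call-rate threshold are hypothetical and not taken from bcftools or any other tool.

```python
from typing import Dict, List, Optional

# Alt-allele count per sample: 0, 1 or 2; None stands for a missing call (./.)
Genotype = Optional[int]


def call_rate(genotypes: List[Genotype]) -> float:
    """Fraction of samples with a non-missing genotype."""
    if not genotypes:
        return 0.0
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)


def sites_called_in_both(
    sites: Dict[str, List[Genotype]],
    wgs_idx: List[int],
    gbs_idx: List[int],
    min_rate: float = 0.8,
) -> List[str]:
    """Return IDs of sites whose call rate is >= min_rate in BOTH sample groups."""
    kept = []
    for site_id, gts in sites.items():
        wgs_rate = call_rate([gts[i] for i in wgs_idx])
        gbs_rate = call_rate([gts[i] for i in gbs_idx])
        if wgs_rate >= min_rate and gbs_rate >= min_rate:
            kept.append(site_id)
    return kept


# Toy merged matrix: columns 0-2 are WGS samples, columns 3-5 are GBS samples.
sites = {
    "chr1:1050": [0, 1, 2, 0, None, 1],
    "chr1:2200": [0, 0, 1, None, None, None],  # WGS-only site
    "chr2:310": [2, None, 2, 1, 1, 0],
}
kept = sites_called_in_both(sites, wgs_idx=[0, 1, 2], gbs_idx=[3, 4, 5], min_rate=0.6)
print(kept)
```

Filtering on each group separately, rather than on the overall call rate, stops a site that is present in every WGS sample but absent from every GBS sample from passing the filter.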

However, I have been advised by another student that this merged VCF may not yield a valid phylogeny (after filtering and conversion to e.g. PHYLIP) because the two genotyping methods have inherent detection biases that could distort the results. An extreme outcome could be that the two methods produce - falsely - entirely separate clades.

Is this something I should be worried about? If so, is there any set of computational corrections that can be applied to account for the biases of these different methods?

Many thanks, Max

VCF Phylogeny GBS WGS tree

score 1 · Answer 1 · 2024-02-19

What is the purpose of this phylogeny? Do you need to use all variants detected across all samples? Building a phylogeny for 530 samples from WGS/GBS data will have enormous computational overheads, even for neighbor-joining trees.

I agree with your colleague though. GBS is a reduced-representation approach, so it samples only the fraction of the genome near restriction sites and differs substantially from WGS in which variants it can detect. Mixing the two approaches is usually not a good idea without very careful consideration.

For the phylogeny, I would look at the distribution of coverage and find a handful of suitable loci with similar depth across both WGS and GBS samples. Ideally these will overlap with loci commonly used to differentiate populations in your model system. Then you can build gene trees, or trees from the variants in these equal-depth regions. You could even pool a few of these equal-depth loci for more accuracy, at the cost of increased computational overheads.
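As a rough sketch of the locus-selection idea above, assuming you already have a mean depth per locus for each platform (however you choose to compute it), something like the following could shortlist candidates. The function name, the locus labels, and the `min_depth` and `max_ratio` thresholds are all hypothetical and would need tuning for real data.

```python
from typing import Dict, List


def similar_depth_loci(
    wgs_depth: Dict[str, float],
    gbs_depth: Dict[str, float],
    min_depth: float = 8.0,
    max_ratio: float = 2.0,
) -> List[str]:
    """Loci covered on both platforms whose mean depths differ by at most max_ratio-fold."""
    kept = []
    # Only loci with a depth estimate on both platforms are considered.
    for locus in sorted(wgs_depth.keys() & gbs_depth.keys()):
        low, high = sorted((wgs_depth[locus], gbs_depth[locus]))
        if low <= 0 or low < min_depth:
            continue  # too shallow on at least one platform
        if high / low <= max_ratio:
            kept.append(locus)
    return kept


# Toy mean depths per locus (made-up numbers).
wgs_depth = {"locus_a": 14.0, "locus_b": 13.0, "locus_c": 15.0}
gbs_depth = {"locus_a": 20.0, "locus_b": 95.0, "locus_d": 30.0}
candidates = similar_depth_loci(wgs_depth, gbs_depth)
print(candidates)
```

Here `locus_b` is dropped because its GBS depth is roughly seven times its WGS depth, and `locus_c` and `locus_d` are dropped because each is covered on only one platform.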

You could also visualise your candidate variant groups in a PCA before generating a phylogeny to see if samples are generally clustering in expected ways. But this step comes with potential inferential biases if you have captured something unexpected and can't explain the distribution based on your expectations.
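This is not a PCA, but a cruder check in the same spirit: compare the mean pairwise genetic distance between samples from the same platform with the mean distance between samples from different platforms. If cross-platform pairs are consistently much more distant than same-platform pairs, that may point to a platform effect rather than biology. The sketch below assumes genotypes coded as alt-allele counts (0/1/2, None for missing); the function names and toy data are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple

Genotype = Optional[int]  # alt-allele count 0/1/2, None = missing


def p_distance(a: List[Genotype], b: List[Genotype]) -> Optional[float]:
    """Mean allele-count difference over sites called in both samples, scaled to 0..1."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None  # no sites in common, distance undefined
    return sum(abs(x - y) for x, y in shared) / (2 * len(shared))


def mean_or_none(values: List[float]) -> Optional[float]:
    return sum(values) / len(values) if values else None


def within_between(
    genos: Dict[str, List[Genotype]], platform: Dict[str, str]
) -> Tuple[Optional[float], Optional[float]]:
    """Mean distance for same-platform pairs and for cross-platform pairs."""
    within, between = [], []
    for s1, s2 in combinations(sorted(genos), 2):
        d = p_distance(genos[s1], genos[s2])
        if d is None:
            continue
        (within if platform[s1] == platform[s2] else between).append(d)
    return mean_or_none(within), mean_or_none(between)


# Toy data: two WGS and two GBS samples genotyped at four sites.
genos = {
    "wgs_1": [0, 0, 2, 2],
    "wgs_2": [2, 2, 0, 0],
    "gbs_1": [0, 0, 2, None],
    "gbs_2": [2, None, 0, 0],
}
platform = {"wgs_1": "WGS", "wgs_2": "WGS", "gbs_1": "GBS", "gbs_2": "GBS"}
within, between = within_between(genos, platform)
print(within, between)
```

In this toy example the cross-platform mean is no larger than the same-platform mean, so nothing suggests the samples are grouping by platform. With real data the comparison only makes sense if the two sets of accessions are expected to be similarly diverse.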