Hi all,
I'm a PhD student with some experience working with transcriptomic and epigenomic data, but I'm new to phylogeny reconstruction. I've been tasked with making a phylogeny for a set of accessions of a crop species, but from what I understand the data types may not be compatible.
I have 233 accessions with ~14x whole-genome short-read sequencing data and, from a collaborator, 297 accessions with GBS data. More details:
- The short-read data comprise ~60M Illumina 150 bp paired-end reads per accession.
- The GBS data comes from a ddRAD-based library preparation using the enzymes PstI and NlaII. The digested DNA was adaptor-ligated and then PCR amplified.
- The size of the reference genome is approximately 600 Mb.
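As a back-of-the-envelope check on the depth figure, here is a minimal sketch using the approximate numbers above. It assumes the ~60M figure counts individual reads rather than read pairs, and the variable names are just for illustration.

```python
# Rough raw-depth estimate from the approximate figures quoted in this post.
reads_per_accession = 60_000_000  # assumption: individual reads, not pairs
read_length_bp = 150
genome_size_bp = 600_000_000      # ~600 Mb reference

# Depth = total sequenced bases / genome size.
raw_depth = reads_per_accession * read_length_bp / genome_size_bp
print(f"Expected raw depth: ~{raw_depth:.0f}x")
```

The small gap between this and the ~14x quoted above could plausibly come from read filtering or mapping losses, though that is a guess on my part.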
I can easily derive a Variant Call Format (VCF) file for each data type, and I think these can be merged using "bcftools merge", though of course most SNPs detected in the WGS data will be missing from the GBS accessions.
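To make the missing-data issue concrete, here is a minimal sketch of how I imagine restricting a merged call set to sites that are well genotyped in both datasets. It works on toy data in plain Python rather than a real VCF; the function names, the site IDs, and the 0.6 call-rate threshold are all hypothetical choices for illustration, not a recommendation.

```python
def call_rate(genotypes):
    """Fraction of non-missing calls (None marks a missing genotype)."""
    if not genotypes:
        return 0.0
    return sum(g is not None for g in genotypes) / len(genotypes)


def shared_sites(sites, platforms, min_call_rate=0.8):
    """Return IDs of sites whose call rate meets the threshold in every platform.

    sites: dict mapping site ID -> list of genotypes, one per sample
    platforms: list of platform labels, one per sample, in the same order
    """
    labels = sorted(set(platforms))
    kept = []
    for site_id, genotypes in sites.items():
        rates = []
        for label in labels:
            # Collect only the calls from samples genotyped on this platform.
            calls = [g for g, p in zip(genotypes, platforms) if p == label]
            rates.append(call_rate(calls))
        if all(r >= min_call_rate for r in rates):
            kept.append(site_id)
    return kept


# Toy data: three WGS and three GBS samples, genotypes as alt-allele counts.
platforms = ["WGS", "WGS", "WGS", "GBS", "GBS", "GBS"]
sites = {
    "chr1:1000": [0, 1, 2, 0, 1, 1],           # called in every sample
    "chr1:2000": [0, 0, 1, None, None, None],  # WGS-only site
    "chr1:3000": [2, 2, None, 1, None, None],  # patchy in GBS
}
print(shared_sites(sites, platforms, min_call_rate=0.6))
```

With the toy data only "chr1:1000" survives, which is the pattern I would expect at scale: the usable intersection is bounded by the GBS loci.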
However, I have been advised by another student that this merged VCF may not yield a valid phylogeny (after filtering and conversion to e.g. PHYLIP) because the two genotyping methods have inherent detection biases that could distort the results. An extreme outcome could be that the two methods produce - falsely - entirely separate clades.
Is this something I should be worried about? If so, is there any set of computational corrections that can be applied to account for the biases of these different methods?
Many thanks, Max
Hi dthorbur,
Thanks so much for your reply! The purpose of the phylogeny is a little convoluted, but it is not for a study with a deep evolutionary focus. Essentially, the population with GBS data is a USDA gene bank collection that is readily accessible to researchers worldwide, while the WGS data represents a collection that is not accessible outside its host nation. I wish to find a set of USDA accessions that represent the major clades found within the WGS accessions. We intend to construct a pangenome resource based on these target accessions.
Your idea of selecting a group of loci with similar depth distributions sounds good, and it is definitely something I will try. What order of magnitude would you suggest for a 'handful of loci'? A hundred per chromosome? A thousand? Unfortunately, my study organism doesn't have established sets of loci used for population analyses, so I don't think I can follow up on that angle.
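To check I have understood the depth-matching idea, here is a minimal sketch of how I would approach it. This is my own reading of the suggestion, not necessarily what you had in mind: each locus's mean depth is scaled by the platform-wide median, and a locus is kept only if it looks unremarkable in both datasets. The function names, the toy depths, and the 0.5 to 2.0 window are all hypothetical.

```python
from statistics import mean, median


def normalised_depth(depths_by_locus):
    """Scale each locus's mean depth by the median of all locus means."""
    means = {locus: mean(d) for locus, d in depths_by_locus.items()}
    centre = median(means.values())
    if centre == 0:
        raise ValueError("median locus depth is zero; cannot normalise")
    return {locus: m / centre for locus, m in means.items()}


def depth_matched_loci(wgs_depths, gbs_depths, low=0.5, high=2.0):
    """Keep loci whose normalised depth lies within [low, high] on both platforms."""
    wgs = normalised_depth(wgs_depths)
    gbs = normalised_depth(gbs_depths)
    shared = sorted(set(wgs) & set(gbs))
    return [
        locus for locus in shared
        if low <= wgs[locus] <= high and low <= gbs[locus] <= high
    ]


# Toy per-sample depths at four loci (made-up numbers).
wgs_depths = {
    "locA": [14, 15, 13],
    "locB": [12, 16, 14],
    "locC": [40, 45, 50],  # unusually deep in WGS, e.g. a collapsed repeat
    "locD": [15, 14, 13],
}
gbs_depths = {
    "locA": [30, 28, 32],
    "locB": [3, 2, 4],     # poorly covered in GBS
    "locC": [29, 31, 30],
    "locD": [33, 27, 30],
}
print(depth_matched_loci(wgs_depths, gbs_depths))
```

In the toy example only locA and locD pass. Is this roughly the kind of filter you meant, or would you compare the full depth distributions rather than just the means?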
If you have any further advice I'd really appreciate it!
Thanks again, Max