I need to do a phylogenetic analysis of 300 SARS-CoV-2 samples, but it is proving challenging because of the enormous GISAID dataset (> 500k genomes). I removed sequences collected outside the time window of my samples, which reduced the dataset to ~350k genomes. Even so, the Nextstrain pipeline and genome-sampler (https://caporasolab.us/genome-sampler/intro.html) both crash, since I only have 64 GB of RAM available.
Given that, I am willing to adapt my analysis to my computational resources. GISAID provides metadata for all genomes, and I am thinking of subsampling the GISAID dataset by date, country and pangolin_lineage.
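To make the idea concrete, here is a minimal sketch of the kind of stratified subsampling I have in mind. It assumes the metadata has already been read into a list of dicts (for example with `csv.DictReader`), that the columns are named `date`, `country` and `pangolin_lineage`, and that dates are formatted as `YYYY-MM-DD`. Grouping by calendar month and the `per_stratum` cap are my own placeholder choices, not recommended values.

```python
import random
from collections import defaultdict


def subsample(rows, per_stratum, seed=0):
    """Keep at most `per_stratum` random rows per (month, country, lineage).

    `rows` is a list of dicts with the assumed keys `date` (YYYY-MM-DD),
    `country` and `pangolin_lineage`.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible

    # Group rows into strata keyed by (YYYY-MM, country, lineage).
    strata = defaultdict(list)
    for row in rows:
        key = (row["date"][:7], row["country"], row["pangolin_lineage"])
        strata[key].append(row)

    # Keep small strata whole; draw a random sample from large ones.
    kept = []
    for key in sorted(strata):  # sorted so the result does not depend on input order of strata
        group = strata[key]
        if len(group) <= per_stratum:
            kept.extend(group)
        else:
            kept.extend(rng.sample(group, per_stratum))
    return kept
```

Only the metadata needs to be in memory for this step; the sequence IDs of the kept rows could then be used to pull the matching genomes out of the FASTA file.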
My main problem is that I do not know how many genomes I should select for a computer with 64 GB of RAM. Another question is how many genomes per stratum I need in order to achieve significant results.
Could you please give me some ideas? Thank you very much.