Hello, I'm looking to do a large-scale phylogenetic analysis. I plan to build a PCA plot with 60,000+ DNA sequences. I'd be doing a beta-diversity analysis in which one sample consists of 60,000 sequences while the other three have fewer than 200 sequences each. I want every individual sequence to appear as its own point in the plot, rather than having one datapoint per sample.
I've been looking at Parallel-Meta and QIIME. Does anyone have any other suggestions? I'd be running it on a machine with 16 GB of RAM and 8 threads.
Thanks, Peter
How long are the individual sequences? If the sequences are redundant, there is no point in using all of them as is.
The length would be ~1,000 bp. I was planning to remove redundant sequences to reduce the sample size, but my guess is the dataset would still be 10,000-20,000 sequences.
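For what it's worth, here is a rough Python sketch of the two things discussed above: collapsing exact duplicate sequences, and checking whether an all-against-all distance matrix would fit in 16 GB. The function names `dereplicate` and `distance_matrix_gib` are made up for illustration and do not come from Parallel-Meta or QIIME. It assumes "redundant" means exact duplicates (ignoring case) and that the distances are held as a dense square matrix of 8-byte floats; real tools may cluster near-identical sequences or store the matrix differently.

```python
def dereplicate(records):
    """Collapse exact duplicates from an iterable of (seq_id, sequence).

    Returns a list of (representative_id, sequence, count), in order of
    first appearance. Comparison is case-insensitive (an assumption).
    """
    unique = {}  # uppercased sequence -> [first id seen, copy count]
    for seq_id, seq in records:
        key = seq.upper()
        if key in unique:
            unique[key][1] += 1
        else:
            unique[key] = [seq_id, 1]
    return [(rep_id, seq, count) for seq, (rep_id, count) in unique.items()]


def distance_matrix_gib(n_sequences, bytes_per_value=8):
    """Memory for a dense n x n distance matrix, in GiB (assumes float64)."""
    return n_sequences * n_sequences * bytes_per_value / 2**30


# Toy input, not real data.
records = [("read1", "ACGT"), ("read2", "acgt"), ("read3", "ACGA")]
for rep_id, seq, count in dereplicate(records):
    print(rep_id, seq, count)

# Matrix size before and after the hoped-for reduction.
for n in (60_000, 20_000, 10_000):
    print(f"{n} sequences: {distance_matrix_gib(n):.1f} GiB")
```

Under those assumptions, a full matrix for 60,000 sequences would need more than 16 GB, while 10,000-20,000 sequences would fit, so dereplicating first looks worthwhile.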