I have a large alignment of more than two thousand sequences, each longer than 10 kb (the sequences are viral genomes with very high identity, above 99.9%).
Running a phylogenetic analysis (even an NJ tree with 1000 bootstrap replicates) on a desktop computer does not seem feasible.
I need to reduce the alignment to make the analysis tractable, bearing in mind my computational resources (an i7 with 8 threads and 32 GB of RAM).
Since there is some redundancy in the alignment, I could remove highly similar sequences.
Could you suggest a strategy to pare down the alignment while maintaining the diversity of the sequences?
PS: My initial attempt was to cluster the sequences with CD-HIT at different identity levels. I then plotted the cut-off values against the number of clusters formed; at a certain cut-off the number of clusters reaches a plateau, and I used that plateau as the criterion for keeping one representative sequence per cluster. However, this approach did not reduce the number of sequences enough: I still had too much data for the phylogenetic analysis.
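For reference, this is roughly how I swept the cut-offs (a minimal sketch in Python; the file names and threshold list are placeholders, and `cd-hit-est` is assumed to be on PATH):

```python
import subprocess

def count_clusters(clstr_path):
    """Count clusters in a CD-HIT .clstr file (one '>Cluster' header per cluster)."""
    with open(clstr_path) as fh:
        return sum(1 for line in fh if line.startswith(">Cluster"))

# genomes.fasta: the sequences with alignment gaps stripped,
# since CD-HIT expects unaligned FASTA input.
for c in [0.999, 0.9995, 0.9999]:  # identity cut-offs to test
    out = f"reps_{c}"
    subprocess.run(
        ["cd-hit-est", "-i", "genomes.fasta", "-o", out,
         "-c", str(c),      # sequence identity threshold
         "-n", "10",        # word size recommended for thresholds >= 0.95
         "-T", "8",         # threads
         "-M", "16000"],    # memory limit in MB
        check=True,
    )
    # cut-off vs. number of groups, the data behind my plateau plot
    print(c, count_clusters(out + ".clstr"))
```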