I have ~1.2 million 454 reads and I want to cluster them according to their DNA sequence (e.g. grouping reads that share at least 90% identity over at least 70% of their length)... I know that, at least for smaller datasets (a few thousand sequences), blastclust works well.
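(For reference, this is a minimal sketch of how that 90% identity / 70% coverage criterion maps onto blastclust's options; the file names and CPU count are placeholders:)

```python
import subprocess

# Minimal sketch: cluster nucleotide reads at >= 90% identity over >= 70% of
# their length with NCBI blastclust. File names and the CPU count are
# hypothetical placeholders.
subprocess.run([
    "blastclust",
    "-i", "reads_454.fasta",   # input reads in FASTA format (placeholder name)
    "-o", "clusters.txt",      # output: one cluster (list of read IDs) per line
    "-p", "F",                 # F = nucleotide sequences
    "-S", "90",                # similarity threshold (percent identity for nucleotides)
    "-L", "0.7",               # minimum length coverage of a pairwise alignment
    "-a", "4",                 # number of CPUs to use (placeholder)
], check=True)
```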
What happens, though, when you have hundreds of thousands or even millions of sequences? What program(s) do you use?
I tried blastclust, but it has been running for more than 4 days without printing any progress message, so I have no idea how long it will take...
I also tried what the authors of the CANGS pipeline suggest, but mafft-distance produces a far too large distance matrix (for ~50,000 sequences it had already reached ~240 GB)! Even if this is normal, I don't have that much free hard drive space to store the file!
I think you can't do that with a full distance matrix, because its space requirement grows quadratically with the number of objects: if I didn't miscalculate (assuming a 32-bit float per entry and storing only one triangle of the matrix), you would need (32 bit * (1.2e6)^2 / 2) / (8 bit/byte * 1024^3) ≈ 2682 GB. You certainly cannot hold that in memory, so you would need an approach that avoids computing the whole distance matrix ahead of the clustering.
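To make that back-of-the-envelope calculation explicit, here is a small sketch of the arithmetic (assuming a 4-byte float per pair and only one triangle of the matrix being stored):

```python
# Rough storage requirement of a pairwise distance matrix for 1.2 million
# sequences, assuming one 32-bit (4-byte) float per pair and storing only
# one triangle of the matrix (the diagonal is ignored).
n = 1_200_000                      # number of sequences
bytes_per_entry = 4                # 32-bit float
entries = n * n / 2                # one triangle of the n x n matrix
total_bytes = entries * bytes_per_entry
print(f"{total_bytes / 1024**3:.1f} GB")   # ~2682.2 GB (1 GB = 1024^3 bytes here)
```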