I have about 10,000 individual DNA sequences in FASTA format, both as a single large multi-FASTA file and as separate files per sequence, so I can be flexible with the input format. I do want to keep working with FASTA input, though, and am not looking to compromise there.
I am looking for a way to display these sequences graphically as a cluster plot. I am generally unfamiliar with MATLAB and R, but familiar enough to know that they shine at graphical output and tend to rely on numerical input (like .csv files). I can't figure out how to use the R function hclust, for example, with my FASTA file(s). This might just be because I don't know R very well.
A tree would be fine too, but it is extremely computationally heavy to align these sequences before putting them into a tree-building tool like RAxML. I tried to align all of these sequences with MAFFT on a remote server and the job timed out. In addition, I think a cluster plot is more visually appealing than a tree. However, if all I can get is a tree, that is better than what I have now!
Simply stated, the goal is to quickly see how many rough clusters of DNA sequences I have. Thanks for any help you can provide. -Rob
See also: Hierarchical Clustering
And the MUSCLE manual on large alignments: http://www.drive5.com/muscle/manual/bigalignments.html
You would need a distance matrix of some kind, but as you found out, a multiple alignment of 10,000 sequences is computationally heavy. CD-HIT might be your best bet, since it clusters sequences without a full multiple alignment. I would forget about plotting a dendrogram for now; in my opinion, dendrograms with that many leaves are not very useful.
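To illustrate how you can get rough cluster counts without any alignment, here is a minimal Python sketch (standard library only). It is not CD-HIT's actual algorithm, only a loosely similar greedy idea: each sequence is reduced to a k-mer count profile, and it joins the first cluster whose representative is within a dissimilarity threshold. The function names, `k=4`, the Bray-Curtis measure, and `threshold=0.3` are all assumptions chosen for illustration, and you would need to tune them for your data.

```python
from collections import Counter

DNA = set("ACGT")

def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)

def kmer_profile(seq, k=4):
    """Count overlapping k-mers, skipping k-mers with non-ACGT symbols."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= DNA:
            counts[kmer] += 1
    return counts

def bray_curtis(p, q):
    """Bray-Curtis dissimilarity of two count profiles (0 = identical)."""
    total = sum(p.values()) + sum(q.values())
    if total == 0:
        return 0.0
    shared = sum(min(p[kmer], q[kmer]) for kmer in p.keys() & q.keys())
    return 1.0 - 2.0 * shared / total

def greedy_cluster(records, k=4, threshold=0.3):
    """Put each sequence in the first cluster whose representative is
    within `threshold`; otherwise start a new cluster with it."""
    clusters = []  # list of (representative profile, member names)
    for name, seq in records:
        profile = kmer_profile(seq, k)
        for rep_profile, members in clusters:
            if bray_curtis(profile, rep_profile) <= threshold:
                members.append(name)
                break
        else:
            clusters.append((profile, [name]))
    return [members for _, members in clusters]

# Tiny in-memory example; a real run would pass an open file handle instead.
demo = [">a", "ACGTACGTACGTACGTACGT",
        ">b", "ACGTACGTACGTACGTACGA",
        ">c", "GGGGGCCCCCGGGGGCCCCC"]
clusters = greedy_cluster(read_fasta(demo))
print(len(clusters), "clusters:", clusters)
```

For a real multi-FASTA file you would open it and pass the handle to `read_fasta` in place of `demo`. The result depends on input order and on the threshold, and the worst case compares every sequence against every cluster representative, so treat the count as a rough first look and not as a substitute for CD-HIT or a proper distance-based method.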
What kind of sequences are they? 16S?