5.4 years ago
Annie
I want to cluster (k-means) and remove redundancy from my FASTA sequences, which are mainly GSS, EST and assembled transcripts, to create a reference set for my short query sequences. My short query sequences can target either DNA or RNA, so I need some expert guidance. Also, should I convert lower-case bases to upper case before doing this? Any suggestions would be highly appreciated.
You should also look at CD-HIT which is specifically tailored for this type of application and has specific subprograms.
Thanks for your answer, genomax, but I have found uclust to be better than CD-HIT.
You might look at dedupe.sh from BBTools: https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/dedupe-guide/
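On the upper/lower case question, here is a minimal Python sketch of what case normalization plus exact-duplicate removal could look like. It is only an illustration, not how CD-HIT, uclust or dedupe.sh work internally: it removes sequences that are identical after upper-casing and does nothing about near-duplicates, contained sequences or reverse complements. The function names (`read_fasta`, `dedupe_exact`) and the example records are made up for this sketch. Note that lower case in FASTA is often used for soft-masking, so upper-casing throws that information away.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedupe_exact(records):
    """Upper-case each sequence and keep the first record of every distinct one."""
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper()  # 'acgt' and 'ACGT' are treated as the same sequence
        if key not in seen:
            seen.add(key)
            unique.append((header, key))
    return unique


# Hypothetical toy input: est1 and gss1 differ only in case.
example = """>est1
acgtACGT
>gss1
ACGTACGT
>tx1
ACGTTTGA
""".splitlines()

unique_records = dedupe_exact(read_fasta(example))
for header, seq in unique_records:
    print(f">{header}\n{seq}")
```

For a real file you would pass an open file handle to `read_fasta` instead of the toy list, and then hand the result to a proper clustering tool for the near-identical sequences.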