I want to cluster a set of human ncRNAs into families based on sequence similarity. How should I do it, and which software would be suitable? Thanks very much!
You might want to map your ncRNAs to the genome and then cluster based on overlap, using something like blockbuster.
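To illustrate the overlap idea, here is a minimal Python sketch that groups mapped ncRNAs whose genomic intervals overlap on the same chromosome and strand. It is not blockbuster's algorithm, just a plain interval-merging pass; the tuple layout `(name, chrom, start, end, strand)` and the half-open, BED-style coordinates are assumptions made for the example.

```python
from typing import List, Tuple

# (name, chrom, start, end, strand); half-open coordinates, as in BED (assumed)
Interval = Tuple[str, str, int, int, str]


def cluster_by_overlap(intervals: List[Interval]) -> List[List[str]]:
    """Group intervals that overlap on the same chromosome and strand."""
    clusters: List[List[str]] = []
    current: List[str] = []
    current_key = None
    current_end = -1

    # Sort so that overlapping intervals end up next to each other.
    ordered = sorted(intervals, key=lambda iv: (iv[1], iv[4], iv[2], iv[3]))
    for name, chrom, start, end, strand in ordered:
        key = (chrom, strand)
        if key == current_key and start < current_end:
            # Overlaps the running cluster: extend it.
            current.append(name)
            current_end = max(current_end, end)
        else:
            # Start a new cluster.
            if current:
                clusters.append(current)
            current = [name]
            current_key = key
            current_end = end
    if current:
        clusters.append(current)
    return clusters


mapped = [
    ("r1", "chr1", 100, 200, "+"),
    ("r2", "chr1", 150, 250, "+"),
    ("r3", "chr1", 300, 400, "+"),
    ("r4", "chr1", 160, 220, "-"),
]
print(cluster_by_overlap(mapped))
```

Note that this groups by genomic locus, not by sequence family, so paralogous copies at different loci end up in separate clusters.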
I'd build HMMs/CMs from alignments of the RNAs. Then iteratively add similar sequences to the largest clusters until you're done.
E.g. run hmmbuild -> hmmsearch -> hmmalign -> (predict a secondary structure with e.g. RNAalifold) -> make a Stockholm alignment with secondary structure annotation -> cmbuild -> cmsearch -> cmalign. Repeat the last three stages until convergence. Then repeat the whole procedure with the sequences that are still unaligned.
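The "repeat until convergence" part can be sketched as a small loop in Python. This is only an outline under assumptions: `search` is a hypothetical callable that, given the current family members, returns every sequence ID the model matches (in practice it would wrap one cmbuild -> cmsearch -> cmalign round), and `max_rounds` is a safety cap I added. The toy `neighbors` table below just stands in for real search hits.

```python
from typing import Callable, Dict, Iterable, Set


def grow_family(
    seed: Iterable[str],
    search: Callable[[Set[str]], Set[str]],
    max_rounds: int = 20,
) -> Set[str]:
    """Add search hits to a family until a round finds nothing new."""
    members = set(seed)
    for _ in range(max_rounds):
        new_hits = search(members) - members
        if not new_hits:
            break  # converged: the model recovers no additional sequences
        members |= new_hits
    return members


# Toy stand-in for a real search: each sequence "hits" its listed neighbours.
neighbors: Dict[str, Set[str]] = {
    "a": {"b"},
    "b": {"c"},
    "c": set(),
    "d": {"e"},
}


def toy_search(members: Set[str]) -> Set[str]:
    hits: Set[str] = set()
    for member in members:
        hits |= neighbors.get(member, set())
    return hits


family = grow_family({"a"}, toy_search)
print(sorted(family))
```

Sequences that never join a family (here "d" and "e") are the ones you would feed into the next pass as new seeds.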
In general, the answer to "how do I cluster sequences?" is CD-HIT.
In this case, specifically CD-HIT-EST. From the applications page:
CD-HIT-EST has been used in clustering many types of sequences such as Expressed Sequence Tags (ESTs), MicroRNAs (miRNAs) (RNA, 2007 13:170-187), oligonucleotide probes (Bioinformatics, 2007 23:1195), 16S rRNA sequences (Nature, 2009, 457:480).
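If you go this route, a run such as `cd-hit-est -i ncrna.fa -o ncrna_c90 -c 0.9 -n 8` (file names and the 90% threshold are placeholders, not a recommendation) writes a `.clstr` file next to the output FASTA. Below is a minimal Python sketch for turning that file into a dictionary of clusters. It assumes the usual `.clstr` layout, where each cluster starts with a `>Cluster N` line and each member line contains `>name...`; the sample text is made up for illustration.

```python
import re
from typing import Dict, Iterable, List

# Member lines look like: "0\t120nt, >seq1... *"
MEMBER_NAME = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each cluster label to the sequence names it contains."""
    clusters: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 0"
            clusters[current] = []
        elif current is not None:
            match = MEMBER_NAME.search(line)
            if match:
                clusters[current].append(match.group(1))
    return clusters


# Made-up example in the .clstr layout (assumed, not real output).
sample = """>Cluster 0
0\t120nt, >seq1... *
1\t118nt, >seq2... at +/98.31%
>Cluster 1
0\t75nt, >seq3... *
"""

result = parse_clstr(sample.splitlines())
print(result)
```

For a real file you would pass the open file handle instead, e.g. `parse_clstr(open("ncrna_c90.clstr"))`.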
Thanks! Actually, what I meant was that I want to cluster the RNAs into families. Sorry for not being clear.