My solution:
Main file: all.fasta
Unique sequence file, generated from main file: unique.fasta
gt sequniq -o unique.fasta all.fasta
R function to find identical sequences:
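funiq <- function(all.fa, uniq.fa)
{
  # Flatten to a named character vector: names are sequence IDs, values are sequences
  all <- unlist(all.fa)
  # One result slot per unique sequence, named by its ID
  sequ <- vector("list", length(uniq.fa))
  names(sequ) <- names(uniq.fa)
  # seq_along() is safe even if uniq.fa is empty (1:length() would give c(1, 0))
  for (i in seq_along(uniq.fa))
  {
    # IDs of every sequence in the main file that is identical to unique sequence i
    sequ[[i]] <- names(all)[all %in% uniq.fa[[i]][1]]
  }
  return(sequ)
}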
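The grouping done by the loop above can also be sketched without an explicit loop and without the second (unique) file, using base R's split(). This is only an illustration under assumptions: group_identical is a hypothetical helper name, the toy IDs and sequences are made up, and the input is assumed to be a named character vector such as the one unlist() produces inside funiq.

```r
# Hypothetical helper (not part of seqinr): group sequence IDs by identical sequence.
# `seqs` is assumed to be a named character vector: names = IDs, values = sequences.
group_identical <- function(seqs) {
  # split() collects the IDs that share exactly the same sequence string
  groups <- split(names(seqs), seqs)
  # Name each group after its first member instead of the sequence itself
  names(groups) <- vapply(groups, function(g) g[1], character(1), USE.NAMES = FALSE)
  groups
}

# Toy data, made up for illustration only
toy <- c(seqA = "MKV", seqB = "MKV", seqC = "MLT", seqD = "MKV")
groups <- group_identical(toy)
print(groups)
```

Note that split() orders the groups by sorted sequence string, not by the order of the FASTA file, so the group order may differ from what funiq returns.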
Read the FASTA files in R:
library(seqinr)
all<-read.fasta("all.fasta",as.string = TRUE,seqtype="AA")
uniq<-read.fasta("unique.fasta",as.string = TRUE,seqtype="AA")
Names of identical sequences:
nam<- funiq(all,uniq)
Result:
$ADC37925
[1] "ADC37925" "EYO75956" "EVE92773"
$AFR73793
[1] "AFR73793" "EPZ11191" "EPZ09632" "EQM92926" "EOR47863" "CAQ50233"
$EFB95474
[1] "EFB95474"
Frequency count:
fcount<-lapply(nam,length)
Result:
$ADC37925
[1] 3
$AFR73793
[1] 6
$EFB95474
[1] 1
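If a plain vector is easier to work with than a list, the counts can be sketched with lengths() and sorted by group size. The nam list below is rebuilt from the result printed earlier; fcount.vec and fcount.df are illustrative names, not part of the original code.

```r
# `nam` rebuilt from the result shown above
nam <- list(
  ADC37925 = c("ADC37925", "EYO75956", "EVE92773"),
  AFR73793 = c("AFR73793", "EPZ11191", "EPZ09632", "EQM92926", "EOR47863", "CAQ50233"),
  EFB95474 = "EFB95474"
)

# lengths() returns one integer per list element as a named vector (base R >= 3.2.0)
fcount.vec <- lengths(nam)

# Largest groups first
fcount.vec <- sort(fcount.vec, decreasing = TRUE)

# Two-column table, convenient for writing out with write.table()
fcount.df <- data.frame(id = names(fcount.vec), count = as.integer(fcount.vec))
print(fcount.df)
```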
I don't think this is a clustering algorithm. It's more about the frequency distribution of each unique sequence.
i.e. the cluster size at 100% identity.
This is a MUCH better option.