I have to group DNA sequences according to similarity and create a non-redundant (NR) database from them.
In my first attempt I seeded the NR database with the first sequence and built a BLAST database from the sequences added so far. Before adding each subsequent sequence, I ran BLAST against that database to check whether a similar sequence was already present. This left me with 52 sequences out of a total of 85.
blastn -db dna.fasta.db -query temp.fasta -evalue 1e-3 -max_target_seqs 1 -outfmt '6 qseqid sseqid sstart send evalue'
If this returns a hit, the sequence is skipped.
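The procedure above can be sketched as a greedy filter. This is only an illustration: `build_nr` and `toy_similar` are hypothetical names, and `toy_similar` is a stand-in for "blastn returned a hit at the chosen e-value", not a real BLAST call.

```python
from typing import Callable, Iterable, List


def build_nr(sequences: Iterable[str],
             is_similar: Callable[[str, str], bool]) -> List[str]:
    """Greedy filter: keep a sequence only if nothing kept so far matches it."""
    kept: List[str] = []
    for seq in sequences:
        # Corresponds to the BLAST search against the database built so far.
        if any(is_similar(seq, rep) for rep in kept):
            continue  # a hit exists, so the sequence is skipped
        kept.append(seq)
    return kept


def toy_similar(a: str, b: str) -> bool:
    # Hypothetical stand-in for a BLAST hit: equal length, at most one mismatch.
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1


seqs = ["ACGT", "ACGA", "TTTT", "ACGT", "TTTA", "GGGG"]
nr = build_nr(seqs, toy_similar)
print(len(nr), "non-redundant sequences out of", len(seqs))
```

Note that the outcome of such a filter can depend on the order of the input sequences, because each sequence is compared only against those already kept.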
In my second attempt I used blastclust. From what I've read, it should give the same result. I used the same e-value in the config file:
-e 1e-3
I ran it with the command below, but obtained 71 clusters (I expected 52) from a total of 85 sequences:
blastclust -i known.numbered.fasta -o known.numbered.fasta.cluster -p F -c config
Am I missing something about blastclust? The documentation is very vague.
I think blastclust has a length coverage threshold (the -L option, default = 0.9).
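To show why such a threshold could matter, here is a rough sketch of a length coverage check. It is not blastclust's actual implementation: `passes_length_coverage` is a hypothetical helper, and the `both` switch is an assumption modeled on blastclust's option to require coverage on both sequences of a pair rather than just one.

```python
def passes_length_coverage(aln_len: int, len_a: int, len_b: int,
                           threshold: float = 0.9, both: bool = True) -> bool:
    """Return True if the alignment covers enough of the sequences."""
    cov_a = aln_len / len_a  # fraction of sequence A inside the alignment
    cov_b = aln_len / len_b  # fraction of sequence B inside the alignment
    if both:
        return cov_a >= threshold and cov_b >= threshold
    return cov_a >= threshold or cov_b >= threshold


# A 460 bp alignment between a 500 bp and a 1000 bp sequence:
# it covers 92% of the short sequence but only 46% of the long one.
strict = passes_length_coverage(460, 500, 1000, both=True)
loose = passes_length_coverage(460, 500, 1000, both=False)
print(strict, loose)
```

A plain blastn search filtered only by e-value would report such a pair as a hit, whereas a coverage filter of this kind might keep the two sequences in separate clusters, which may be one reason for getting more clusters than expected.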
I tried changing that, but the result is the same.
Getting the same result is also strange. I would have expected that, without the length coverage threshold, the number of clusters might not be 52 but would at least be lower than 71. Beyond the other options of blastclust, I have no further ideas.