blastclust results understanding

7.6 years ago

juan.crescente ▴ 110

I have to group DNA sequences according to similarity, and create a non-redundant (NR) database from it.

In my first attempt, I seeded the NR database with the first sequence and kept a BLAST database of the sequences already added. Before adding each subsequent sequence, I ran BLAST against that database to check whether the new sequence was already represented. This gave me 52 results from a total of 85.

blastn -db dna.fasta.db -query temp.fasta -evalue 1e-3 -max_target_seqs 1 -outfmt '6 qseqid sseqid sstart send evalue'

If this returns a hit, the sequence is ignored.
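The incremental filtering described above can be sketched roughly as follows. This is only an illustration of the logic, not the poster's actual script: `greedy_nonredundant` and `toy_similar` are hypothetical names, and `toy_similar` is a stand-in for "blastn returns a hit below the e-value cutoff", which in practice would be a call to blastn.

```python
from typing import Callable, List


def greedy_nonredundant(
    seqs: List[str], is_similar: Callable[[str, str], bool]
) -> List[str]:
    """Keep a sequence only if it is not similar to any sequence kept so far."""
    kept: List[str] = []
    for seq in seqs:
        # Equivalent of "BLAST the new sequence against the database so far".
        if not any(is_similar(seq, k) for k in kept):
            kept.append(seq)
    return kept


def toy_similar(a: str, b: str, min_identity: float = 0.9) -> bool:
    """Hypothetical stand-in for a blastn hit: ungapped identity of equal-length strings."""
    if len(a) != len(b):
        return False
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a) >= min_identity


if __name__ == "__main__":
    toy = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"]
    # The second sequence differs from the first at one position, so it is skipped.
    print(greedy_nonredundant(toy, toy_similar))
```

Note that the result of this kind of filtering depends on the order in which sequences are processed, and any local hit is enough to discard a sequence.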

In my second attempt I used blastclust. From what I've read, it should give the same result. I set the same e-value in the config file:

-e 1e-3

However, running the command below gave me 71 clusters (I expected 52) from a total of 85 sequences.

blastclust -i known.numbered.fasta -o known.numbered.fasta.cluster -p F -c config
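To see where the 71 clusters come from, it can help to look at the cluster sizes rather than only the count. A minimal sketch, assuming the blastclust output file holds one cluster per line with sequence identifiers separated by whitespace (the function name `cluster_sizes` is hypothetical):

```python
from typing import List


def cluster_sizes(text: str) -> List[int]:
    """Return the number of sequence identifiers on each non-empty line."""
    return [len(line.split()) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Made-up identifiers for illustration; in practice read the file written
    # by blastclust, e.g. open("known.numbered.fasta.cluster").read()
    example = "s1 s2 s3\ns4\ns5 s6\n"
    sizes = cluster_sizes(example)
    print(len(sizes), "clusters;", sizes.count(1), "singletons")
```

A large number of singleton clusters would suggest that many pairs found by blastn are being rejected by blastclust's own thresholds.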

Am I missing anything about blastclust? The documentation is very vague.

blast blastclust • 1.6k views

updated 7.5 years ago by Biostar 20 • written 7.6 years ago by juan.crescente ▴ 110

I think blastclust has a length coverage threshold (default = 0.9).
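A length coverage threshold would explain why a pair that produces a blastn hit can still end up in separate clusters. The sketch below is a simplified model, not blastclust's actual code: the function name, the use of a single alignment length, and the `require_both` switch (whether coverage must hold for both sequences or just one) are assumptions for illustration.

```python
def passes_length_coverage(
    aln_len: int,
    len_a: int,
    len_b: int,
    threshold: float = 0.9,
    require_both: bool = True,
) -> bool:
    """Check whether an alignment covers enough of the two sequences."""
    cov_a = aln_len / len_a  # fraction of sequence A inside the alignment
    cov_b = aln_len / len_b  # fraction of sequence B inside the alignment
    if require_both:
        return cov_a >= threshold and cov_b >= threshold
    return cov_a >= threshold or cov_b >= threshold


if __name__ == "__main__":
    # Hypothetical numbers: a 300 bp local alignment between a 1000 bp
    # and a 320 bp sequence. blastn could report this as a significant hit.
    print(passes_length_coverage(300, 1000, 320))                      # False
    print(passes_length_coverage(300, 1000, 320, require_both=False))  # True
```

Under this model, a short local match that satisfies an e-value cutoff of 1e-3 would not be enough to link two sequences, which would give more clusters than the blastn-based filtering.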

7.6 years ago by fishgolden ▴ 520

I tried changing that, but got the same result.

7.6 years ago by juan.crescente ▴ 110

Getting the same result is also strange. I expected that, with the length coverage threshold removed, the number of clusters might not be exactly 52 but would at least be lower than 71. Beyond the extra blastclust options, I have no other ideas.

7.6 years ago by fishgolden ▴ 520
