I used to select the first several hits as the sequences I want. But in some cases a BLAST search retrieves a lot of sequences with high scores and low E-values. When that happens, I was advised to pick some sequences at random and exclude sequences from the same organism. Is that OK? Or is there a better solution?
Thanks for your useful answer. When BLASTing short sequences such as bacterial 16S rRNA, all the retrieved sequences can have identical E-values and scores. If I want to build phylogenetic trees, what should I do with these sequences?
I would analyze the whole set manually, for example: 1) align ALL the sequences; 2) group them by sequence similarity (most sequence alignment programs do that by default); 3) for every group of sequences referring to the same GeneID, take one whose length is closest to the average sequence length in the alignment and that has no regions clearly dissimilar to the other sequences in the alignment; 4) for every group of nearly identical sequences from the same species, take only one, from a representative strain (e.g. from all E. coli strains, take only the sequence from the K12 strain).
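The length-based part of steps 3 and 4 could be roughly sketched in Python as below. This is only an illustration under my own assumptions: the function name `pick_representatives` and the `(seq_id, group_key, sequence)` tuple layout are made up for the example, and `group_key` stands for whatever you group by (a GeneID or a species name). It does not check for regions that are clearly dissimilar to the rest of the alignment and it does not know which strain is representative, so those parts still need a manual look.

```python
from collections import defaultdict
from statistics import mean


def pick_representatives(records):
    """Pick one sequence per group, by length closest to the overall average.

    records: iterable of (seq_id, group_key, sequence) tuples (hypothetical
    layout). Returns a dict mapping each group_key to the chosen seq_id.
    """
    records = list(records)
    if not records:
        return {}

    def ungapped_len(seq):
        # Ignore alignment gap characters when measuring length.
        return len(seq.replace("-", ""))

    # Average length over ALL sequences, not per group.
    avg_len = mean(ungapped_len(seq) for _, _, seq in records)

    groups = defaultdict(list)
    for seq_id, key, seq in records:
        groups[key].append((seq_id, ungapped_len(seq)))

    # min() keeps the first member on ties, so input order breaks ties.
    return {
        key: min(members, key=lambda m: abs(m[1] - avg_len))[0]
        for key, members in groups.items()
    }


# Toy data, invented purely for illustration.
hits = [
    ("a1", "E. coli", "ACGTACGTAC"),
    ("a2", "E. coli", "ACGT"),
    ("b1", "B. subtilis", "ACGTACGT"),
    ("b2", "B. subtilis", "ACGTACGTACGTACGT"),
]
chosen = pick_representatives(hits)
```

With this toy input the average length is 9.5, so `chosen` keeps `a1` for E. coli and `b1` for B. subtilis. If you prefer a known reference strain (such as K12) over the length rule, you would filter for it before calling the function.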