I'm currently trying to work with a set of plant nucleotide contigs that are all greater than 300 bp in length. I'm trying to use NCBI blast to identify the contigs, but I'm running into a problem because my outputs tend to be flooded with ~50 bp long sequences in my contigs that match with random database sequences that definitely aren't real. For example, my last blast returned a large amount of ~50 bp sequences that matched with zebrafish, human, C. diff, and others. I upped the E value to be extremely strict (0.3), but I'm still getting these artifacts, and I feel like I'm losing a lot of quality results because I'm not working with a model system. I've been searching on Google and through the help file for a way to filter these short matches, but I haven't found anything. Does anyone have any advice on how I could clean these output files up?
0.3 is not a strict e-value. Try something like 1e-50. That could be considered strict.