I have a protein domain of interest.
I want to search for standalone proteins in which that domain makes up most of the sequence length.
I can think of two methods that would do the job:
Method 1: blastp with all nr sequences -> grep result within desired length -> my result
Method 2: grep nr sequences within desired length -> build blast database -> blastp -> my result
I prefer method 2 because I think that with method 1 the overwhelming number of hits outside the desired length will crowd out the hits I want.
Of course I could simply test both methods, but I would like to know:
1) How do the two methods compare?
2) When does BLAST stop reporting hits if there are too many (on the web server and standalone)?
Thank you.
I would go for option 1: it is much less biased and, I think, quicker than the subsampling approach.
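For comparison, the length pre-filter that option 2 would need before building the database could look roughly like the sketch below. It is only an illustration: the function names and the demo length window are made up, and it reads plain FASTA text without any external library.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(handle: TextIO, min_len: int, max_len: int) -> Iterator[Tuple[str, str]]:
    """Yield only records whose sequence length lies in [min_len, max_len]."""
    for header, seq in read_fasta(handle):
        if min_len <= len(seq) <= max_len:
            yield header, seq


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real sequence file (hypothetical data).
    demo = io.StringIO(">short\nMKT\n>medium\nMKTAYIAKQR\nQISFVK\n>long\n" + "A" * 50 + "\n")
    for header, seq in filter_by_length(demo, 10, 20):
        print(f">{header}\n{seq}")
```

Keep in mind that BLAST E-values scale with the size of the search space, so hits against a filtered database will not have the same E-values as the same hits against the full nr.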
To get all the hits you want, be sure to set num_alignments or max_target_seqs high enough. Depending on the input, you might also consider raising the e-value threshold.
As for the second part of your question: BLAST stops reporting hits as soon as either of the thresholds mentioned above is reached.
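A minimal sketch of option 1 with standalone BLAST+ might look like this. It assumes tabular output (`-outfmt 6`) with the subject length column `slen` added, so the length filter can be applied to the results afterwards. The file names, the default cap of 5000 and the helper names are placeholders, not recommendations. As far as I know, `-max_target_seqs` is the limit that applies to tabular output, while `-num_alignments` applies to the pairwise report formats.

```python
import shlex
from typing import Dict, Iterable, Iterator, List

# Column layout requested from blastp; "slen" is the subject (hit) length.
OUTFMT = "6 qseqid sseqid pident length evalue bitscore slen"
COLUMNS = OUTFMT.split()[1:]


def blastp_command(query: str, db: str, out: str,
                   evalue: float = 10.0, max_target_seqs: int = 5000) -> List[str]:
    """Build (but do not run) a blastp command line with explicit hit limits."""
    return [
        "blastp", "-query", query, "-db", db,
        "-evalue", str(evalue),
        "-max_target_seqs", str(max_target_seqs),
        "-outfmt", OUTFMT,
        "-out", out,
    ]


def hits_within_length(lines: Iterable[str], min_len: int, max_len: int) -> Iterator[Dict[str, str]]:
    """Yield tabular hits whose subject length lies in [min_len, max_len]."""
    slen_idx = COLUMNS.index("slen")
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if min_len <= int(fields[slen_idx]) <= max_len:
            yield dict(zip(COLUMNS, fields))


if __name__ == "__main__":
    # Hypothetical file names; pass the list to subprocess.run to execute it.
    cmd = blastp_command("domain.faa", "nr", "hits.tsv")
    print(" ".join(shlex.quote(arg) for arg in cmd))

    # Made-up result lines standing in for the contents of hits.tsv.
    sample = [
        "q1\tsubjA\t95.0\t120\t1e-50\t250\t130\n",
        "q1\tsubjB\t90.0\t118\t1e-45\t230\t900\n",
    ]
    for hit in hits_within_length(sample, 100, 200):
        print(hit["sseqid"], hit["slen"])
```

Filtering after the search keeps the full database statistics, and the length window can be changed without rerunning BLAST.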
Good point about bias! I want to know whether BLAST will in any case output all results within the constraints I set. How would num_alignments and max_target_seqs affect whether I get all the hits I want?
Theoretically you can set them as high as the number of entries in your database, but normally you should not need to go to that extreme, I think.
Running standalone BLAST with such a small input should not take too long, so you should be able to try a few values for those parameters.
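One way to organise that trial and error, sketched under the assumption of a single query: increase the cap until the number of reported hits falls below it, which suggests the cap is no longer what limits the output. Here `count_hits` is a hypothetical callable that you would write yourself to run BLAST with a given cap and count the hits. The demo replaces it with a fake search.

```python
from typing import Callable, Iterable, Optional


def first_unsaturated_cap(count_hits: Callable[[int], int],
                          caps: Iterable[int]) -> Optional[int]:
    """Return the smallest cap for which fewer hits than the cap came back.

    If the number of hits equals the cap, the list was probably truncated,
    so the next larger cap is tried. Returns None if every cap was saturated.
    """
    for cap in sorted(caps):
        if count_hits(cap) < cap:
            return cap
    return None


if __name__ == "__main__":
    # Fake search standing in for a real BLAST run: pretend that 1234
    # database sequences pass the e-value threshold.
    def fake_search(cap: int) -> int:
        return min(cap, 1234)

    print(first_unsaturated_cap(fake_search, [500, 1000, 5000, 20000]))
```

With several queries in one run, the cap applies to each query separately, so the hits would have to be counted per query.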