After getting, say, 100k transcripts from an RNA-seq project, one generally wants to annotate them against a database like nr, using, say, blastx. The problem is that this is very slow, taking e.g. a week with 24 CPUs. What have people done to overcome this?
I use GNU Parallel to run several BLAST jobs at once, with each job getting one CPU. Have a look here: Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them
That way it should only take a day or so.
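For anyone who would rather script this than use GNU Parallel, here is a minimal Python sketch of the same idea: split the transcripts into chunks and run one single-threaded blastx per chunk. This is a swap-in for illustration, not what GNU Parallel does internally. The function names, the `blast_chunks` directory, the e-value cutoff and the tabular output format are all my own assumptions; only the blastx flags themselves (`-query`, `-db`, `-out`, `-outfmt`, `-evalue`, `-num_threads`) are standard BLAST+ options.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, seq_lines = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq_lines)))
            header, seq_lines = line[1:], []
        elif header is not None:
            seq_lines.append(line)
    if header is not None:
        records.append((header, "".join(seq_lines)))
    return records


def split_round_robin(records, n_chunks):
    """Deal records into at most n_chunks lists whose sizes differ by at most one."""
    chunks = [[] for _ in range(n_chunks)]
    for i, rec in enumerate(records):
        chunks[i % n_chunks].append(rec)
    return [c for c in chunks if c]


def blastx_command(query_path, db, out_path):
    """Build a single-threaded blastx command (assumed cutoff: e-value 1e-5)."""
    return ["blastx", "-query", str(query_path), "-db", db,
            "-out", str(out_path), "-outfmt", "6",
            "-evalue", "1e-5", "-num_threads", "1"]


def run_chunks(fasta_path, db, n_jobs, workdir="blast_chunks"):
    """Write chunk files, then run one blastx process per chunk, n_jobs at a time."""
    workdir = Path(workdir)
    workdir.mkdir(exist_ok=True)
    records = read_fasta(Path(fasta_path).read_text())
    commands = []
    for i, chunk in enumerate(split_round_robin(records, n_jobs)):
        query = workdir / f"chunk_{i}.fa"
        query.write_text("".join(f">{h}\n{s}\n" for h, s in chunk))
        commands.append(blastx_command(query, db, workdir / f"chunk_{i}.tsv"))
    # Threads only wait on the external blastx processes, so the GIL is not an issue.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
```

Calling something like `run_chunks("transcripts.fa", "nr", 24)` (hypothetical file name) would then leave one `.tsv` per chunk to concatenate. Note that the round-robin split balances the number of sequences, not their lengths, so some chunks may finish well before others.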
Check out these guidelines: http://trinotate.sourceforge.net/ . In some cases blastp plus domain prediction is quite enough for annotation. By the way, have you considered using cloud services?
Another possibility is to reduce the size of the database you search. Instead of taking the complete NR database, you can take sequences from species that are very close to yours in the tree, plus some more distant species that are comprehensively studied and have substantial information, such as human, mouse etc.
This reduces the search space by orders of magnitude, and most of your sequences should still get annotated. However, there is a chance that a small fraction of your 100K transcripts will not get annotated.
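As a rough sketch of how such a subset could be pulled out of a protein FASTA dump, the snippet below keeps only records whose header names one of the chosen organisms. It assumes the organism appears in square brackets at the end of the description, as in NCBI-style headers such as `>XP_... some protein [Homo sapiens]`; the function name and the example species are made up for illustration, and matching by name in the header is only a heuristic.

```python
import re

# Assumption: organism names appear as "[Genus species]" in the FASTA header.
ORGANISM = re.compile(r"\[([^\[\]]+)\]")


def filter_fasta_by_organism(lines, wanted):
    """Yield FASTA lines for records whose header names a wanted organism."""
    wanted = {w.lower() for w in wanted}
    keep = False
    for line in lines:
        if line.startswith(">"):
            # Decide once per record, based on every bracketed name in the header.
            organisms = {m.lower() for m in ORGANISM.findall(line)}
            keep = bool(organisms & wanted)
        if keep:
            yield line
```

Because it streams line by line, it can be used on a large file with something like `out.writelines(filter_fasta_by_organism(open("nr.fa"), ["Homo sapiens", "Mus musculus"]))` (hypothetical file name), after which `makeblastdb -in subset.fa -dbtype prot -out subset` builds the smaller database.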
~Rama.
Expanding on this answer: have a look at SwissProt. It's manually curated, so you get less noisy results, but it's also much smaller than nr, so you'll get fewer results in much less time.