Hi all
I think I already know the answer, but I am finding it impossible to find a concrete answer online anywhere. And I want to ask anyway.
I am running a eukaryotic proteome (15 MB, >30,000 sequences) against the most recent NCBI nr database (370 GB unzipped) locally using blastp. So far it has been running for 8 days, with no end in sight. Everything appears to be running, as far as I can see from the system monitor, and the output file is slowly filling up (15 MB so far, after 8 days).
I am not running this on a powerful university computer. My home desktop is a Unix environment (Linux Mint) with 16 GB RAM and a quad-core CPU.
Is this normal? (I know, I have read in many places that at least 50 GB RAM may be needed for this to be completed in about an hour.) What further time frame might I be looking at here? Should I stop it and beg some university to allow me to use their systems for a day?
I really don't want to kill it after 8 days thus far, the info that will come back (if I don't kill it) is very important to me. But I cannot wait weeks.
Thanks in advance
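One rough way to put a number on the time-frame question is to check how far through the query file the search has got and extrapolate. The sketch below is only an estimate under some assumptions that the post does not confirm: the output is in tabular format (`-outfmt 6`, query ID in the first column), queries are reported in the same order as in the input FASTA, and the run time per query stays roughly constant. The function name and the file names in the usage comment are hypothetical.

```python
def estimate_remaining_days(fasta_lines, blast_lines, elapsed_days):
    """Extrapolate remaining run time from the fraction of queries reported so far.

    Assumes tabular output (query ID in column 1) and that queries are
    processed in input order. Queries with no hits produce no rows, so the
    position of the furthest query seen is used as the progress marker.
    """
    # Query IDs in FASTA order: first whitespace-delimited token after '>'.
    query_ids = [line[1:].split()[0] for line in fasta_lines if line.startswith(">")]

    # IDs that already appear in the output (skipping blank and comment lines).
    seen = {
        line.split("\t", 1)[0]
        for line in blast_lines
        if line.strip() and not line.startswith("#")
    }

    # Index of the last query in the FASTA that has at least one reported hit.
    last_index = max((i for i, q in enumerate(query_ids) if q in seen), default=-1)
    done = last_index + 1
    if done == 0:
        raise ValueError("no query from the FASTA file appears in the output yet")

    remaining = len(query_ids) - done
    return elapsed_days * remaining / done


# Usage (hypothetical file names):
# with open("proteome.faa") as fa, open("blastp_vs_nr.tsv") as out:
#     print(estimate_remaining_days(fa, out, elapsed_days=8))
```

Treat the result as a ballpark figure only. Long or low-complexity proteins can take far longer than average, so the true remaining time may differ considerably.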
Thanks, good to hear Diamond is a viable alternative for me. I have this loaded up and ready for use as I was already thinking of using it. All my output files will go through Alienness, and Diamond is an option for that. I knew as soon as I pressed "go" on local blastp search I would regret it. Cheers
Less than a month!! I have ten more proteomes to run!! I've killed it. Still, there was enough information in the output file to tell me I am on the right track. I am going to try the Diamond aligner. Thanks for replying with advice guys.
What is your justification for using NR as your reference database, i.e. why do you need to compare your eukaryotic proteome against "all known" protein sequences? What is the question that you're trying to answer?
You are going to have the same exact problem with DIAMOND if you are planning to use nr as your reference.
@5heikki This work is part of an HGT discovery pipeline, so I sort of need to run through the nr database. It is standard practice for extrinsic HGT characterization. Genomax is right, DIAMOND is as slow as BLAST+, on my machine anyway. I need a new machine.
Bacteria_forever: Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to the original question.
HGT specific to your organisms? You could reduce your proteome a lot by excluding all the proteins which get a good hit to the proteome of its closest sequenced relative, no? If you want something that is many orders of magnitude faster than blast then check out Mash. Creating a reference database would take quite a while though. I recently did a Mash all-vs-all of ~210k bacterial genomes. This took about 24 h with 128 threads.
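The pre-filtering idea above (drop every protein that already has a good hit in the closest sequenced relative, then send only the remainder to nr) could look something like the sketch below. It assumes you have first searched your proteome against the relative's proteome with blastp or DIAMOND in the default 12-column tabular format (qseqid, sseqid, pident, ..., evalue, bitscore). The identity and e-value cut-offs are placeholders I made up for illustration, not recommended values, and the function names are hypothetical.

```python
def queries_with_good_hit(tab_lines, min_pident=70.0, max_evalue=1e-10):
    """Return query IDs with at least one hit passing both thresholds.

    Expects default 12-column tabular output: column 3 is percent identity,
    column 11 is the e-value. Thresholds are illustrative assumptions.
    """
    good = set()
    for line in tab_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if float(cols[2]) >= min_pident and float(cols[10]) <= max_evalue:
            good.add(cols[0])
    return good


def filter_fasta(fasta_lines, exclude_ids):
    """Keep only FASTA records whose ID is not in exclude_ids."""
    kept = []
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Record ID is the first whitespace-delimited token of the header.
            keep = line[1:].split()[0] not in exclude_ids
        if keep:
            kept.append(line)
    return kept


# Usage (hypothetical file names):
# with open("proteome_vs_relative.tsv") as tab:
#     conserved = queries_with_good_hit(tab)
# with open("proteome.faa") as fa, open("proteome_reduced.faa", "w") as out:
#     out.writelines(filter_fasta(fa, conserved))
```

One caveat: a gene transferred before your organism and its relative diverged would have a good hit in the relative and so would be filtered out, which means this only helps if you are after organism-specific HGT.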
Thanks for all the advice guys. Very useful.