7.6 years ago
wangdp123
Hi,
Whenever I run InterProScan on about 10,000 protein sequences, it gets stuck at "90% completed".
Could you please help me with this?
Many thanks,
Regards,
Tom
Hi, please be more specific. What does stuck at 90% completed mean? Is there any message, did you check for running processes with top or ps? Which version of Interproscan are you using on which operating system and which infrastructure? Was the disk full? I assume that there were some of the long running tools still running when you interrupted the process. The time to complete 10k sequences can be several days depending on your machine size, and the % indication is not always a good estimate.
Hi,
Please have a look at the following log file, produced while InterProScan was running. The version of InterProScan is 5.23-62.0 and the Linux version is CentOS release 6.6 (Final). I used qsub to submit the job script with 16 cores, and I used the default interproscan.properties file provided with the InterProScan package. I am sure the disk is not full. I have tried many times with different numbers of protein sequences as input (5000, 10000, 30000); each time, once it reached 90% completed, the program hung and no new result files were generated after that point. I expected it to finish within 5 days using 16 cores for no more than 30000 proteins, but it didn't.
It is quite odd, since I have checked with the server administrator that this is not a maximum memory usage issue.
In contrast, when I input 10 protein sequences with the same command line, it works fine. I am wondering whether InterProScan uses two different mechanisms for small and large numbers of proteins, which might lead to this issue?
Many thanks,
Tom
Sat Apr 22 04:48:49 BST 2017
22/04/2017 04:48:53:798 Welcome to InterProScan-5.23-62.0
22/04/2017 04:49:05:053 Running InterProScan v5 in STANDALONE mode... on Linux
22/04/2017 04:49:14:115 Loading file pep.fa
22/04/2017 04:49:14:135 Running the following analyses:
[CDD-3.14,Coils-2.2.1,Gene3D-4.1.0,Hamap-201701.18,MobiDBLite-1.0,Pfam-30.0,PIRSF-3.01,PRINTS-42.0,ProDom-2006.1,ProSitePatterns-20.132,ProSiteProfiles-20.132,SFLD-2,SMART-7.1,SUPERFAMILY-1.75,TIGRFAM-15.0]
Available matches will be retrieved from the pre-calculated match lookup service.
Matches for any sequences that are not represented in the lookup service will be calculated locally.
22/04/2017 04:51:43:636 Uploaded/Stored 10799 sequences for analysis
22/04/2017 05:53:07:009 25% completed
22/04/2017 06:32:50:993 50% completed
22/04/2017 06:39:42:676 75% completed
22/04/2017 06:48:50:980 90% completed
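To put the log above in perspective, the gaps between progress milestones can be read off the timestamps. The following is a minimal sketch, not part of InterProScan: the function name milestone_gaps is made up for illustration, and it assumes the dd/mm/yyyy hh:mm:ss:ms timestamp format shown in the log and that all milestones fall on the same calendar day.

```shell
#!/bin/sh
# Hypothetical helper: print the time between consecutive "% completed"
# lines of an InterProScan log file given as $1.
# Assumes all milestones were logged on the same calendar day.
milestone_gaps() {
    awk '/% completed$/ {
        split($2, t, ":")                 # t[1]=hh, t[2]=mm, t[3]=ss, t[4]=ms
        secs = t[1] * 3600 + t[2] * 60 + t[3]
        if (seen) {
            printf "%s: +%d min\n", $3, (secs - prev) / 60
        } else {
            printf "%s: first milestone at %s\n", $3, $2
        }
        prev = secs
        seen = 1
    }' "$1"
}
```

Run as milestone_gaps logfile. For the log above it reports gaps of roughly 39, 6 and 9 minutes between the 25%, 50%, 75% and 90% marks, so the first 90% took about two hours in total.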
Is pep.fa a single multifasta? Is it reaching 90% of a single file, or does it successfully analyse up to 90% of your proteins? Have you tried splitting the multifasta up and running 10,000 short jobs instead?
Yes, pep.fa is a single multifasta file. I tried a smaller set of 5000 proteins to test the program, but it ran into the same issue. No final results (such as tsv, gff and so on) have been generated up to this step (90% completed); there are only some files in the "temporary" directory. Thus, I think no usable results for any proteins will come out unless it reaches 100%. I don't think splitting the 10000 sequences into 10000 independent single-fasta files is a good idea, since that would mean running InterProScan 10000 times, and I believe InterProScan is designed to support a multifasta file as input. What do you think?
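A possible middle ground between one large multifasta and 10000 single-sequence files is to split the input into chunks of a fixed number of records. The sketch below shows one way to do that with awk; the function name split_fasta, the chunk size and the output file naming are all assumptions for illustration, and nothing here comes from InterProScan itself.

```shell
#!/bin/sh
# Hypothetical helper: split a multifasta into files of at most N records.
# Usage: split_fasta pep.fa 1000 chunk   -> chunk_0001.fa, chunk_0002.fa, ...
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            if (count % n == 0) {         # start a new chunk every n headers
                if (out) close(out)
                out = sprintf("%s_%04d.fa", prefix, ++chunk)
            }
            count++
        }
        out { print > out }               # ignore anything before the first header
    ' "$1"
}
```

Each chunk is still a valid multifasta, so it can be passed to the same command line as the original file, and a chunk that hangs narrows down where the problem is.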
Yeah, it does seem silly. If InterProScan supports a multifasta, it should be capable of running on all of the sequences. My only suggestion would be to try progressively larger datasets, working up from a number of proteins you know will work (maybe try 10, 100, 1000 and 5000 proteins), and see where it breaks. I would expect InterProScan to write an error file if it is encountering any errors, but you could consider redirecting the STDERR stream into a file (2>file.txt), in case it is throwing errors you aren't actually seeing yet. Something else to consider is that one of the protein FASTA records in the file might be invalid in some way. Perhaps run your multifasta through some other FASTA parsers and make sure it behaves as expected.
Please try the following:
get an interactive login on one of the hosts, then try the following commands with the protein file that comes with InterProScan:
and
both commands should terminate without error and provide test_proteins.csv ...xml, etc.
Then, in case the program hangs again at 90%, use top -u username to check which processes are running.
Hi! Have you been able to resolve this issue? I'm in the same situation now: InterProScan was running for a couple of days on 10 threads, and now there is just one "java" thread. It has been stuck at 90% for another five days, and it's still there.
I think that this could be the final summarization of results and mainly doing IO. Note that one normally measures the running time of interproscan in several weeks for a medium sized genome. So you just have to be patient.
I had the same issue with version 5.23-62.0. With version 5.24-63.0, the run finished after a few hours.
Maybe this is because of the increased Java max heap size (-Xmx parameter) in the interproscan.sh script?
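For anyone who wants to follow the debugging suggestions made earlier in the thread (progressively larger subsets, capturing STDERR, and checking the FASTA file), here is a rough sketch of how they might be scripted. The helper names fasta_head, fasta_check and run_logged are hypothetical, and the notion of "invalid" used here (an empty sequence, or any character that is not a letter) is an assumption, not InterProScan's own validation.

```shell
#!/bin/sh
# Hypothetical helper: print the first N records of a multifasta.
# Usage: fasta_head pep.fa 100 > subset_100.fa
fasta_head() {
    awk -v n="$2" '/^>/ { if (++count > n) exit } count { print }' "$1"
}

# Hypothetical sanity check: list records with an empty sequence or with
# characters other than letters (for example "*" or a stray carriage return).
fasta_check() {
    awk '
        function report() {
            if (id != "" && len == 0) print "empty sequence: " id
            if (id != "" && bad)      print "non-letter characters: " id
        }
        /^>/ { report(); id = substr($0, 2); len = 0; bad = 0; next }
        {
            len += length($0)
            if ($0 ~ /[^A-Za-z]/) bad = 1
        }
        END { report() }
    ' "$1"
}

# Run a command, keeping STDOUT and STDERR in separate files.
# Usage (assumes the usual -i input option of interproscan.sh):
#   run_logged run_100 ./interproscan.sh -i subset_100.fa
run_logged() {
    log=$1
    shift
    "$@" >"$log.out" 2>"$log.err"
}
```

Working up through 10, 100, 1000 and 5000 records with fasta_head, and reading the .err file of the first run that hangs, should show whether the problem depends on input size or on a particular sequence.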