Problem: InterProScan consistently returns no output for a nucleotide multi-FASTA input that contains records with non-unique names
Description: I'm running InterProScan (InterProScan-5.21-60.0) on Linux in standalone mode. In test searches, where I search only the Pfam database and request GO terms, I use the test multi-FASTA file provided in the InterProScan package (test_nt_redundant.fasta). This file also includes some records that differ in sequence but have non-unique names (see below), and the analysis runs without any problems.
interproscan.sh -i test_nt_redundant.fasta -b output -goterms -appl Pfam -t n
The fasta headers in the test file are:
>A2YIW7
>Bob
>ENA|AACH01000026|AACH01000026.1 Saccharomyces mikatae IFO 1815 YM4906-Contig2858, whole genome shotgun sequence.
>ENA|AACH01000027|AACH01000027.2 Saccharomyces mikatae IFO 1815 YM4906-Contig2858, whole genome shotgun sequence.
>Henry
>reverse translation of P22298
>reverse translation of P22298
>Wilf
However, when I run the same analysis on a set of 15 sequences that I'd like to annotate, which also contains some records with non-unique identifiers, I consistently receive the following message and InterProScan exits without any output:
Found 3 non unique identifier(s). These identifiers do have different sequences, within the FASTA nucleotide sequence input file.
Please find below a list of detected identifiers:
100646091
100646573
100645787
InterProScan will shutdown, because there is no way to map nucleic sequences and predicted proteins.
Remarkably, even the returned list of non-unique identifiers is incomplete (see below for the FASTA headers in the 15-sequence set):
>100645110
>100645230
>100645431
>100645550
>100645666
>100645666
>100645666
>100645787
>100645787
>100645973
>100646091
>100646091
>100646214
>100646573
>100646573
Additionally, when I remove the records with non-unique identifiers from the 15-sequence set, the analysis runs without any problem, so I guess the problem is somehow connected to the number of non-unique FASTA identifiers in the input.
I'm wondering what the source of this error might be and how to solve it. Thanks in advance for any hints.
Just add a unique identifier to non-unique fasta headers. It makes sense to stop on non-unique identifiers since if IDs are not unique, you wouldn't be able to unambiguously associate the results with a sequence.
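One possible way to do this in bulk is a small Python script that appends a running number to every identifier that occurs more than once. This is only a sketch: the function name `make_headers_unique` is made up for illustration, and it assumes that the identifier is the first whitespace-delimited word of the header line, which may not match how InterProScan itself parses headers.

```python
from collections import Counter


def split_header(line):
    """Split a FASTA header line into (identifier, description)."""
    parts = line[1:].split(None, 1)
    ident = parts[0] if parts else ""
    desc = parts[1] if len(parts) > 1 else ""
    return ident, desc


def make_headers_unique(lines):
    """Return the FASTA lines with _1, _2, ... appended to repeated identifiers.

    Assumption: the identifier is the first word after '>'.
    Identifiers that occur only once are left unchanged.
    """
    lines = [line.rstrip("\n") for line in lines]

    # First pass: count how often each identifier occurs.
    totals = Counter(split_header(l)[0] for l in lines if l.startswith(">"))

    # Second pass: rewrite only the headers whose identifier is repeated.
    seen = Counter()
    out = []
    for line in lines:
        if line.startswith(">"):
            ident, desc = split_header(line)
            if totals[ident] > 1:
                seen[ident] += 1
                ident = f"{ident}_{seen[ident]}"
                line = ">" + ident + (" " + desc if desc else "")
        out.append(line)
    return out


# Example use (file names are placeholders):
# with open("input.fasta") as src, open("input_unique.fasta", "w") as dst:
#     dst.write("\n".join(make_headers_unique(src)) + "\n")
```

With the headers from the question, the three `>100645666` records would become `>100645666_1`, `>100645666_2` and `>100645666_3`, while `>100645110` stays as it is. The sketch does not check whether a suffixed name collides with an identifier that already exists in the file, so it is worth checking the output before running InterProScan on it.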
Thanks for the reply! I thought that InterProScan would be able to handle non-unique identifiers (e.g. by adding a numeric suffix), but I went through the manual once again and indeed it does not (https://github.com/ebi-pf-team/interproscan/wiki/ScanNucleicAcidSeqs). The reason it did not return an error with the sample set is that the two sequences with identical identifiers in that multi-FASTA file also have identical sequences, in which case InterProScan just merges them into one sequence.
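To see in advance which repeated identifiers fall into which case, a quick check along the following lines might help. It is a rough sketch that mirrors the behaviour described above rather than InterProScan's actual implementation; the function name `classify_duplicate_ids` is hypothetical, and the identifier is again assumed to be the first word of the header.

```python
from collections import defaultdict


def classify_duplicate_ids(fasta_text):
    """Sort repeated identifiers into (conflicting, identical).

    conflicting: the identifier is shared by records with different sequences
    identical:   the identifier is repeated, but every copy has the same sequence
    Assumption: the identifier is the first word after '>'.
    """
    records_by_id = defaultdict(list)
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            current = words[0] if words else ""
            # Start a new record (a list of sequence lines) for this identifier.
            records_by_id[current].append([])
        elif current is not None:
            records_by_id[current][-1].append(line.upper())

    conflicting, identical = [], []
    for ident, records in records_by_id.items():
        if len(records) < 2:
            continue  # identifier is unique, nothing to report
        distinct = {"".join(chunks) for chunks in records}
        if len(distinct) > 1:
            conflicting.append(ident)
        else:
            identical.append(ident)
    return conflicting, identical


# Example use (file name is a placeholder):
# with open("input.fasta") as handle:
#     conflicting, identical = classify_duplicate_ids(handle.read())
#     print("different sequences:", conflicting)
#     print("identical sequences:", identical)
```

If the explanation above is right, identifiers in the first list are the ones that trigger the shutdown, and identifiers in the second list are silently merged, which could also be why a repeated identifier is missing from the list in the error message.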