Question

InterProScan standalone - error connected to non-unique fasta identifiers

8.1 years ago

al-ash ▴ 210

Problem: InterProScan consistently returns no output when my nucleotide FASTA input contains several entries with non-unique names.

Description: I'm running an InterProScan (InterProScan-5.21-60.0) search on Linux in standalone mode. In test searches restricted to GO terms and the Pfam database, I used the test multi-FASTA file provided in the InterProScan package (test_nt_redundant.fasta), which also includes some entries with different sequences but non-unique names (see below), and the analysis ran without any problems.

interproscan.sh -i test_nt_redundant.fasta -b output -goterms -appl Pfam -t n

The fasta headers in the test file are:

>A2YIW7
>Bob
>ENA|AACH01000026|AACH01000026.1 Saccharomyces mikatae IFO 1815 YM4906-Contig2858, whole genome shotgun sequence.
>ENA|AACH01000027|AACH01000027.2 Saccharomyces mikatae IFO 1815 YM4906-Contig2858, whole genome shotgun sequence.
>Henry
>reverse translation of P22298
>reverse translation of P22298
>Wilf

However, when I run the same analysis on a set of 15 sequences that I'd like to annotate, which also contains some entries with non-unique identifiers, I consistently receive the following message and InterProScan ends without any output:

Found 3 non unique identifier(s). These identifiers do have different sequences, within the FASTA nucleotide sequence input file.
    Please find below a list of detected identifiers:
    100646091
    100646573
    100645787
    InterProScan will shutdown, because there is no way to map nucleic sequences and predicted proteins.

Remarkably, even the returned list of non-unique identifiers is incomplete (see below for the list of FASTA headers in the 15-sequence set):

Additionally, when I remove the non-unique entries from the 15-sequence set, the analysis runs without any problem, so I guess the problem is somehow connected to the number of non-unique FASTA identifiers in the input.

I'm wondering what might be the source of this error and how to solve it? Thanks in advance for any hints.

InterProScan nucleic acid non-unique identifier • 2.1k views

ADD COMMENT • link 8.1 years ago by al-ash ▴ 210

Just add a unique identifier to non-unique fasta headers. It makes sense to stop on non-unique identifiers since if IDs are not unique, you wouldn't be able to unambiguously associate the results with a sequence.
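
One way to do this without editing headers by hand is sketched below. It is only an illustration, not part of InterProScan: the function names `header_id` and `make_ids_unique` are hypothetical, and it assumes the identifier is the first whitespace-delimited token after `>` (check this against how your headers are actually parsed). It also assumes that the suffixed names (e.g. `100646091_1`) do not collide with identifiers already present in the file.

```python
from collections import Counter


def header_id(line):
    """Return the identifier of a FASTA header line, or None for other lines.

    Assumption: the identifier is the first whitespace-delimited token after '>'.
    """
    if not line.startswith(">"):
        return None
    tokens = line[1:].split(None, 1)
    return tokens[0] if tokens else None


def make_ids_unique(lines):
    """Append _1, _2, ... to every identifier that occurs more than once."""
    lines = [line.rstrip("\n") for line in lines]
    # First pass: count how often each identifier occurs.
    totals = Counter(i for i in map(header_id, lines) if i is not None)
    seen = Counter()
    out = []
    for line in lines:
        seq_id = header_id(line)
        if seq_id is not None and totals[seq_id] > 1:
            seen[seq_id] += 1
            # Keep whatever followed the identifier (the description) unchanged.
            rest = line[1:].lstrip()[len(seq_id):]
            line = ">" + seq_id + "_" + str(seen[seq_id]) + rest
        out.append(line)
    return out


if __name__ == "__main__":
    # Small in-memory demo; the sequences are made up for illustration.
    demo = [">100646091 gene A", "ATGC", ">100646091 gene B", "ATGG", ">Bob", "ATGA"]
    print("\n".join(make_ids_unique(demo)))
```

To apply it to a file, read the lines with `open(path)` and write the returned list to a new file, joined with newlines, then pass that file to `interproscan.sh -i`.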

ADD REPLY • link 8.1 years ago by Jean-Karim Heriche 27k

Thanks for the reply! I thought that InterProScan would be able to take care of non-unique identifiers (e.g. by adding a numeric suffix), but I went through the manual once again and indeed it does not (https://github.com/ebi-pf-team/interproscan/wiki/ScanNucleicAcidSeqs). The reason it did not return an error with the sample set is that the two sequences with identical identifiers in that multi-FASTA file also have identical sequences, in which case InterProScan just merges them into one sequence.
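
The distinction described here (a repeated identifier is harmless when the sequences are identical, but fatal when they differ) can be checked before starting a run. The sketch below is a hypothetical helper, not InterProScan code. It assumes the identifier is the first whitespace-delimited token of the header and compares sequences as exact strings, which may not match InterProScan's own comparison in every detail.

```python
from collections import defaultdict


def parse_fasta(lines):
    """Yield (identifier, sequence) pairs from FASTA-formatted lines.

    Assumption: the identifier is the first whitespace-delimited header token.
    """
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            tokens = line[1:].split(None, 1)
            seq_id = tokens[0] if tokens else ""
            chunks = []
        elif line and seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def conflicting_ids(lines):
    """Return identifiers that occur with more than one distinct sequence."""
    seqs_by_id = defaultdict(set)
    for seq_id, seq in parse_fasta(lines):
        seqs_by_id[seq_id].add(seq)
    return sorted(i for i, seqs in seqs_by_id.items() if len(seqs) > 1)


if __name__ == "__main__":
    # Made-up example: 'x' is duplicated with the same sequence,
    # 'y' is duplicated with different sequences.
    demo = [">x", "ATG", ">x", "ATG", ">y", "AAA", ">y", "CCC"]
    print(conflicting_ids(demo))
```

Only the identifiers this function returns need renaming; exact duplicates (same identifier, same sequence) can be left as they are.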

ADD REPLY • link 8.1 years ago by al-ash ▴ 210