Question

Some Questions About Using Orthomcl To Find Orthologs Within Many Species

3

13.0 years ago

User 7478 ▴ 30

When I follow the OrthoMCL User Guide, I use orthomclAdjustFasta to produce a compliant FASTA file, in which each protein has a definition line in the following format: >xxx|yyyyyyyy. But when I run blastall:

blastall -i ALL_goodProteins.fasta -d BLL_goodProteins.fasta -p blastp -e 1e-10 -m 8 -o A-to-B.txt

I get errors like these:

[blastall] ERROR: SeqPortNew: lcl|172_BLL_goodProteins.fasta stop(449) >= len(367)
[blastall] ERROR: SeqPortNew: lcl|172_BLL_goodProteins.fasta start(450) >= len(367)
[blastall] ERROR: SeqPortNew: lcl|172_BLL_goodProteins.fasta start(459) >= len(367)
[blastall] ERROR: SeqPortNew: lcl|172_BLL_goodProteins.fasta start(531) >= len(367)

I think all sequences named "BLL|yyyyy" or "ALL|yyyyyyy" may be treated as duplicate IDs.

So I then used a non-compliant FASTA file (each protein only has a definition line >yyyy) to run NCBI BLAST with -m 8. But when I passed the BLAST results to orthomclBlastParser, I only got an empty file named similarSequences.txt.

Can anyone help me? Thank you very much!

orthomcl fasta conversion • 6.0k views

ADD COMMENT • link updated 12.7 years ago by Damian Kao 16k • written 13.0 years ago by User 7478 ▴ 30

score 7 · Answer 1 · 2011-11-21

7

13.0 years ago

Damian Kao 16k

Does your header line have spaces? BLAST reads everything up to the first space as the sequence ID. If two entries have the same name up to the first space, it can cause the error you described. For example:

>SMA|ID 02919
XXXXXXXXXXXXXXX
>SMA|ID 02399
XXXXXXXXXXXXX

For BLAST, both of those sequences would have the same ID, "SMA|ID", causing an error.
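To check a file for this kind of collision before running BLAST, something like the following Python sketch may help. It assumes the ID is simply the header text up to the first whitespace, as described above; the function names are illustrative and not part of BLAST or OrthoMCL.

```python
from collections import Counter


def blast_ids(fasta_lines):
    """Yield the header text up to the first whitespace for each record."""
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            yield fields[0] if fields else ""


def duplicated_ids(fasta_lines):
    """Return {id: count} for every ID that occurs more than once."""
    counts = Counter(blast_ids(fasta_lines))
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


# The two example records from above collapse to the same ID.
example = [
    ">SMA|ID 02919",
    "XXXXXXXXXXXXXXX",
    ">SMA|ID 02399",
    "XXXXXXXXXXXXX",
]
print(duplicated_ids(example))  # {'SMA|ID': 2}
```

For a real file, the same function can be given an open file handle, since it only iterates over lines.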

ADD COMMENT • link 13.0 years ago by Damian Kao 16k

1

OrthoMCL really just needs those first three characters to distinguish between the two datasets when you do the all-vs-all BLAST. Whatever comes after the 'XXX|' is just the ID of the sequence within the dataset, which needs to be unique and free of spaces for BLAST to work. So if you reformat your FASTA files so that there are no spaces in the ID field, it should work.
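As a rough sketch of that reformatting step in Python: the function below joins the whitespace-separated parts of each header with a replacement string and leaves sequence lines untouched. The function name and the default underscore are assumptions made for illustration; passing an empty string removes the spaces outright.

```python
def remove_header_spaces(fasta_lines, replacement="_"):
    """Yield FASTA lines with whitespace in header lines replaced."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Split on any whitespace and rejoin, so the whole header
            # becomes a single token that BLAST reads as the ID.
            yield ">" + replacement.join(line[1:].split())
        else:
            yield line


records = [">SMA|ID 02919", "XXXXXXXXXXXXXXX", ">SMA|ID 02399", "XXXXXXXXXXXXX"]
for out_line in remove_header_spaces(records):
    print(out_line)
```

With the example records, the two headers become >SMA|ID_02919 and >SMA|ID_02399, which are distinct IDs. It is still worth confirming afterwards that the resulting IDs are unique.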

ADD REPLY • link 13.0 years ago by Damian Kao 16k

0

Yes, I think that might be the reason. But the OrthoMCL User Guide says "each protein in those files must have a definition line in the following format: >xxxx|yyyyyyyy", or else I cannot run the later steps such as orthomclBlastParser.

ADD REPLY • link 13.0 years ago by User 7478 ▴ 30

0

Thank you very much! Each of my original sequence IDs contained a space, and after removing it, blastall runs successfully!

But I have another problem. I run orthomclBlastParser like this:

orthomclBlastParser Hsa-Ath.txt Ath >>similarSequences.txt

"Hsa-Ath.txt" is the BLAST output in m8 format.
"Ath" is the directory of compliant FASTA files produced by orthomclAdjustFasta.

But it tells me: "couldn't find taxon for gene '2_Ath.fasta' at /opt/bin/orthomclBlastParser line 103, <F> line 1." Could you help me? Thank you!
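One way to narrow down an error like this is to check whether the IDs in the first two columns of the m8 table (query and subject) actually appear in the definition lines of the compliant FASTA files. The Python sketch below does only that comparison. It assumes the parser looks each BLAST ID up among those definition lines, which is an assumption about its behavior rather than something confirmed in this thread; the function names and the example IDs other than '2_Ath.fasta' are made up for illustration.

```python
def fasta_ids(fasta_lines):
    """Collect the first whitespace-delimited token of every header line."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def unknown_blast_ids(m8_lines, known_ids):
    """Return query/subject IDs from an m8 table that are not in known_ids."""
    missing = set()
    for line in m8_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.rstrip("\n").split("\t")
        for seq_id in cols[:2]:  # column 1 = query ID, column 2 = subject ID
            if seq_id not in known_ids:
                missing.add(seq_id)
    return missing


# Hypothetical data: the subject ID in the table is not a FASTA header ID.
fasta = [">Ath|AT1G01010", "MEDQVG", ">Hsa|P12345", "MSTNPK"]
m8 = ["Hsa|P12345\t2_Ath.fasta\t45.0\t100\t55\t0\t1\t100\t1\t100\t1e-20\t90.0"]
print(unknown_blast_ids(m8, fasta_ids(fasta)))
```

If this reports IDs such as '2_Ath.fasta', the BLAST output is not carrying the taxon|id names from the FASTA headers, so the place to look is how the BLAST database was built rather than the parser itself.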

ADD REPLY • link 13.0 years ago by User 7478 ▴ 30

0

I have a similar problem, described in "Blast Gives Cryptic Errors", but I don't see any spaces.

ADD REPLY • link 12.0 years ago by hbw ▴ 90

0

If you wanted to use orthomclAdjustFasta on this, you would pass 3 as the position of the ID field, because that script treats spaces and "|" characters in the header as field separators. If you instead want to keep whatever word is in the place of "ID", you would remove the space between ID and 02919.

ADD REPLY • link 11.3 years ago by sburlce • 0