I'm having a problem with orthoMCL at the "orthomclPairs" stage; it is being caused from an earlier action between the BLAST and relational database step. Here is a recap of the steps for anyone familiar with the program but needed a reminder:
- BLAST database built with all "goodProteins"
- BLASTall using "goodProteins" as a query against the database. Output is in specified format.
- orthomcl schema installed into a mySQL database
- orthomclBlastParser takes blast output and makes new file for mySQL called similarSequences.
- orthomclLoadBlast takes new file and creates table in mySQL with it called SimilarSequences.
- orthomclPairs takes the first table and creates new tables.
The point all my runs fail is at orthomclPairs, where I get a duplicate key message:
~/orthomcl/test1$ DBD::mysql::st execute failed: Duplicate entry 'l3a|maker-scaffold_5922-exonerate_est2genome-gene-0.13-mRNA--the' for key 'best_hit_ix' at /share/apps/orthomcl-2.0.9/bin/orthomclPairs line 709, <F> line 15.
I've been troubleshooting that for a few weeks but it seems like what others have posted doesn't work for me. The 'best_hit_ix' is a table being made. Here is a short list of what I thought was a problem but did not end of fixing it.
1) Looking for a fix for this error directly led me to two discussions that seemed to just drop rows in the SimilarSequences table that were not unique:
http://seqanswers.com/forums/showthread.php?t=13883
http://seqanswers.com/forums/showthread.php?t=42919
This didn't seem difficult so I renamed the original table to keep as a backup and made a new table from this with their fix, in mySQL:
> rename table SimilarSequences to similarSequences;
> create table SimilarSequences as select distinct * from similarSequences;
Unfortunately I ran into the same problem again, so it seemed my issue wasn't identical.
2) I thought the SimilarSequences table had multiple query-subject combinations (which I thought was ok because there are multiple places in each protein that are found to align) and that it had to be removed. So I thought I would use mySQL to only keep distinct combinations of the subject and query:
create table test1 as select distinct query_id,subject_id from similarSequences;
select * from test1 group by query_id,subject_id having count(*)>1;
This also didn't make the program happy. I got an error message when I ran orthomclPairs.
3) Maybe the genome I am working on had deflines that made orthomcl unhappy. I subbed all dashes for underscores and reran the entire pipeline. Did not fix it.
4) I thought I should try modifying the BLAST procedure since I am more comfortable with controlling that. I modified all the arguments so it was similar to the blastall script, but in blastp instead. Using blastp, I can set a maximum for hsp which would be the equivalent (I thought) of limiting a single combination of query to subject. Here is the BLAST script the authors give:
-F 'm S' -v 100000 -b 100000 -z protein_database_size -e 1e-5
Here is the first blastall script I worked with:
blastall -p blastp -d goodprotdb -F 'm S' -v 100000 -b 100000 -z 36146 -e 0.01 -m 8 -i goodProteins.fasta -o blastall.out
And here is the blastp script I made which should give me the same format the orthomcl pipeline needs:
blastp -db goodprotdb -max_hsps 1 -num_descriptions 100000 -num_alignments 100000 -dbsize 36146 -evalue 0.01 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" -query goodProteins.fasta -o blastp.out
I also tried modifying the original blastall recipe with two variations, K and P:
blastall -p blastp -d goodprotdb -K 1 -F 'm S' -v 100000 -b 100000 -z 36146 -e 0.01 -m 8 -i goodProteins.fasta -o blastpK.out
blastall -p blastp -d goodprotdb -P 1 -F 'm S' -v 100000 -b 100000 -z 36146 -e 0.01 -m 8 -i goodProteins.fasta -o blastpP.out
These options stand for
-P 0 for multiple hit, 1 for single hit
-K Number of best hits from a region to keep.
I couldn't determine completely from online resources if these options were what I thought they were, but if duplicate entries were causing the failure, I thought I would at least do the blast and compare them to the original. The plain 'blastall' and 'blastall' with -P are done, and I don't think that made a difference because they are the same size. The 'blastall' with -K and 'blastp' are still running, and the file size for that is about half, so I know it is trimmed down.
What I am unsure of about this approach is if I am going to run into problems with the results. I could not determine for sure if the multiple hits from each protein-protein pair were used in the calculation of percent match in the orthomclBlastParser program, or if only having a single result for each match was sufficient to run the program. I looked at the code for this script and couldn't decide, so I thought I would at least try the variations and see if one allowed the program to reach the end successfully.
Feedback/ideas/suggestions all welcome, I've been working on this for a few weeks and our university help desk is also not sure how to fix it.
Save yourself a lot of hassle and just switch to using Roary instead.
Hi jrj,
Thanks for the idea, I have not heard of that program. Briefly looking at the program, I don't think it would be an option for this project. I am working on microbial eukaryotes, and this relies on using prokka for starting the pipeline. Annotation for Eukaryotes is more challenging, and I am working on organisms that are divergent from what is well-known in databases. prokka is best for bacterial and archaeal genomes, and would miss a lot of gene calls for me, especially alternate transcripts.
However I am doing a pangenome of archaea and this might be a great tool in addition to anvio, so I think i'll look into it a bit more. Thank you!
Ah yeah, didn’t realise you were using eukaryotic data, no worries.