Hello everyone,
I'm a new user of MAKER and I'm seeking assistance with the protocol I'm using. Currently, I'm annotating the genome of a non-model ascomycete fungus belonging to the family Sporocadaceae.
After running MAKER, I obtained merged FASTA and GFF files using fasta_merge and gff3_merge, respectively. I then renamed my outputs using maker_map_ids (to create the .all.id.map file), map_gff_ids for the GFF files, and map_fasta_ids for both the protein and transcript FASTA files.
Now I'm at the stage where I want to BLAST the predicted proteins against a database, and here my first questions arise:
Can I download the database from UniProt? Should I download the entire SwissProt protein set, or only the SwissProt entries in the Fungi category? Or should I download all fungal proteins available on UniProt? Despite these uncertainties, I decided to move forward to make sure my pipeline works correctly. I ran a blastp search of the proteins produced by MAKER against a protein database downloaded from SwissProt (Fungi category only).
I used this command:
blastp -db swissprot_fungi.fasta -query MYGEN.proteins.fasta -outfmt 5 -evalue 1e-5 -out MYGEN.proteins.xml -num_alignments 5 -num_threads 24
At this point, my second question arises. Is there a Python code or a similar tool that can combine the two files, MYGEN.proteins.fasta and MYGEN.proteins.xml, and return a GFF3 file?
In other words, I'm looking for code that does something like this:
python blast2annot.py -i MYGEN.proteins.fasta -b MYGEN.proteins.xml
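I'm not aware of a standard blast2annot.py, but the -outfmt 5 XML can be mined with nothing but the Python standard library. The sketch below (file names and the `Note=` attribute convention are my assumptions, not an official MAKER format) pulls the top hit per query and appends its description to matching mRNA lines of a GFF3 file. MAKER itself also ships accessory scripts (maker_functional_gff, maker_functional_fasta) for this step, which may fit your pipeline better.

```python
# Sketch: extract the best BLAST hit per query from -outfmt 5 XML and
# push the descriptions into a GFF3 file as Note= attributes.
# File names and the Note= convention are illustrative assumptions.
import xml.etree.ElementTree as ET

def best_hits(xml_source):
    """Return {query_id: description of top hit} from BLAST XML."""
    hits = {}
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "Iteration":
            qdef = elem.findtext("Iteration_query-def") or ""
            qid = qdef.split()[0] if qdef else ""
            top = elem.find(".//Hit")  # hits arrive pre-sorted by score
            if qid and top is not None:
                hits[qid] = top.findtext("Hit_def", "")
            elem.clear()  # keep memory flat on large XML files
    return hits

def annotate_gff(gff_in, gff_out, hits):
    """Append a Note= attribute to mRNA lines whose ID has a hit."""
    with open(gff_in) as src, open(gff_out, "w") as dst:
        for line in src:
            cols = line.rstrip("\n").split("\t")
            if len(cols) == 9 and cols[2] == "mRNA":
                # crude ID= extraction; adjust to your GFF3 attributes
                attrs = dict(kv.split("=", 1)
                             for kv in cols[8].split(";") if "=" in kv)
                desc = hits.get(attrs.get("ID", ""))
                if desc:
                    cols[8] += ";Note=" + desc.replace(";", "%3B")
            dst.write("\t".join(cols) + "\n")

# usage (assumed file names):
#   hits = best_hits("MYGEN.proteins.xml")
#   annotate_gff("MYGEN.all.gff", "MYGEN.annotated.gff", hits)
```

This only transfers the top-hit description; for proper functional annotation you would normally also filter by e-value and coverage.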
Thank you very much for your help.
SwissProt fasta: https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/uniprot_sprot.fasta.gz
Trembl fasta: https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/uniprot_trembl.fasta.gz
Complete set.
Thank you so much for your responses.
GenoMax, I've downloaded both of the databases you recommended. I combined them into a single FASTA file and built a database for the blastp search with makeblastdb. However, the merged FASTA file was so large that, despite running the blastp analysis on my server for several hours, no results were generated.
I also tried using makeblastdb with just the SwissProt dataset containing fungal sequences, and the analysis worked.
So I assume the search speed I'm seeing depends on the size of the database I build with makeblastdb.
That being said, I will try downloading the files in dat.gz format from the link Mensur Dlakic suggested.
At this point, here's the plan:
makeblastdb -in merged_db.fasta -dbtype prot
Is this procedure correct?
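Before running makeblastdb, the two UniProt downloads have to be decompressed and concatenated into one file. A minimal stdlib sketch (the file names are assumptions based on the UniProt FTP downloads) that streams the gzipped FASTAs straight into the merged file:

```python
# Sketch: stream gzipped UniProt FASTAs into one merged file, ready
# for `makeblastdb -in merged_db.fasta -dbtype prot`.
# File names below are assumptions, not required names.
import gzip
import shutil

def merge_gzipped_fastas(gz_paths, out_path):
    """Decompress each .gz FASTA in turn and append it to out_path."""
    with open(out_path, "wb") as out:
        for path in gz_paths:
            with gzip.open(path, "rb") as src:
                shutil.copyfileobj(src, out)

# usage (assumed file names):
# merge_gzipped_fastas(
#     ["uniprot_sprot.fasta.gz", "uniprot_trembl.fasta.gz"],
#     "merged_db.fasta",
# )
```

Streaming with copyfileobj avoids loading the very large TrEMBL file into memory.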
I think this will work. For the .dat -> .fasta conversion it may be helpful to download a utility called esl-reformat, which is part of the HMMER package: http://hmmer.org/
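If installing HMMER is inconvenient, a short stdlib parser can do the same conversion. The sketch below is an assumption-laden minimal reader of the SwissProt flatfile layout: it only looks at the ID, AC, and SQ/terminator records and ignores everything else, so treat it as a fallback rather than a full replacement for esl-reformat.

```python
# Sketch: minimal UniProt .dat -> FASTA converter (fallback for
# esl-reformat). Handles only ID, AC, and SQ records of the standard
# SwissProt flatfile layout; all other record types are skipped.
def dat_to_fasta(dat_lines):
    """Yield '>accession entry_name\\nsequence' records from flatfile lines."""
    name = acc = None
    seq = []
    in_seq = False
    for line in dat_lines:
        if line.startswith("ID"):
            name = line.split()[1]          # entry name, e.g. ABC1_YEAST
        elif line.startswith("AC") and acc is None:
            acc = line.split()[1].rstrip(";")  # primary accession only
        elif line.startswith("SQ"):
            in_seq = True                   # sequence block starts next line
        elif line.startswith("//"):
            yield ">%s %s\n%s" % (acc, name, "".join(seq))
            name = acc = None
            seq = []
            in_seq = False
        elif in_seq:
            seq.append(line.replace(" ", "").strip())

# usage (assumed file name):
# with open("uniprot_sprot_fungi.dat") as dat, open("fungi.fasta", "w") as out:
#     for rec in dat_to_fasta(dat):
#         out.write(rec + "\n")
```

For the gzipped .dat.gz downloads, wrap the input in gzip.open(..., "rt") first.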
I ran blastp, but even after 50:30 hours of runtime the analysis had still not finished for the whole set of proteins.
Does anyone have suggestions on how to speed up this process? Can the -num_threads option help me increase the speed of this analysis?
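-num_threads does help, but it only scales within one machine and one process. A common complementary trick is to split the query FASTA into chunks and run one blastp job per chunk (in parallel or across nodes), then concatenate the outputs. A stdlib sketch of the splitting step; the chunk file names in the usage note are assumptions:

```python
# Sketch: split a protein FASTA into N roughly equal chunks so several
# blastp jobs can run in parallel. Expects lines with their newlines
# kept (e.g. from splitlines(keepends=True) or file iteration).
def split_fasta(fasta_lines, n_chunks):
    """Return n_chunks lists of complete FASTA record strings, round-robin."""
    chunks = [[] for _ in range(n_chunks)]
    record, idx = [], 0
    for line in fasta_lines:
        if line.startswith(">") and record:
            chunks[idx % n_chunks].append("".join(record))
            idx += 1
            record = []
        record.append(line)
    if record:  # flush the final record
        chunks[idx % n_chunks].append("".join(record))
    return chunks

# usage (assumed file names): write each chunk to MYGEN.part1.fasta, ...,
# run one `blastp -num_threads ...` per chunk, then concatenate outputs.
```

Round-robin assignment keeps the chunks balanced even if the file is sorted by length.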
Thank you so much for your help!
It should be faster, though you haven't told us how many proteins are in your query file.
I suggest using as many threads as are available; it also helps to run it on a fast computer. Finally, I suggest you try this with a single sequence first to make sure everything works as intended.
I have 13,576 proteins in my file.
I tried on a small set of 3 proteins and it worked fine!