Hello everyone,
I'm a new user of MAKER and I'm seeking assistance with the protocol I'm using. Currently, I'm annotating the genome of a non-model ascomycete fungus belonging to the family Sporocadaceae.
After running MAKER, I obtained merged FASTA and GFF files using fasta_merge and gff3_merge, respectively. I then renamed my outputs using maker_map_ids (to create the .all.id.map file), map_gff_ids for the GFF files, and map_fasta_ids for both the protein and transcript FASTA files.
Now I'm at the stage where I want to BLAST the predicted proteins against a database, and here my first questions arise:
Can I download the database from UniProt? Should I download the entire SwissProt protein set, or only the SwissProt entries in the Fungi category? Or should I download all fungal proteins available on UniProt? Despite these uncertainties, I decided to move forward to make sure my pipeline works correctly. I ran a blastp search of the proteins produced by MAKER against a protein database downloaded from SwissProt (Fungi category only).
I used this command:
blastp -db swissprot_fungi.fasta -query MYGEN.proteins.fasta -outfmt 5 -evalue 1e-5 -out MYGEN.proteins.xml -num_alignments 5 -num_threads 24
At this point, my second question arises. Is there a Python code or a similar tool that can combine the two files, MYGEN.proteins.fasta and MYGEN.proteins.xml, and return a GFF3 file?
In other words, I'm looking for code that does something like this:
python blast2annot.py -i MYGEN.proteins.fasta -b MYGEN.proteins.xml
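I'm not aware of a standard blast2annot.py, but the -outfmt 5 XML can be mined with nothing but the Python standard library. The sketch below (file names and the `Note=` attribute convention are my assumptions, not an official MAKER format) pulls the top hit per query and appends its description to matching mRNA lines of a GFF3 file. MAKER itself also ships accessory scripts (maker_functional_gff, maker_functional_fasta) for this step, which may fit your pipeline better.

```python
# Sketch: extract the best BLAST hit per query from -outfmt 5 XML and
# push the descriptions into a GFF3 file as Note= attributes.
# File names and the Note= convention are illustrative assumptions.
import xml.etree.ElementTree as ET

def best_hits(xml_source):
    """Return {query_id: description of top hit} from BLAST XML."""
    hits = {}
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "Iteration":
            qdef = elem.findtext("Iteration_query-def") or ""
            qid = qdef.split()[0] if qdef else ""
            top = elem.find(".//Hit")  # hits arrive pre-sorted by score
            if qid and top is not None:
                hits[qid] = top.findtext("Hit_def", "")
            elem.clear()  # keep memory flat on large XML files
    return hits

def annotate_gff(gff_in, gff_out, hits):
    """Append a Note= attribute to mRNA lines whose ID has a hit."""
    with open(gff_in) as src, open(gff_out, "w") as dst:
        for line in src:
            cols = line.rstrip("\n").split("\t")
            if len(cols) == 9 and cols[2] == "mRNA":
                # crude ID= extraction; adjust to your GFF3 attributes
                attrs = dict(kv.split("=", 1)
                             for kv in cols[8].split(";") if "=" in kv)
                desc = hits.get(attrs.get("ID", ""))
                if desc:
                    cols[8] += ";Note=" + desc.replace(";", "%3B")
            dst.write("\t".join(cols) + "\n")

# usage (assumed file names):
#   hits = best_hits("MYGEN.proteins.xml")
#   annotate_gff("MYGEN.all.gff", "MYGEN.annotated.gff", hits)
```

This only transfers the top-hit description; for proper functional annotation you would normally also filter by e-value and coverage.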
Thank you very much for your help.
SwissProt fasta: https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/uniprot_sprot.fasta.gz
Trembl fasta: https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/uniprot_trembl.fasta.gz
Complete set.
Thank you so much for your responses.
GenoMax, I've downloaded both of the databases you recommended. I combined them into a single FASTA file and built a database for the blastp search with makeblastdb. However, the merged FASTA file was so large that, despite running the blastp analysis on my server for several hours, no results were generated.
I also tried using makeblastdb with just the SwissProt dataset containing fungal sequences, and the analysis worked.
So I assume the search speed I'm seeing depends on the size of the database I build with makeblastdb.
That being said, I will try downloading the files in dat.gz format from the link Mensur Dlakic suggested.
At this point, here's the plan:
makeblastdb -in merged_db.fasta -dbtype prot
Is this procedure correct?
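Before running makeblastdb, the two UniProt downloads have to be decompressed and concatenated into one file. A minimal stdlib sketch (the file names are assumptions based on the UniProt FTP downloads) that streams the gzipped FASTAs straight into the merged file:

```python
# Sketch: stream gzipped UniProt FASTAs into one merged file, ready
# for `makeblastdb -in merged_db.fasta -dbtype prot`.
# File names below are assumptions, not required names.
import gzip
import shutil

def merge_gzipped_fastas(gz_paths, out_path):
    """Decompress each .gz FASTA in turn and append it to out_path."""
    with open(out_path, "wb") as out:
        for path in gz_paths:
            with gzip.open(path, "rb") as src:
                shutil.copyfileobj(src, out)

# usage (assumed file names):
# merge_gzipped_fastas(
#     ["uniprot_sprot.fasta.gz", "uniprot_trembl.fasta.gz"],
#     "merged_db.fasta",
# )
```

Streaming with copyfileobj avoids loading the very large TrEMBL file into memory.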
I think this will work. For the .dat -> .fasta conversion it may be helpful to download a utility called esl-reformat, which is part of the HMMER package: http://hmmer.org/
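If installing HMMER is inconvenient, a short stdlib parser can do the same conversion. The sketch below is an assumption-laden minimal reader of the SwissProt flatfile layout: it only looks at the ID, AC, and SQ/terminator records and ignores everything else, so treat it as a fallback rather than a full replacement for esl-reformat.

```python
# Sketch: minimal UniProt .dat -> FASTA converter (fallback for
# esl-reformat). Handles only ID, AC, and SQ records of the standard
# SwissProt flatfile layout; all other record types are skipped.
def dat_to_fasta(dat_lines):
    """Yield '>accession entry_name\\nsequence' records from flatfile lines."""
    name = acc = None
    seq = []
    in_seq = False
    for line in dat_lines:
        if line.startswith("ID"):
            name = line.split()[1]          # entry name, e.g. ABC1_YEAST
        elif line.startswith("AC") and acc is None:
            acc = line.split()[1].rstrip(";")  # primary accession only
        elif line.startswith("SQ"):
            in_seq = True                   # sequence block starts next line
        elif line.startswith("//"):
            yield ">%s %s\n%s" % (acc, name, "".join(seq))
            name = acc = None
            seq = []
            in_seq = False
        elif in_seq:
            seq.append(line.replace(" ", "").strip())

# usage (assumed file name):
# with open("uniprot_sprot_fungi.dat") as dat, open("fungi.fasta", "w") as out:
#     for rec in dat_to_fasta(dat):
#         out.write(rec + "\n")
```

For the gzipped .dat.gz downloads, wrap the input in gzip.open(..., "rt") first.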
I ran blastp, but even after 50:30 hours of runtime the analysis had still not finished for the whole set of proteins.
Does anyone have suggestions on how to speed up this process? Can the -num_threads option help me increase the speed of this analysis?
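-num_threads does help, but it only scales within one machine and one process. A common complementary trick is to split the query FASTA into chunks and run one blastp job per chunk (in parallel or across nodes), then concatenate the outputs. A stdlib sketch of the splitting step; the chunk file names in the usage note are assumptions:

```python
# Sketch: split a protein FASTA into N roughly equal chunks so several
# blastp jobs can run in parallel. Expects lines with their newlines
# kept (e.g. from splitlines(keepends=True) or file iteration).
def split_fasta(fasta_lines, n_chunks):
    """Return n_chunks lists of complete FASTA record strings, round-robin."""
    chunks = [[] for _ in range(n_chunks)]
    record, idx = [], 0
    for line in fasta_lines:
        if line.startswith(">") and record:
            chunks[idx % n_chunks].append("".join(record))
            idx += 1
            record = []
        record.append(line)
    if record:  # flush the final record
        chunks[idx % n_chunks].append("".join(record))
    return chunks

# usage (assumed file names): write each chunk to MYGEN.part1.fasta, ...,
# run one `blastp -num_threads ...` per chunk, then concatenate outputs.
```

Round-robin assignment keeps the chunks balanced even if the file is sorted by length.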
Thank you so much for your help!
It should be faster, though you haven't told us how many proteins are in your query file.
I suggest using as many threads as are available; it also helps to run it on a fast computer. Finally, I suggest you try this with a single sequence first to make sure everything works as intended.
I have 13,576 proteins in my file.
I tried on a small set of 3 proteins and it worked fine!