Help with Funannotate
5 weeks ago
SomeOne ▴ 170

Hi,

I am following the Funannotate pipeline to annotate a fungal genome for which I also have RNA-Seq data. I followed the tutorial in the Funannotate docs.

I am confused about some steps because some tools come up again and again, and I cannot tell whether I should run them again.

This is the pipeline I followed:

  1. Funannotate clean: to remove contigs < 1000 bp
  2. Funannotate sort: to sort contigs (big -> small)
  3. Repeat annotation using RepeatModeler/RepeatMasker
  4. Funannotate train with genome.fasta RNASeq_1.fq RNASeq_2.fq --jaccard_clip --no_trimmomatic
  5. Funannotate predict
  6. Funannotate update with --jaccard_clip --no_trimmomatic

This is where I am confused.

  1. Interproscan:

At this point the tutorial says to run funannotate iprscan, which generated an empty file for me without any log to identify the issue (mentioned in one of my old unanswered posts here).

So I installed **Interproscan-v5.73-104.0** locally, and also installed SignalP-4.1, Phobius-1.0.1 and TMHMM-2.0c, because they were showing up as deprecated analyses and I wanted to include them. I ran it with the command below and then listed the analyses that have records in the resulting .gff3 file:

interproscan \
    --input $Fanno_out/$prefix/update_results/proteins.fa \
    --output-file-base /interproscan_results/proteins.fa.interproscan \
    --cpu 90 --disable-precalc --goterms --iprlookup --pathways --seqtype p \
    --formats XML,TSV,GFF3 \
    --excl-applications SignalP_GRAM_NEGATIVE,SignalP_GRAM_POSITIVE \
    --tempdir /interproscan_results/ --verbose

Command:
grep -v "#" proteins.fa.interproscan.gff3 | awk '{print $2}' | sort | uniq

Results:
CDD, Coils, FunFam, Gene3D, Hamap, MobiDBLite, NCBIfam, PANTHER, Pfam, Phobius, PIRSF, PIRSR, PRINTS, ProSitePatterns, ProSiteProfiles, SFLD, SignalP_EUK, SMART, SUPERFAMILY, TMHMM
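The same tally can be sketched in Python. This is only an illustration: it splits on tabs rather than on any whitespace, counts how many records each analysis contributes, and stops at a `##FASTA` directive if the file has one. The function name `count_sources` and the sample records are made up for this example and are not real InterProScan output.

```python
from collections import Counter

def count_sources(gff3_lines):
    """Tally column 2 (source) of GFF3 feature lines."""
    counts = Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # anything after this directive is sequence, not features
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) == 9:
            counts[cols[1]] += 1
    return counts

# Hypothetical records, invented for illustration only.
sample = [
    "##gff-version 3",
    "FUN_000001-T1\t.\tpolypeptide\t1\t300\t.\t+\t.\tID=FUN_000001-T1",
    "FUN_000001-T1\tPfam\tprotein_match\t10\t120\t1e-20\t+\t.\tName=PF00001",
    "FUN_000001-T1\tPfam\tprotein_match\t150\t280\t1e-15\t+\t.\tName=PF00002",
    "FUN_000001-T1\tPhobius\tprotein_match\t1\t22\t.\t+\t.\tName=SIGNAL_PEPTIDE",
    "##FASTA",
    ">FUN_000001-T1",
    "MKT",
]

counts = count_sources(sample)
for source, n in sorted(counts.items()):
    print(source, n)
```

On a real file, an open file handle can be passed in place of `sample`, for example `count_sources(open("proteins.fa.interproscan.gff3"))`.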
  1. antiSMASH Fungi:

Next I ran **antiSMASH-v7.1.0**, which I also installed locally, with this command:

antismash \
    --taxon fungi --cpus 94 --verbose --debug --genefinding-tool none --no-abort-on-invalid-records \
    --fullhmmer --cassis --clusterhmmer --tigrfam --asf --cc-mibig --cb-general --cb-subclusters --cb-knownclusters --pfam2go --rre --smcog-trees --tfbs \
    --output-basename antismash --output-dir $Fanno_out/$prefix/antismash_results \
    $Fanno_out/$prefix/update_results/resulting.gbk

However, antiSMASH reports errors about multiple CDS features with the same coordinates (because I used RNA-seq data, my .gbk file has multiple transcripts per gene), and I am still looking for a way to solve this.

  1. Phobius:

At this stage the tutorial says to optionally run Phobius, and I am unsure whether I should, since it already ran during the Interproscan step.

  1. SignalP:

Should I also run SignalP separately and pass the resulting file to the Funannotate annotate step, or let the annotate step run SignalP itself?

  1. Funannotate annotate:
funannotate annotate \
        --input $Fanno_out/$prefix/ \
        --antismash $Fanno_out/$prefix/antismash_results/antismash.gbk \
        --iprscan $Fanno_out/$prefix/interproscan_results/proteins.fa.xml \
        --cpus 94

Questions:

  1. Should I run SignalP separately?
  2. Should I run Phobius separately, or is the run within Interproscan enough?
  3. Is this overall approach to annotating assemblies correct, or am I missing anything?

Thank you.

genomes funannotate fungus annotation • 609 views
4 weeks ago

1) No, if you ran it within Interproscan it should be OK (keep in mind that you need to configure Interproscan to include this analysis).

2) Same answer as above.

(I also don't think that funannotate requires them as a separate input file, but I'm not 100% sure of that.)

3) It looks like you're doing it right. Other approaches exist, but this should already get you a long way.

To resolve the multiple CDS with the same coordinates, you could run, for instance, AGAT on your GFF annotation files; it can 'merge' CDS and report a 'representative' one per locus. Afterwards you'll need to extract a fresh protein set using the new GFF file and use that as input for your analysis.

Hi, thank you for your reply. I was kind of losing hope that I would get any suggestions on these queries.

  1. I did configure Interproscan to include all these analyses.
  2. The Funannotate annotate step (the last one in the annotation pipeline) doesn't require these files to continue with the analysis, but there are options to provide them as optional extras.
  3. I tried agat_sp_keep_longest_isoform.pl on my resulting GFF annotation file, but it still kept some transcripts/CDS which had the same name (I guess they were the longest), and antiSMASH fungi skipped them again. If you are referring to another AGAT tool, please let me know.

Again, thank you for clearing up the confusion I was having.

OK, nice ... so your analysis is working now (i.e. you get meaningful results)?

Perhaps have a detailed (manual) look at some of the genes that still have overlapping coordinates and see what is going on. AGAT cannot perform magic: if for some reason it is not safely possible to merge them or promote a representative, it will not do so. Do they pose an issue for your results?

When I pass the .gbk file to antiSMASH, it raises the issue of multiple CDS at the same location and skips at least six of the longest scaffolds. I used the option --no-abort-on-invalid-records; otherwise it just stops. The log file shows this:

WARNING  26/02 14:16:59   Ignoring invalid record 'NP02_scf_1' Multiple CDS features have the same location: join{[1038961:1040307](-), [1038711:1038912](-), [1038308:1038654](-)}
WARNING  26/02 14:17:01   Ignoring invalid record 'NP02_scf_3' Multiple CDS features have the same location: join{[916563:916979](+), [917025:918196](+)}
WARNING  26/02 14:17:02   Ignoring invalid record 'NP02_scf_4' Multiple CDS features have the same location: join{[2918254:2918312](+), [2918481:2918514](+), [2918573:2918735](+), [2918793:2919095](+), [2919144:2919654](+)}
WARNING  26/02 14:17:02   Ignoring invalid record 'NP02_scf_5' Multiple CDS features have the same location: join{[2508044:2509360](+), [2509413:2509723](+)}
WARNING  26/02 14:17:03   Ignoring invalid record 'NP02_scf_6' Multiple CDS features have the same location: join{[1900341:1900568](-), [1900265:1900281](-), [1900007:1900144](-), [1899799:1899953](-)}
WARNING  26/02 14:17:04   Ignoring invalid record 'NP02_scf_7' Multiple CDS features have the same location: join{[2843795:2844173](-), [2843653:2843740](-)}
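
Warnings like these can be checked against the GFF3 before running antiSMASH. The sketch below groups CDS segments by their parent transcript and reports transcripts whose complete CDS chains are identical. One plausible cause, which is an assumption and not something confirmed in this thread, is isoforms that differ only in their UTRs. The function names and the sample records are hypothetical, and antiSMASH's own check may differ in detail.

```python
from collections import defaultdict

def parse_attrs(col9):
    """Turn 'ID=x;Parent=y;' into a dict."""
    attrs = {}
    for item in col9.strip().strip(";").split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def duplicate_cds_chains(gff3_lines):
    """Return groups of transcript IDs whose CDS segments share all coordinates."""
    segments = defaultdict(list)  # transcript ID -> [(start, end), ...]
    place = {}                    # transcript ID -> (seqid, strand)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        parent = parse_attrs(cols[8]).get("Parent")
        if parent is None:
            continue
        for tid in parent.split(","):
            segments[tid].append((int(cols[3]), int(cols[4])))
            place[tid] = (cols[0], cols[6])
    chains = defaultdict(list)
    for tid, segs in segments.items():
        # Key on scaffold, strand and the sorted list of CDS segments.
        chains[place[tid] + (tuple(sorted(segs)),)].append(tid)
    return [sorted(tids) for tids in chains.values() if len(tids) > 1]

# Hypothetical records: two isoforms with the same CDS, plus an unrelated gene.
sample = [
    "##gff-version 3",
    "scf_1\tfunannotate\tCDS\t100\t200\t.\t+\t0\tID=FUN_000010-T1.cds;Parent=FUN_000010-T1;",
    "scf_1\tfunannotate\tCDS\t300\t400\t.\t+\t1\tID=FUN_000010-T1.cds;Parent=FUN_000010-T1;",
    "scf_1\tfunannotate\tCDS\t100\t200\t.\t+\t0\tID=FUN_000010-T2.cds;Parent=FUN_000010-T2;",
    "scf_1\tfunannotate\tCDS\t300\t400\t.\t+\t1\tID=FUN_000010-T2.cds;Parent=FUN_000010-T2;",
    "scf_1\tfunannotate\tCDS\t900\t1000\t.\t+\t0\tID=FUN_000011-T1.cds;Parent=FUN_000011-T1;",
]

dupes = duplicate_cds_chains(sample)
for group in dupes:
    print("identical CDS chain:", ", ".join(group))
```

If the reported groups span different genes rather than isoforms of one gene, keeping only the longest isoform per gene would not be expected to remove them.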

I found suggestions to use agat_sp_keep_longest_isoform.pl to keep only the longest isoform, so I ran it on the annotation:

agat_sp_keep_longest_isoform.pl --gff funannotate/updated_results/my_annotation.gff3 -o funannotate/antismash_results/my_annotation_longest_isoform.gff3

Then I ran antiSMASH, passing this new GFF3 to the --genefinding-gff3 option together with the genome FASTA, using this command:

antismash \
    --taxon fungi --cpus 94 --verbose --debug --genefinding-tool none \
    --fullhmmer --cassis --clusterhmmer --tigrfam --asf --cc-mibig --cb-general --cb-subclusters --cb-knownclusters --pfam2go --rre --smcog-trees --tfbs \
    --output-basename NRRL32931_antismash --output-dir ./ \
    --genefinding-gff3 my_annotation_longest_isoform.gff3 genome.fa

and it just stops with the following output:

DEBUG    04/03 12:15:50   Loading annotations from GFF file
ERROR    04/03 12:15:55   could not parse records from GFF3 file
ERROR:   could not parse records from GFF3 file

My original GFF3 file has 129118 entries, and the 3rd gene looks like this:

NP02_scf_1  funannotate gene    19484   25478   .   -   .   ID=FUN_000003;
NP02_scf_1  funannotate mRNA    19484   25478   .   -   .   ID=FUN_000003-T1;Parent=FUN_000003;product=hypothetical protein;
NP02_scf_1  funannotate five_prime_UTR  25339   25478   .   -   .   ID=FUN_000003-T1.utr5p1;Parent=FUN_000003-T1;
NP02_scf_1  funannotate exon    25065   25478   .   -   .   ID=FUN_000003-T1.exon1;Parent=FUN_000003-T1;
NP02_scf_1  funannotate exon    24895   25015   .   -   .   ID=FUN_000003-T1.exon2;Parent=FUN_000003-T1;
NP02_scf_1  funannotate exon    24178   24845   .   -   .   ID=FUN_000003-T1.exon3;Parent=FUN_000003-T1;
NP02_scf_1  funannotate exon    22030   24123   .   -   .   ID=FUN_000003-T1.exon4;Parent=FUN_000003-T1;
NP02_scf_1  funannotate exon    21658   21965   .   -   .   ID=FUN_000003-T1.exon5;Parent=FUN_000003-T1;
NP02_scf_1  funannotate exon    21499   21600   .   -   .   ID=FUN_000003-T1.exon6;Parent=FUN_000003-T1;
NP02_scf_1  funannotate exon    19484   21437   .   -   .   ID=FUN_000003-T1.exon7;Parent=FUN_000003-T1;
NP02_scf_1  funannotate three_prime_UTR 19484   21011   .   -   .   ID=FUN_000003-T1.utr3p1;Parent=FUN_000003-T1;
NP02_scf_1  funannotate CDS 25065   25338   .   -   0   ID=FUN_000003-T1.cds;Parent=FUN_000003-T1;
NP02_scf_1  funannotate CDS 24895   25015   .   -   2   ID=FUN_000003-T1.cds;Parent=FUN_000003-T1;
NP02_scf_1  funannotate CDS 24178   24845   .   -   1   ID=FUN_000003-T1.cds;Parent=FUN_000003-T1;
NP02_scf_1  funannotate CDS 22030   24123   .   -   2   ID=FUN_000003-T1.cds;Parent=FUN_000003-T1;
NP02_scf_1  funannotate CDS 21658   21965   .   -   2   ID=FUN_000003-T1.cds;Parent=FUN_000003-T1;
NP02_scf_1  funannotate CDS 21499   21600   .   -   0   ID=FUN_000003-T1.cds;Parent=FUN_000003-T1;
NP02_scf_1  funannotate CDS 21012   21437   .   -   0   ID=FUN_000003-T1.cds;Parent=FUN_000003-T1;

and the updated GFF3 file has 120063 entries, with the 3rd gene looking like this:

NP02_scf_1  funannotate gene    19484   25478   .   -   .   ID=FUN_000003
NP02_scf_1  funannotate mRNA    19484   25478   .   -   .   ID=FUN_000003-T1;Parent=FUN_000003;product=hypothetical protein
NP02_scf_1  funannotate exon    19484   21437   .   -   .   ID=FUN_000003-T1.exon7;Parent=FUN_000003-T1
NP02_scf_1  funannotate exon    21499   21600   .   -   .   ID=FUN_000003-T1.exon6;Parent=FUN_000003-T1
NP02_scf_1  funannotate exon    21658   21965   .   -   .   ID=FUN_000003-T1.exon5;Parent=FUN_000003-T1
NP02_scf_1  funannotate exon    22030   24123   .   -   .   ID=FUN_000003-T1.exon4;Parent=FUN_000003-T1
NP02_scf_1  funannotate exon    24178   24845   .   -   .   ID=FUN_000003-T1.exon3;Parent=FUN_000003-T1
NP02_scf_1  funannotate exon    24895   25015   .   -   .   ID=FUN_000003-T1.exon2;Parent=FUN_000003-T1
NP02_scf_1  funannotate exon    25065   25478   .   -   .   ID=FUN_000003-T1.exon1;Parent=FUN_000003-T1
NP02_scf_1  funannotate CDS 21012   21437   .   -   0   ID=FUN_000003-T1.cds;Parent=FUN_000003-T1
NP02_scf_1  funannotate CDS 21499   21600   .   -   0   ID=FUN_000003-T1.cds;Parent=FUN_000003-T1
NP02_scf_1  funannotate CDS 21658   21965   .   -   2   ID=FUN_000003-T1.cds;Parent=FUN_000003-T1
NP02_scf_1  funannotate CDS 22030   24123   .   -   2   ID=FUN_000003-T1.cds;Parent=FUN_000003-T1
NP02_scf_1  funannotate CDS 24178   24845   .   -   1   ID=FUN_000003-T1.cds;Parent=FUN_000003-T1
NP02_scf_1  funannotate CDS 24895   25015   .   -   2   ID=FUN_000003-T1.cds;Parent=FUN_000003-T1
NP02_scf_1  funannotate CDS 25065   25338   .   -   0   ID=FUN_000003-T1.cds;Parent=FUN_000003-T1
NP02_scf_1  funannotate five_prime_UTR  25339   25478   .   -   .   ID=FUN_000003-T1.utr5p1;Parent=FUN_000003-T1
NP02_scf_1  funannotate three_prime_UTR 19484   21011   .   -   .   ID=FUN_000003-T1.utr3p1;Parent=FUN_000003-T1

Can you suggest any approach here?

The AGAT GFF file looks fine at first sight (and I'm pretty confident it produces valid GFF3 files).

Could there be a typo in your antiSMASH command line? The AGAT output file is called my_annotation_longest_isoform.gff3, while the input for antiSMASH says longest_isoform.gff3 genome.fa.

I'm not at all familiar with antiSMASH though ...

I fixed the typo; it was only in this post.

> the AGAT GFF files looks fine at first sight (and I'm pretty confident it reports valid GFF3 files)

The only difference I see from the original is the sorting of the exons, and it has all those extra CDS with identical IDs.

But thanks again for your response.

There are no extra CDS; those were already in your original file.

Indeed the ordering is different, but I'm not sure it's wrong in essence. Perhaps try a file sorted on start coordinates? For instance, make a file containing only this one gene and adapt it until it is accepted; then you'll know what the exact issue is.

I don't think AGAT will correct the IDs of the CDSs by default (they were in the original file as well, or rather lacking as well). There might be a different AGAT sub-tool that can fix this (is there not one for fixing or correcting? Or have a look at their documentation site for advice).
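
The single-gene test file suggested above could be produced with a sketch like the following. It is an illustration, not a tested fix for the antiSMASH parse error. It collects one gene together with every feature descended from it and sorts the result by start coordinate. The helper names `gff`, `parse_attrs` and `extract_gene` are hypothetical, the first six sample records are abridged from the FUN_000003 record earlier in the thread, and the FUN_000004 gene is made up.

```python
def parse_attrs(col9):
    """Turn 'ID=x;Parent=y;' into a dict."""
    attrs = {}
    for item in col9.strip().strip(";").split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def extract_gene(gff3_lines, gene_id):
    """Return the lines for one gene and its descendants, sorted by start."""
    rows = []
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:
            rows.append(cols)
    wanted = {gene_id}   # IDs of the gene and of descendants found so far
    selected = set()     # indices of rows to keep
    grew = True
    while grew:          # repeat so that children listed before parents are found
        grew = False
        for i, cols in enumerate(rows):
            if i in selected:
                continue
            attrs = parse_attrs(cols[8])
            parents = attrs.get("Parent", "").split(",")
            if attrs.get("ID") == gene_id or wanted.intersection(parents):
                selected.add(i)
                if "ID" in attrs:
                    wanted.add(attrs["ID"])
                grew = True
    # Start ascending; on ties the longer feature first, otherwise file order.
    picked = sorted((rows[i] for i in sorted(selected)),
                    key=lambda c: (int(c[3]), -int(c[4])))
    return ["\t".join(c) for c in picked]

def gff(ftype, start, end, phase, attrs, strand="-"):
    """Build one tab-separated GFF3 line on scaffold NP02_scf_1."""
    return "\t".join(["NP02_scf_1", "funannotate", ftype, str(start),
                      str(end), ".", strand, phase, attrs])

sample = [
    gff("gene", 19484, 25478, ".", "ID=FUN_000003;"),
    gff("mRNA", 19484, 25478, ".", "ID=FUN_000003-T1;Parent=FUN_000003;"),
    gff("exon", 25065, 25478, ".", "ID=FUN_000003-T1.exon1;Parent=FUN_000003-T1;"),
    gff("exon", 19484, 21437, ".", "ID=FUN_000003-T1.exon7;Parent=FUN_000003-T1;"),
    gff("CDS", 25065, 25338, "0", "ID=FUN_000003-T1.cds;Parent=FUN_000003-T1;"),
    gff("CDS", 21012, 21437, "0", "ID=FUN_000003-T1.cds;Parent=FUN_000003-T1;"),
    gff("gene", 30000, 31000, ".", "ID=FUN_000004;", strand="+"),  # made-up gene
    gff("mRNA", 30000, 31000, ".", "ID=FUN_000004-T1;Parent=FUN_000004;", strand="+"),
]

subset = extract_gene(sample, "FUN_000003")
print("##gff-version 3")
print("\n".join(subset))
```

Feeding such a small file to antiSMASH and changing one property at a time (feature order, trailing semicolons, CDS IDs) is one way to narrow down which of them the parser rejects.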
