Hi,
I am following the Funannotate pipeline to annotate a fungal genome for which I also have RNA-Seq data. I followed the tutorial in the Funannotate docs.
I am confused about some steps because some tools come up again and again, and I cannot tell whether I should run them again.
This is the pipeline I followed:
- Funannotate clean: to remove contigs < 1000 bp
- Funannotate sort: to sort contigs (big -> small)
- Repeat annotation using RepeatModeler/RepeatMasker
- Funannotate train with genome.fasta RNASeq_1.fq RNASeq_2.fq --jaccard_clip --no_trimmomatic
- Funannotate predict
- Funannotate update with --jaccard_clip --no_trimmomatic
This is where I am confused.
- InterProScan:
At this point the tutorial says to run funannotate iprscan, which was generating an empty file for me without any log to identify the issue, as mentioned in one of my old unanswered posts here.
So I installed **Interproscan-v5.73-104.0** locally and also installed SignalP-4.1, Phobius-1.0.1 and TMHMM-2.0c, as they were showing up as deprecated analyses and I wanted to include them. I ran it with the command below; the analyses recorded in the resulting .gff3 file are listed under Results.
interproscan \
--input $Fanno_out/$prefix/update_results/proteins.fa \
--output-file-base /interproscan_results/proteins.fa.interproscan \
--cpu 90 --disable-precalc --goterms --iprlookup --pathways --seqtype p \
--formats XML,TSV,GFF3 \
--excl-applications SignalP_GRAM_NEGATIVE,SignalP_GRAM_POSITIVE \
--tempdir /interproscan_results/ --verbose
Command:
grep -v "#" proteins.fa.interproscan.gff3 | awk '{print $2}' | sort | uniq
Results:
CDD, Coils, FunFam, Gene3D, Hamap, MobiDBLite, NCBIfam, PANTHER, Pfam, Phobius, PIRSF, PIRSR, PRINTS, ProSitePatterns, ProSiteProfiles, SFLD, SignalP_EUK, SMART, SUPERFAMILY, TMHMM
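A slightly more defensive variant of the command above, as a sketch only. It assumes the GFF3 is tab-separated and may end with a `##FASTA` section (which the GFF3 format allows); in that case `grep -v "#"` would let sequence lines through, so the sketch stops reading at `##FASTA` and also counts features per analysis. The file `example.gff3`, its contents and the helper name `count_sources` are made up for illustration.

```shell
# Build a tiny, made-up GFF3 so the sketch can be run on its own.
printf '##gff-version 3\n' > example.gff3
printf 'prot1\t.\tpolypeptide\t1\t300\t.\t+\t.\tID=prot1\n' >> example.gff3
printf 'prot1\tPfam\tprotein_match\t10\t120\t1e-5\t+\t.\tName=PF00001\n' >> example.gff3
printf 'prot1\tPhobius\tprotein_match\t1\t20\t.\t+\t.\tName=SIGNAL_PEPTIDE\n' >> example.gff3
printf 'prot2\tPfam\tprotein_match\t5\t90\t1e-9\t+\t.\tName=PF00002\n' >> example.gff3
printf '##FASTA\n>prot1 some description\nMKTAAA\n' >> example.gff3

# Count features per analysis (column 2):
#   - stop at the ##FASTA section so sequence lines are never parsed
#   - skip comment lines and the "." source placeholder
count_sources() {
    awk -F'\t' '
        /^##FASTA/ { exit }
        /^#/       { next }
        $2 != "."  { n[$2]++ }
        END        { for (s in n) print s "\t" n[s] }
    ' "$1" | sort
}

count_sources example.gff3
```

On a real run, the argument would be proteins.fa.interproscan.gff3 instead of the example file.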
- antiSMASH Fungi:
Next I ran **antiSMASH-v7.1.0**, which I also installed locally, with the following command:
antismash \
--taxon fungi --cpus 94 --verbose --debug --genefinding-tool none --no-abort-on-invalid-records \
--fullhmmer --cassis --clusterhmmer --tigrfam --asf --cc-mibig --cb-general --cb-subclusters --cb-knownclusters --pfam2go --rre --smcog-trees --tfbs \
--output-basename antismash --output-dir $Fanno_out/$prefix/antismash_results \
$Fanno_out/$prefix/update_results/resulting.gbk
However, antiSMASH gives me errors about multiple CDS with the same coordinates (because I used RNA-seq data, my .gbk file has multiple transcripts per gene), and I am still looking for a way to solve this.
- Phobius:
At this stage the tutorial says to optionally run Phobius, and I am confused whether I should run it or not, as it already ran during the InterProScan step.
- SignalP:
Should I also run SignalP individually and then pass the resulting file to the Funannotate annotate step, or let the annotate step run SignalP itself?
- Funannotate annotate:
funannotate annotate \
--input $Fanno_out/$prefix/ \
--antismash $Fanno_out/$prefix/antismash_results/antismash.gbk \
--iprscan $Fanno_out/$prefix/interproscan_results/proteins.fa.xml \
--cpus 94
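Since funannotate iprscan earlier produced an empty file without any log, it may be worth confirming that the files handed to annotate exist and are non-empty before starting a long run. A minimal sketch: the helper name `check_inputs` and the demo file names are made up, and nothing here is part of Funannotate itself.

```shell
# Report every path that is missing or empty; return non-zero if any is.
check_inputs() {
    bad=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            bad=1
        fi
    done
    return "$bad"
}

# Demo with throwaway files (hypothetical names).
tmp=$(mktemp -d)
echo "<xml/>" > "$tmp/ok.xml"
: > "$tmp/empty.gbk"

check_inputs "$tmp/ok.xml" && echo "ok.xml looks usable"
check_inputs "$tmp/ok.xml" "$tmp/empty.gbk" || echo "fix the inputs above first"
```

In this pipeline the arguments would be the paths given to --antismash and --iprscan.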
Questions:
- Should I run SignalP individually?
- Should I run Phobius individually, or is the one that ran with InterProScan enough?
- Is this overall approach to annotating assemblies correct, or am I missing anything?
Thank you.
Hi, thank you for your reply. I was kind of losing hope that I would get any suggestions on these queries.
The Funannotate annotate step (the last one in the annotation pipeline) doesn't require these files to continue with the analysis, but there are options to provide them as optional extras.
I tried agat_sp_keep_longest_isoform.pl on my resulting GFF annotation file, but it still kept some transcripts/CDS that had the same name (I guess they were the longest), and antiSMASH-Fungi skipped them again. If there is any other AGAT tool you are referring to, please let me know.
Again, thank you for clearing up the confusion I was having.
OK, nice ... so your analysis is working now (i.e. you get meaningful results)?
Perhaps have a detailed (manual) look at some of the genes that still have overlapping coordinates and see what is going on. AGAT cannot perform magic ... if for some reason it is not safely possible to merge them or promote a representative, it will of course not do so. Do they pose an issue for your results?
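To find candidates for such a manual look, something along these lines could list CDS segments whose exact coordinates appear on more than one line, together with the parent transcripts involved. This is only a sketch under assumptions: a 9-column tab-separated GFF3 with a `Parent=` attribute on CDS lines. The test data and the helper name `shared_cds` are made up, and it flags shared segments in general, which is a starting point rather than a reproduction of the check antiSMASH performs.

```shell
# Made-up GFF3: two isoforms (mrna1, mrna2) share one CDS segment.
tmp_gff=$(mktemp)
{
    printf 'scaf1\tfun\tgene\t100\t900\t.\t+\t.\tID=gene1\n'
    printf 'scaf1\tfun\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n'
    printf 'scaf1\tfun\tCDS\t100\t300\t.\t+\t0\tID=cds1;Parent=mrna1\n'
    printf 'scaf1\tfun\tCDS\t500\t900\t.\t+\t0\tID=cds1;Parent=mrna1\n'
    printf 'scaf1\tfun\tmRNA\t100\t700\t.\t+\t.\tID=mrna2;Parent=gene1\n'
    printf 'scaf1\tfun\tCDS\t100\t300\t.\t+\t0\tID=cds2;Parent=mrna2\n'
    printf 'scaf1\tfun\tCDS\t500\t700\t.\t+\t0\tID=cds2;Parent=mrna2\n'
} > "$tmp_gff"

# Print "seqid:start-end(strand) <TAB> parent1,parent2,..." for every CDS
# segment whose exact coordinates occur on more than one CDS line.
shared_cds() {
    awk -F'\t' '
        /^##FASTA/ { exit }
        /^#/ || $3 != "CDS" { next }
        {
            parent = "NA"
            n = split($9, attr, ";")
            for (i = 1; i <= n; i++)
                if (attr[i] ~ /^Parent=/) parent = substr(attr[i], 8)
            key = $1 ":" $4 "-" $5 "(" $7 ")"
            if (key in parents) parents[key] = parents[key] "," parent
            else                parents[key] = parent
            count[key]++
        }
        END { for (k in count) if (count[k] > 1) print k "\t" parents[k] }
    ' "$1" | sort
}

shared_cds "$tmp_gff"
```

Running it on both the original and the AGAT-filtered GFF3 would show whether the filtering actually removed the shared segments.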
When I pass the .gbk file to antiSMASH, it raises the issue of multiple CDS at the same locations and skips at least the first 6 (longest) scaffolds. I used the option --no-abort-on-invalid-records, otherwise it just stops. The log file shows this:
I found suggestions to use agat_sp_keep_longest_isoform.pl to keep only the longest isoform, so I ran it on the annotation file.
Then I ran antiSMASH, passing this new GFF3 to the --genefinding-gff3 option together with my_genome.fa, using this command:
and it just stops with the following stdout:
My original GFF3 file has 129118 entries and the 3rd gene looks like this:
and the updated GFF3 file has 120063 entries, with the 3rd gene looking like this:
Can you suggest any approach here?
The AGAT GFF file looks fine at first sight (and I'm pretty confident it reports valid GFF3 files).
Could there be a typo in your antiSMASH command line? The AGAT output file is called my_annotation_longest_isoform.gff3, while the input for antiSMASH says longest_isoform.gff3 genome.fa.
I'm totally not familiar with antiSMASH though ...
I fixed the typo; it was only in this post.
The only difference I see from the original is the sorting of exons, and it has all those extra CDS with similar IDs.
But thanks again for your response.
There are no extra CDS; those were already there in your original file.
Indeed the ordering is different, but I'm not sure it's wrong in essence. Perhaps give it a try with a file sorted on start coordinates? For instance, make a file with just this one gene in it and adapt it until it is accepted; then you'll know what the exact issue is.
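A rough sketch of how such a one-gene test file could be made. Assumptions: a tab-separated GFF3 where features link to each other through `ID=` and a single `Parent=` value, and where parents are listed before their children. The gene ID, the test data and the helper name `one_gene_sorted` are made up.

```shell
# Made-up GFF3 with two genes; the exons of gene1 are deliberately out of order.
in_gff=$(mktemp)
{
    printf '##gff-version 3\n'
    printf 'scaf1\tfun\tgene\t100\t900\t.\t+\t.\tID=gene1\n'
    printf 'scaf1\tfun\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n'
    printf 'scaf1\tfun\texon\t500\t900\t.\t+\t.\tID=ex2;Parent=mrna1\n'
    printf 'scaf1\tfun\texon\t100\t300\t.\t+\t.\tID=ex1;Parent=mrna1\n'
    printf 'scaf1\tfun\tgene\t2000\t2500\t.\t-\t.\tID=gene2\n'
    printf 'scaf1\tfun\tmRNA\t2000\t2500\t.\t-\t.\tID=mrna2;Parent=gene2\n'
} > "$in_gff"

# Keep the feature with the given ID plus everything descending from it,
# then sort by start ascending and end descending (stable sort), so that
# parents stay ahead of the children they contain.
one_gene_sorted() {
    awk -F'\t' -v target="$2" '
        /^#/ { next }
        {
            id = ""; parent = ""
            n = split($9, attr, ";")
            for (i = 1; i <= n; i++) {
                if (attr[i] ~ /^ID=/)     id = substr(attr[i], 4)
                if (attr[i] ~ /^Parent=/) parent = substr(attr[i], 8)
            }
            if (id == target || (parent in keep)) {
                if (id != "") keep[id] = 1
                print
            }
        }
    ' "$1" | sort -s -t "$(printf '\t')" -k4,4n -k5,5nr
}

echo '##gff-version 3'
one_gene_sorted "$in_gff" gene1
```

The resulting small file can then be edited by hand (order, IDs, attributes) and retried until the downstream tool accepts it.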
I don't think AGAT will correct the IDs for the CDSs by default (these were in the original file as well, or rather lacking there as well). There might be a different AGAT sub-tool that can fix this (is there not one for fixing or correcting? Or have a look at their doc site for advice).