19 months ago
Pegasus
Hi all,
Following the RNA-seq analysis workflow, I am trying to find Gene Ontology (GO) terms for a list of differentially expressed genes (DGEs) output by featureCounts > edgeR. I ran the RNA-seq analysis using either a RAST-annotated GTF file or an NCBI PGAP GTF file.
1 - In the RAST GTF file, the majority of genes look like the line below (no locus_tag, no transcript_id):
Scaffold_3 FIG CDS 1598714 1599913 . - 2 ID=fig|6666666.1005592.peg.4310;Name=Quinolone resistance NorA protein
2 - In the NCBI PGAP file, the majority of genes look like the line below (gene_id = transcript_id = locus_tag):
GeneMarkS-2+ stop_codon 235 237 . + 0 gene_id "JYU28_00005"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS-2+"; locus_tag "JYU28_00005"; partial "true"; product "IS5/IS1182 family transposase"; protein_id "MBO3282641.1"; transl_table "11"; exon_number "1";
In both cases, the gene IDs do not correspond to any database, not even RefSeq, so I couldn't convert the DGE list to Entrez, Ensembl, or UniProt IDs that I could use for GO enrichment analysis.
I would appreciate any help or suggestions for solving this issue.
Thank you
You will need to annotate these yourself by doing additional work. The pipelines you mention are likely producing computer predictions with no additional validation/annotation.
Thank you GenoMax. I checked the Biostar-booklet and could not find a solution for this issue, since the organism is not a model one. Can you please recommend a pipeline, tool, or workflow for this kind of annotation?
Thanks
Since you appear to be working with bacterial genomes, prokka (LINK) is probably the program of choice. For eukaryotic genomes you would go to maker and others.
But as I read in another post, "Prokka Annotation or NCBI Annotation" (reply by Mensur Dlakic):
"If I remember correctly, prokka comes only with HAMAP database of HMMs, which will produce terrible annotations on prokaryotic genomes. To get good annotations you would need to install at least Pfam and TIGRfams. Don't know if you have done that or not, but you can find out by looking at prokka's annotations. If there are many hypothetical proteins for prokka where NCBI files have meaningful annotations, chances are that you don't have any extra prokka HMM databases. If you are literally comparing identical genomes, it may be better to go with NCBI annotations"
Since the NCBI annotations in your case are automated and not very useful, it is up to you to decide whether you want to take the time to complete the annotations properly.
You could do your DE analysis with those IDs and then spend some time manually annotating the top DE genes if you don't want to spend the time to make Prokka work as described by Mensur.
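For the PGAP file at least, the attribute column already carries a `product` and a `protein_id` for each gene, so a first pass at annotating the top DE genes can start from there. Below is a minimal Python sketch, not a tested pipeline. It assumes a standard tab-separated, nine-column GTF/GFF file, and the function names (`parse_attributes`, `annotation_table`) are made up for illustration.

```python
import re


def parse_attributes(attr_field):
    """Parse column 9 into a dict.

    Handles GTF style (key "value";) as in the PGAP line and
    GFF3 style (key=value;) as in the RAST line.
    """
    # GTF style: every attribute is key "value";
    pairs = re.findall(r'(\S+)\s+"([^"]*)"\s*;', attr_field)
    if pairs:
        return dict(pairs)
    # GFF3 style: key=value pairs separated by semicolons
    attrs = {}
    for item in attr_field.strip().strip(";").split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def annotation_table(lines):
    """Map each gene ID to the annotation fields found in the file.

    Assumption: the gene ID is `gene_id` (GTF) or `ID` (GFF3).
    Only the first value seen for each field is kept.
    """
    table = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a well-formed feature line
        attrs = parse_attributes(cols[8])
        gene = attrs.get("gene_id") or attrs.get("ID")
        if gene is None:
            continue
        entry = table.setdefault(gene, {})
        for key in ("protein_id", "product", "Name"):
            if key in attrs:
                entry.setdefault(key, attrs[key])
    return table


# Usage (file name is a placeholder):
# with open("annotation.gtf") as handle:
#     table = annotation_table(handle)
# print(table.get("JYU28_00005"))
```

The `protein_id` values collected this way could then be looked up for the top DE genes. Whether that leads to GO terms depends on how well those proteins are annotated elsewhere.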
It is not permitted to upload the recommended databases to the HPC or the Galaxy web server, so improving the Prokka annotation is difficult. Meanwhile, Gene Ontology analysis requires both the DE genes and all expressed genes as a background, and annotating all of those genes manually would be a nightmare.
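To make the background requirement concrete, here is a minimal Python sketch of a GO over-representation test using a one-sided hypergeometric test. It is an illustration under stated assumptions, not what any particular enrichment tool does: the function names and the `term_to_genes` mapping (GO term to the set of gene IDs annotated with it) are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_term_in_background, n_de, n_term_in_de):
    """One-sided p-value P(X >= n_term_in_de) for a hypergeometric X.

    n_background:          all expressed genes (the background)
    n_term_in_background:  background genes annotated with the GO term
    n_de:                  DE genes (a subset of the background)
    n_term_in_de:          DE genes annotated with the GO term
    """
    upper = min(n_term_in_background, n_de)
    total = comb(n_background, n_de)
    tail = sum(
        comb(n_term_in_background, k)
        * comb(n_background - n_term_in_background, n_de - k)
        for k in range(n_term_in_de, upper + 1)
    )
    return tail / total


def go_enrichment(background, de_genes, term_to_genes):
    """Return {term: p-value}, counting only genes present in the background."""
    background = set(background)
    de_genes = set(de_genes) & background
    results = {}
    for term, genes in term_to_genes.items():
        in_background = set(genes) & background
        in_de = in_background & de_genes
        results[term] = hypergeom_enrichment_p(
            len(background), len(in_background), len(de_genes), len(in_de)
        )
    return results
```

The p-value depends on how many background genes carry each term, which is why the whole expressed set needs GO annotations and not only the DE genes.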