Question

Find database gene IDs for unknown gene_ids in a GTF file


19 months ago

Pegasus ▴ 120

Hi all,

Following an RNA-seq analysis workflow, I am trying to find Gene Ontology (GO) terms for a list of differentially expressed genes (DGEs) produced by featureCounts > edgeR. I ran the RNA-seq analysis using either a RAST-annotated GTF file or an NCBI PGAP GTF file.

1 - In the RAST GTF file, the majority of genes look like the one below (no locus_tag, no transcript_id):

Scaffold_3  FIG CDS 1598714 1599913 .   -   2 ID=fig|6666666.1005592.peg.4310;Name=Quinolone resistance NorA protein
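For reference, the attribute column of the RAST line above is in GFF3 style (key=value pairs separated by semicolons) rather than GTF style. A minimal sketch of pulling the ID and Name out of it, assuming the columns are tab-separated in the real file (the whitespace in the pasted line may have been altered); the function name `parse_gff3_attributes` is only illustrative:

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Split a GFF3-style attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        key, _, value = field.partition("=")
        # GFF3 percent-encodes reserved characters, so decode the value
        attrs[key.strip()] = unquote(value.strip())
    return attrs


# The RAST record from the post, with tabs assumed between the nine columns
line = (
    "Scaffold_3\tFIG\tCDS\t1598714\t1599913\t.\t-\t2\t"
    "ID=fig|6666666.1005592.peg.4310;Name=Quinolone resistance NorA protein"
)
columns = line.split("\t")
attrs = parse_gff3_attributes(columns[8])
print(attrs["ID"], "->", attrs["Name"])
```

The only identifier available here is the RAST-internal `fig|...` ID, so the product name in `Name` is the one field that could be searched against another resource.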

2 - In the NCBI PGAP GTF file, the majority of genes look like the one below (gene_ID = transcript_ID = locus_tag):

GeneMarkS-2+ stop_codon 235 237 . + 0 gene_id "JYU28_00005"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS-2+"; locus_tag "JYU28_00005"; partial "true"; product "IS5/IS1182 family transposase"; protein_id "MBO3282641.1"; transl_table "11"; exon_number "1";

In both cases, the gene IDs are unrelated to any database, even RefSeq, so I couldn't convert the DGEs list to Entrez, Ensembl, or UniProt IDs, which I could then use in further GO enrichment analysis.
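One thing visible in the PGAP record above is that it also carries a `protein_id` and a `product` next to the `gene_id`. A minimal sketch of building a gene_id lookup table from those attributes, assuming a tab-separated GTF with the attributes in the last column; the seqname `contig_1` and the function names are hypothetical, and whether the `protein_id` accession can actually be mapped to UniProt or GO would still have to be checked:

```python
import re

# GTF attributes look like: key "value"; key "value";
GTF_ATTR = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_attributes(column9):
    """Turn a GTF attribute column into a {key: value} dict."""
    return dict(GTF_ATTR.findall(column9))


def build_id_table(gtf_lines):
    """Map gene_id -> protein_id and product for records that carry a protein_id."""
    table = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        attrs = parse_gtf_attributes(line.rstrip("\n").split("\t")[-1])
        gene_id = attrs.get("gene_id")
        if gene_id and "protein_id" in attrs:
            table[gene_id] = {
                "protein_id": attrs["protein_id"],
                "product": attrs.get("product", ""),
            }
    return table


# The PGAP record from the post; "contig_1" is a made-up seqname
example = (
    "contig_1\tGeneMarkS-2+\tstop_codon\t235\t237\t.\t+\t0\t"
    'gene_id "JYU28_00005"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; '
    'inference "COORDINATES: ab initio prediction:GeneMarkS-2+"; '
    'locus_tag "JYU28_00005"; partial "true"; '
    'product "IS5/IS1182 family transposase"; protein_id "MBO3282641.1"; '
    'transl_table "11"; exon_number "1";'
)
table = build_id_table([example])
print(table)
```

The same table could be written out as a two-column file and used as input for whatever ID-mapping service turns out to accept these accessions.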

I appreciate any help or suggestion to find a solution for this issue,

Thank you

RNA-SEQ annotation GO-term • 997 views

ADD COMMENT • link 19 months ago by Pegasus ▴ 120


You will need to annotate these yourself by doing additional work. The pipelines you mention are likely producing computer predictions with no additional validation/annotation.

ADD REPLY • link 19 months ago by GenoMax 148k


Thank you GenoMax. I checked the Biostar-booklet and could not find a solution for this issue, since the organism is not a model one. Can you please recommend a pipeline, tool, or workflow for this kind of annotation?

Thanks

ADD REPLY • link 19 months ago by Pegasus ▴ 120


Since you appear to be working with bacterial genomes, prokka (LINK) is probably the program of choice. For eukaryotic genomes you would go to MAKER and others.

ADD REPLY • link 19 months ago by GenoMax 148k


But as I read in another post, "Prokka Annotation or NCBI Annotation", in a reply by Mensur Dlakic:

"If I remember correctly, prokka comes only with HAMAP database of HMMs, which will produce terrible annotations on prokaryotic genomes. To get good annotations you would need to install at least Pfam and TIGRfams. Don't know if you have done that or not, but you can find out by looking at prokka's annotations. If there are many hypothetical proteins for prokka where NCBI files have meaningful annotations, chances are that you don't have any extra prokka HMM databases. If you are literally comparing identical genomes, it may be better to go with NCBI annotations"
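The check described in that quote (how many products are just "hypothetical protein" in one annotation versus the other) can be sketched roughly as follows. The product lists here are made up for illustration; in practice they would come from the `product` field of each annotation file:

```python
def hypothetical_fraction(products):
    """Return the fraction of product names that contain 'hypothetical protein'."""
    products = list(products)
    if not products:
        return 0.0
    hits = sum(1 for name in products if "hypothetical protein" in name.lower())
    return hits / len(products)


# Illustrative product lists only, not real Prokka or NCBI output
prokka_products = [
    "hypothetical protein",
    "IS5/IS1182 family transposase",
    "hypothetical protein",
    "Quinolone resistance NorA protein",
]
ncbi_products = [
    "IS5/IS1182 family transposase",
    "Quinolone resistance NorA protein",
    "hypothetical protein",
    "Quinolone resistance NorA protein",
]

print("prokka:", hypothetical_fraction(prokka_products))
print("ncbi:  ", hypothetical_fraction(ncbi_products))
```

A much higher fraction for one tool on the same genome would point to the situation the quote describes, although the threshold for "many" is a judgment call.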

ADD REPLY • link 19 months ago by Pegasus ▴ 120


Since the NCBI annotations in your case are automated and not very useful, it is up to you to decide whether you want to take the time to complete the annotations properly.

You could do your DE analysis with those IDs and then spend some time manually annotating the top DE genes, if you don't want to spend the time to make Prokka work as described by Mensur.
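As a rough illustration of that second option, here is a minimal sketch that keeps the original IDs through the DE step and then attaches the product names to the top hits for manual follow-up. The column names `logFC` and `FDR` follow edgeR's usual output, but the rows, the second gene ID, and the function name are invented for the example:

```python
def top_de_with_products(de_rows, id_to_product, n=20, fdr_cutoff=0.05):
    """Return the n most significant DE rows, each with a 'product' field added."""
    significant = [row for row in de_rows if row["FDR"] < fdr_cutoff]
    significant.sort(key=lambda row: row["FDR"])
    return [
        dict(row, product=id_to_product.get(row["gene_id"], "unknown"))
        for row in significant[:n]
    ]


# Invented DE results; only JYU28_00005 is an ID that appears in the thread
de_rows = [
    {"gene_id": "JYU28_00005", "logFC": 2.1, "FDR": 0.001},
    {"gene_id": "JYU28_00010", "logFC": -1.4, "FDR": 0.20},
    {"gene_id": "JYU28_00015", "logFC": 3.0, "FDR": 0.0002},
]
id_to_product = {"JYU28_00005": "IS5/IS1182 family transposase"}

for row in top_de_with_products(de_rows, id_to_product, n=10):
    print(row["gene_id"], row["logFC"], row["FDR"], row["product"])
```

Genes that come back as "unknown" are the ones that would need a manual search.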

ADD REPLY • link 19 months ago by GenoMax 148k


It is not permitted to upload the recommended databases to the HPC or the Galaxy web server, so improving the Prokka annotation is difficult. Meanwhile, the Gene Ontology analysis requires both the DE genes and all expressed genes as a background, and annotating all of these genes manually would be a nightmare.
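To make the background requirement concrete, here is a minimal sketch of a one-sided hypergeometric over-representation test for a single GO term, which is one common way such enrichment is computed (not necessarily what any particular tool does). It assumes a gene-to-term mapping already exists, which is exactly the missing piece in this thread; the gene names and set sizes are placeholders:

```python
from math import comb


def enrichment_p_value(background, de_genes, term_genes):
    """P(X >= k) under the hypergeometric distribution for one GO term.

    background: all expressed genes
    de_genes:   differentially expressed genes
    term_genes: genes annotated with the term
    """
    background = set(background)
    de = set(de_genes) & background      # only count genes in the background
    term = set(term_genes) & background
    n_bg, n_term, n_de = len(background), len(term), len(de)
    k = len(de & term)                   # DE genes carrying the term
    # Sum the upper tail as integers, then divide once
    tail = sum(
        comb(n_term, i) * comb(n_bg - n_term, n_de - i)
        for i in range(k, min(n_term, n_de) + 1)
    )
    return tail / comb(n_bg, n_de)


# Placeholder genes: 10 expressed, 4 carry the term, 3 are DE and all 3 carry it
background = [f"g{i}" for i in range(10)]
term_genes = ["g0", "g1", "g2", "g3"]
de_genes = ["g0", "g1", "g2"]
p = enrichment_p_value(background, de_genes, term_genes)
print(p)
```

Because the background enters through both `n_bg` and `n_term`, every expressed gene needs a term assignment (or an explicit "no term"), not just the DE genes.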

ADD REPLY • link 19 months ago by Pegasus ▴ 120