Question

How to extract/find the actual names of the gene_IDs if they are not fully presented in gtf.file, and link them to the Count.matrix

0

Entering edit mode

21 months ago

Pegasus ▴ 120

Hi all,

I checked the gtf.file for my reference genome (bacteria/ downloaded from NCBI), and it looks like there are missing some gene names but gene_IDs, and other features.

featureCounts generates count. matrix with only gene_IDs. So, how to find the corresponding gene names in order to advance the RNA-Seq analysis, and generate a volcano plot with the actual gene names?

Showing below part of the gtf.file content ;

JAFJXZ010000017.1   Genbank gene    1323794 1324765 .   -   .   gene_id "JYU28_16885"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "JYU28_16885"; 
JAFJXZ010000017.1   Protein Homology    CDS 1323797 1324765 .   -   0   gene_id "JYU28_16885"; transcript_id "unassigned_transcript_2777"; gbkey "CDS"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_007431750.1"; locus_tag "JYU28_16885"; product "LacI family DNA-binding transcriptional regulator"; protein_id "MBO3285865.1"; transl_table "11"; 
JAFJXZ010000017.1   Protein Homology    start_codon 1324763 1324765 .   -   0   gene_id "JYU28_16885"; transcript_id "unassigned_transcript_2777"; gbkey "CDS"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_007431750.1"; locus_tag "JYU28_16885"; product "LacI family DNA-binding transcriptional regulator"; protein_id "MBO3285865.1"; transl_table "11"; 
JAFJXZ010000017.1   Protein Homology    stop_codon  1323794 1323796 .   -   0   gene_id "JYU28_16885"; transcript_id "unassigned_transcript_2777"; gbkey "CDS"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_007431750.1"; locus_tag "JYU28_16885"; product "LacI family DNA-binding transcriptional regulator"; protein_id "MBO3285865.1"; transl_table "11";

featurecount.matrix 1st row gene_Ids stated as JYU28_number

Thank you

featureCounts gtf RNA-Seq • 1.1k views

ADD COMMENT • link updated 21 months ago by GenoMax 148k • written 21 months ago by Pegasus ▴ 120

1

Entering edit mode

Looks like this is a shotgun sequencing assembly so you are probably not going to find gene names. You may need to do with the gene_id which also seem to be repeated in locus_tag or do some additional annotation work yourself.

ADD REPLY • link updated 21 months ago by Ram 44k • written 21 months ago by GenoMax 148k

0

Entering edit mode

Thank you GenoMax, actually, this genome was annotated by NCBI themselves, and I wonder how to improve the annotation further.

I re-annotated the genome assembly using PROKKA, and here is part of the new gff.file content;

JAFJXZ010000012.1 Prodigal:002006 CDS 876326 877696 . - 0 ID=GOHBADNI_00845;Name=yheD_5;gene=yheD_5;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:O07545;locus_tag=GOHBADNI_00845;product=Endospore coat-associated protein YheD JAFJXZ010000012.1 Prodigal:002006 CDS 877859 878203 . + 0 ID=GOHBADNI_00846;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:O07542;locus_tag=GOHBADNI_00846;note=UPF0342 protein YheA;product=hypothetical protein

PROKKA changed the gene_IDs, and produced more gene_names (but not full list) and features. I believe I should use ID in featureCounts,

gffread prokka.gff

AFJXZ010000012.1 Genbank gene 4399 5565 . - . ID=gene-JYU28_03060;geneID=gene-JYU28_03060

JAFJXZ010000012.1 Genbank CDS 4399 5565 . - 0 Parent=gene-JYU28_03060

JAFJXZ010000012.1 Genbank gene 5914 6042 . - . ID=gene-JYU28_03065;geneID=gene-JYU28_03065

JAFJXZ010000012.1 Genbank CDS 5914 6042 . - 0 Parent=gene-JYU28_03065

JAFJXZ010000012.1 Genbank gene 6237 6500 . + . ID=gene-JYU28_03070;geneID=gene-JYU28_03070 JAFJXZ010000012.1 Genbank CDS 6237 6500 . + 0 Parent=gene-JYU28_03070

JAFJXZ010000012.1 Genbank gene 6646 7080 . + . ID=gene-JYU28_03075;geneID=gene-JYU28_03075

JAFJXZ010000012.1 Genbank CDS 6646 7080 . + 0 Parent=gene-JYU28_03075

JAFJXZ010000012.1 Genbank gene 7680 8165 . + . ID=gene-JYU28_03080;geneID=gene-JYU28_03080;gene_name=purE

JAFJXZ010000012.1 Genbank CDS 7680 8165 . + 0 Parent=gene-JYU28_03080

JAFJXZ010000012.1 Genbank gene 8162 9337 . + . ID=gene-JYU28_03085;geneID=gene-JYU28_03085;gene_name=purK

JAFJXZ010000012.1 Genbank CDS 8162 9337 . + 0 Parent=gene-JYU28_03085

JAFJXZ010000012.1 Genbank gene 9340 10638 . + . ID=gene-JYU28_03090;geneID=gene-JYU28_03090;gene_name=purB

JAFJXZ010000012.1 Genbank CDS 9340 10638 . + 0 Parent=gene-JYU28_03090

JAFJXZ010000012.1 Genbank gene 11362 12243 . + . ID=gene-JYU28_03095;geneID=gene-JYU28_03095

JAFJXZ010000012.1 Genbank CDS 11362 12243 . + 0 Parent=gene-JYU28_03095

But I got this error using featureCounts: ERROR: no features were loaded in format GTF. The annotation format can be specified by the '-F' option, and the required feature type can be specified by the '-t' option..

What do you think is the cause of this error, and then how to link the IDs in the generated count.matrix with gene_names in the gtf.file, before advancing to edgeR/deseq2?

ADD REPLY • link 21 months ago by Pegasus ▴ 120

0

Entering edit mode

You could Convert your annotation file into Simple Annotation Format (SAF) that featureCounts understands. You will need to pull out GENE_ID CHR START STOP from your PROKKA file. Then use this file to count.

ADD REPLY • link 21 months ago by GenoMax 148k