Entering edit mode
21 months ago
Pegasus
▴
120
Hi all,
I checked the gtf.file for my reference genome (bacteria/ downloaded from NCBI), and it looks like there are missing some gene names but gene_IDs, and other features.
featureCounts generates count. matrix with only gene_IDs. So, how to find the corresponding gene names in order to advance the RNA-Seq analysis, and generate a volcano plot with the actual gene names?
Showing below part of the gtf.file content ;
JAFJXZ010000017.1 Genbank gene 1323794 1324765 . - . gene_id "JYU28_16885"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "JYU28_16885";
JAFJXZ010000017.1 Protein Homology CDS 1323797 1324765 . - 0 gene_id "JYU28_16885"; transcript_id "unassigned_transcript_2777"; gbkey "CDS"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_007431750.1"; locus_tag "JYU28_16885"; product "LacI family DNA-binding transcriptional regulator"; protein_id "MBO3285865.1"; transl_table "11";
JAFJXZ010000017.1 Protein Homology start_codon 1324763 1324765 . - 0 gene_id "JYU28_16885"; transcript_id "unassigned_transcript_2777"; gbkey "CDS"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_007431750.1"; locus_tag "JYU28_16885"; product "LacI family DNA-binding transcriptional regulator"; protein_id "MBO3285865.1"; transl_table "11";
JAFJXZ010000017.1 Protein Homology stop_codon 1323794 1323796 . - 0 gene_id "JYU28_16885"; transcript_id "unassigned_transcript_2777"; gbkey "CDS"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_007431750.1"; locus_tag "JYU28_16885"; product "LacI family DNA-binding transcriptional regulator"; protein_id "MBO3285865.1"; transl_table "11";
featurecount.matrix 1st row gene_Ids stated as JYU28_number
Thank you
Looks like this is a shotgun sequencing assembly so you are probably not going to find gene names. You may need to do with the
gene_id
which also seem to be repeated inlocus_tag
or do some additional annotation work yourself.Thank you GenoMax, actually, this genome was annotated by NCBI themselves, and I wonder how to improve the annotation further.
JAFJXZ010000012.1 Prodigal:002006 CDS 876326 877696 . - 0 ID=GOHBADNI_00845;Name=yheD_5;gene=yheD_5;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:O07545;locus_tag=GOHBADNI_00845;product=Endospore coat-associated protein YheD JAFJXZ010000012.1 Prodigal:002006 CDS 877859 878203 . + 0 ID=GOHBADNI_00846;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:O07542;locus_tag=GOHBADNI_00846;note=UPF0342 protein YheA;product=hypothetical protein
PROKKA changed the gene_IDs, and produced more gene_names (but not full list) and features. I believe I should use ID in featureCounts,
gffread prokka.gff
AFJXZ010000012.1 Genbank gene 4399 5565 . - . ID=gene-JYU28_03060;geneID=gene-JYU28_03060
JAFJXZ010000012.1 Genbank CDS 4399 5565 . - 0 Parent=gene-JYU28_03060
JAFJXZ010000012.1 Genbank gene 5914 6042 . - . ID=gene-JYU28_03065;geneID=gene-JYU28_03065
JAFJXZ010000012.1 Genbank CDS 5914 6042 . - 0 Parent=gene-JYU28_03065
JAFJXZ010000012.1 Genbank gene 6237 6500 . + . ID=gene-JYU28_03070;geneID=gene-JYU28_03070 JAFJXZ010000012.1 Genbank CDS 6237 6500 . + 0 Parent=gene-JYU28_03070
JAFJXZ010000012.1 Genbank gene 6646 7080 . + . ID=gene-JYU28_03075;geneID=gene-JYU28_03075
JAFJXZ010000012.1 Genbank CDS 6646 7080 . + 0 Parent=gene-JYU28_03075
JAFJXZ010000012.1 Genbank gene 7680 8165 . + . ID=gene-JYU28_03080;geneID=gene-JYU28_03080;gene_name=purE
JAFJXZ010000012.1 Genbank CDS 7680 8165 . + 0 Parent=gene-JYU28_03080
JAFJXZ010000012.1 Genbank gene 8162 9337 . + . ID=gene-JYU28_03085;geneID=gene-JYU28_03085;gene_name=purK
JAFJXZ010000012.1 Genbank CDS 8162 9337 . + 0 Parent=gene-JYU28_03085
JAFJXZ010000012.1 Genbank gene 9340 10638 . + . ID=gene-JYU28_03090;geneID=gene-JYU28_03090;gene_name=purB
JAFJXZ010000012.1 Genbank CDS 9340 10638 . + 0 Parent=gene-JYU28_03090
JAFJXZ010000012.1 Genbank gene 11362 12243 . + . ID=gene-JYU28_03095;geneID=gene-JYU28_03095
JAFJXZ010000012.1 Genbank CDS 11362 12243 . + 0 Parent=gene-JYU28_03095
But I got this error using featureCounts: ERROR: no features were loaded in format GTF. The annotation format can be specified by the '-F' option, and the required feature type can be specified by the '-t' option..
What do you think is the cause of this error, and then how to link the IDs in the generated count.matrix with gene_names in the gtf.file, before advancing to edgeR/deseq2?
You could Convert your annotation file into Simple Annotation Format (SAF) that featureCounts understands. You will need to pull out
GENE_ID CHR START STOP
from your PROKKA file. Then use this file to count.