Question

How are duplicated genes named in a GTF file?

21 months ago

Petesview ▴ 10

Hi,

Some genes in the genome are known to be duplicated, so there are multiple copies of the same protein-coding sequence at different loci. My question is, how are these duplicated genes named in GENCODE or RefSeq annotation (GTF or GFF file)? Do they have the exact same gene name, but with numerical suffixes (e.g. Gene_A.1 and Gene_A.2)? Do they have different gene names (e.g. Gene_A1 and Gene_A2)? Or are all duplicates of a gene merged under one gene name and quantified together (e.g. Gene_A)?

Also, do different splice variants of a gene have their own gene names, and are they quantified separately when using the GTF or GFF file in a standard RNA-seq analysis?

RNA-seq • 1.7k views

ADD COMMENT • link updated 21 months ago by biofalconch ★ 1.3k • written 21 months ago by Petesview ▴ 10

do different splice variants of a gene have their own gene names, and are they quantified separately when using the GTF or GFF file in a standard RNA-seq analysis?

They don't have different gene IDs, but they do have different transcript IDs. Depending on the quantification tool and the downstream analyses, they can be quantified separately, but most people quantify at the gene level.
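
To make the transcript-versus-gene distinction concrete, here is a minimal sketch of collapsing transcript-level counts to gene-level counts. The IDs (`tx1`, `GeneA`, ...), the counts, and the helper name `gene_level_counts` are all made up for illustration; in practice the transcript-to-gene map would come from the `transcript_id` and `gene_id` attributes of the annotation.

```python
from collections import defaultdict


def gene_level_counts(transcript_counts, tx2gene):
    """Sum per-transcript counts into per-gene totals."""
    totals = defaultdict(float)
    for tx_id, count in transcript_counts.items():
        # Every transcript belongs to exactly one gene_id.
        totals[tx2gene[tx_id]] += count
    return dict(totals)


# Hypothetical mapping: two splice variants of GeneA, one transcript of GeneB.
tx2gene = {"tx1": "GeneA", "tx2": "GeneA", "tx3": "GeneB"}
transcript_counts = {"tx1": 10.0, "tx2": 5.0, "tx3": 7.0}

gene_counts = gene_level_counts(transcript_counts, tx2gene)
print(gene_counts)
```

Keeping the transcript-level table around costs little, so the choice between the two levels can be deferred to the downstream analysis.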

ADD REPLY • link 21 months ago by biofalconch ★ 1.3k

score 0 · Answer 1 · 2023-08-14

Note: this answer covers genes that are _annotated_ more than once on the same assembly. Duplicated genes are typically just annotated twice with distinct GeneIDs; the gene symbol may be the same for these, but the GeneID is not.

In the case of RefSeq annotation files, if GeneA is annotated twice then the gene_id attribute in the GTF file will be GeneA for the first instance and GeneA_1 for the second instance. In both cases, the gene attribute will have the value GeneA, which is the official symbol for that gene. For example, look at the annotation of the mouse gene Erdr1x. In the GFF3 file, the following gene rows are present:

NC_000086.8 BestRefSeq  pseudogene  168793522   168801793   .   +   .   ID=gene-Erdr1x;Dbxref=GeneID:170942,MGI:MGI:2384747;Name=Erdr1x;description=erythroid differentiation regulator 1 x;end_range=168801793,.;gbkey=Gene;gene=Erdr1x;gene_biotype=transcribed_pseudogene;gene_synonym=edr,Erdr1,Gm21887,Gm55594;partial=true;pseudo=true
NC_000087.8 BestRefSeq  pseudogene  90796711    90827734    .   +   .   ID=gene-Erdr1x-2;Dbxref=GeneID:170942,MGI:MGI:2384747;Name=Erdr1x;description=erythroid differentiation regulator 1 x;gbkey=Gene;gene=Erdr1x;gene_biotype=transcribed_pseudogene;gene_synonym=edr,Erdr1,Gm21887,Gm55594;pseudo=true
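
As a rough sketch, the attribute column of these two GFF3 rows can be split into key/value pairs to see that the `ID` differs while the `Dbxref` GeneID is shared. The attribute strings below are shortened copies of the rows above, the function name `parse_gff3_attributes` is just an illustrative choice, and percent-encoding in GFF3 values is ignored for simplicity.

```python
def parse_gff3_attributes(attr_field):
    """Parse a GFF3 attribute column (key=value;key=value) into a dict of lists."""
    attrs = {}
    for pair in attr_field.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # Multi-valued attributes such as Dbxref are comma-separated.
        attrs[key] = value.split(",")
    return attrs


# Shortened attribute columns from the two Erdr1x gene rows.
rows = [
    "ID=gene-Erdr1x;Dbxref=GeneID:170942,MGI:MGI:2384747;Name=Erdr1x;gene=Erdr1x",
    "ID=gene-Erdr1x-2;Dbxref=GeneID:170942,MGI:MGI:2384747;Name=Erdr1x;gene=Erdr1x",
]

parsed = [parse_gff3_attributes(row) for row in rows]
row_ids = [p["ID"][0] for p in parsed]
gene_ids = {x for p in parsed for x in p["Dbxref"] if x.startswith("GeneID:")}

print(row_ids)   # the per-row IDs differ
print(gene_ids)  # the GeneID cross-reference is shared
```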

That same gene has the following two rows in GTF:

NC_000086.8 BestRefSeq  gene    168793522   168801793   .   +   .   gene_id "Erdr1x"; transcript_id ""; db_xref "GeneID:170942"; db_xref "MGI:MGI:2384747"; description "erythroid differentiation regulator 1 x"; gbkey "Gene"; gene "Erdr1x"; gene_biotype "transcribed_pseudogene"; gene_synonym "edr"; gene_synonym "Erdr1"; gene_synonym "Gm21887"; gene_synonym "Gm55594"; partial "true"; pseudo "true"; 
NC_000087.8 BestRefSeq  gene    90796711    90827734    .   +   .   gene_id "Erdr1x_1"; transcript_id ""; db_xref "GeneID:170942"; db_xref "MGI:MGI:2384747"; description "erythroid differentiation regulator 1 x"; gbkey "Gene"; gene "Erdr1x"; gene_biotype "transcribed_pseudogene"; gene_synonym "edr"; gene_synonym "Erdr1"; gene_synonym "Gm21887"; gene_synonym "Gm55594"; pseudo "true";
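
If you want to find such cases in your own RefSeq GTF, one option is to group the `gene_id` values by the `gene` (symbol) attribute and look for symbols with more than one `gene_id`. The sketch below is a minimal example under a few assumptions: the input is tab-separated with nine columns, the attributes follow the `key "value";` layout shown above, and the two sample lines are shortened versions of the Erdr1x rows. The function names are illustrative, not part of any library.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_attributes(attr_field):
    """Return a dict of lists, since keys like db_xref can repeat."""
    attrs = defaultdict(list)
    for key, value in ATTR_RE.findall(attr_field):
        attrs[key].append(value)
    return attrs


def gene_ids_by_symbol(gtf_lines):
    """Map each gene symbol to the gene_id values of its 'gene' rows."""
    symbols = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue
        attrs = parse_gtf_attributes(fields[8])
        if attrs.get("gene") and attrs.get("gene_id"):
            symbols[attrs["gene"][0]].append(attrs["gene_id"][0])
    return dict(symbols)


# Shortened versions of the two Erdr1x rows, rebuilt with real tabs.
sample = [
    "\t".join(["NC_000086.8", "BestRefSeq", "gene", "168793522", "168801793",
               ".", "+", ".",
               'gene_id "Erdr1x"; transcript_id ""; db_xref "GeneID:170942"; gene "Erdr1x";']),
    "\t".join(["NC_000087.8", "BestRefSeq", "gene", "90796711", "90827734",
               ".", "+", ".",
               'gene_id "Erdr1x_1"; transcript_id ""; db_xref "GeneID:170942"; gene "Erdr1x";']),
]

by_symbol = gene_ids_by_symbol(sample)
multi = {sym: ids for sym, ids in by_symbol.items() if len(ids) > 1}
print(multi)
```

For a real file, you could pass an open file handle instead of `sample`, since the function only iterates over lines.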