Question

How to get gene list from GENCODE gtf file?

0


21 months ago

ConvolutedGenome ▴ 50

Hi all,

The statistics page of GENCODE Release 26 (https://www.gencodegenes.org/human/stats_26.html) states that it contains 19817 protein-coding genes (I think this is about the right number of protein-coding genes in the human genome).

However, the actual file is confusing because it also contains transcripts, lncRNAs, exons, etc.

I want to extract something very simple: all protein-coding genes in the human genome with their start and end coordinates (there should be 19817 of them).

How can I extract such information from the "gencode.v26.annotation.gtf" file?

Thank you so much!

protein-coding GENCODE gene • 852 views

ADD COMMENT • link updated 20 months ago by Ram 44k • written 21 months ago by ConvolutedGenome ▴ 50

score 1 · Answer 1 · 2023-02-20

1


21 months ago

ATpoint 85k

Keep only the lines whose feature type column (column 3) is gene, and among those keep the ones whose gene_type attribute (in column 9) is protein_coding. It's a great parsing exercise. Please give it a try; happy to help debug whatever you come up with.
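For reference, here is a minimal Python sketch of that filter. It assumes the standard 9-column, tab-separated GTF layout, with the feature type in column 3 and `key "value";` attribute pairs in column 9. The function name `protein_coding_genes` is made up for illustration, and note that GENCODE spells the value `protein_coding` (with an underscore).

```python
import gzip
import re


def protein_coding_genes(gtf_path):
    """Yield (chrom, start, end, strand, gene_id, gene_name) for each
    protein-coding 'gene' record in a GENCODE-style GTF file."""
    # Accept both plain-text and gzip-compressed files.
    opener = gzip.open if gtf_path.endswith(".gz") else open
    with opener(gtf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):  # header/comment lines
                continue
            fields = line.rstrip("\n").split("\t")
            # Column 3 (index 2) is the feature type; keep gene rows only.
            if len(fields) < 9 or fields[2] != "gene":
                continue
            # Column 9 holds attributes such as: gene_type "protein_coding";
            # Unquoted values (e.g. level 2) are ignored by this pattern.
            attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
            if attrs.get("gene_type") != "protein_coding":
                continue
            yield (fields[0], int(fields[3]), int(fields[4]), fields[6],
                   attrs.get("gene_id"), attrs.get("gene_name"))


# Example use (file name taken from the question):
# for gene in protein_coding_genes("gencode.v26.annotation.gtf"):
#     print(*gene, sep="\t")
```

Counting the yielded records is a quick sanity check against the number on the statistics page. GTF coordinates are 1-based and inclusive, so subtract 1 from the start if you need BED-style output.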

ADD COMMENT • link 21 months ago by ATpoint 85k

0


Thank you!

I just want to clarify: you mean I should look for rows where the "type" column has the value "gene" (I assume this is the 3rd column, index 2)?

However, for a given gene_id, there are multiple entries with different values in the "type" column. For example, there are 140 entries corresponding to the TERT gene.


So what are the other entries for? My understanding is that the "gene" row refers to the full length of the gene (including all exons and introns), whereas the first "exon" row gives the coordinates of the first exon WITHIN that full length, and so on. Is that right?
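One way to see what those other entries are is to tally the feature types recorded for a single gene. The sketch below assumes the usual tab-separated GTF layout and a `gene_name "..."` attribute in column 9; `feature_counts` is a hypothetical helper name.

```python
from collections import Counter


def feature_counts(gtf_lines, gene_name):
    """Count how many rows of each feature type (column 3) mention gene_name."""
    counts = Counter()
    needle = f'gene_name "{gene_name}";'
    for line in gtf_lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9 and needle in fields[8]:
            counts[fields[2]] += 1
    return counts


# Example use (file name taken from the question):
# with open("gencode.v26.annotation.gtf") as handle:
#     print(feature_counts(handle, "TERT"))
```

In GENCODE files the feature types are gene, transcript, exon, CDS, UTR, start_codon, stop_codon and Selenocysteine, so a gene with several transcripts gets one gene row plus many rows for the sub-features of each transcript.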

ADD REPLY • link 21 months ago by ConvolutedGenome ▴ 50