How to get gene list from GENCODE gtf file?
21 months ago

Hi all,

The statistics page for GENCODE Release 26 (https://www.gencodegenes.org/human/stats_26.html) states that the release contains 19817 protein-coding genes, which I think is about the right number for the human genome.

However, the actual file is confusing because it also contains entries for transcripts, lncRNAs, exons, and so on.

I want to extract something very simple: all protein-coding genes in the human genome together with their start and end coordinates (there should be 19817 of them).

How can I extract such information from the "gencode.v26.annotation.gtf" file?

Thank you so much!

protein-coding GENCODE gene • 853 views
21 months ago
ATpoint 85k

Keep only the rows whose feature type (the third column) is gene, then select those whose gene_type attribute (in the ninth column) is protein_coding. It's a great parsing exercise. Please try something, and I'm happy to help debug it.
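To make the two filters concrete, here is a minimal Python sketch, not a definitive implementation. It assumes the usual tab-separated GTF layout with nine columns and attributes written as `key "value";` pairs. The helper names (`parse_attributes`, `protein_coding_genes`) and the toy rows are made up for illustration and are not taken from the real file.

```python
import io

def parse_attributes(field):
    """Turn the 9th GTF column ('key "value"; key "value";') into a dict.

    Assumes no semicolons inside quoted values. Keys that repeat
    (such as tag) keep only their first value.
    """
    attrs = {}
    for item in field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs.setdefault(key, value.strip('"'))
    return attrs

def protein_coding_genes(handle):
    """Yield (chrom, start, end, strand, gene_id, gene_name) per protein-coding gene row."""
    for line in handle:
        if line.startswith("#"):          # skip header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":   # column 3 is the feature type
            continue
        attrs = parse_attributes(cols[8])
        if attrs.get("gene_type") == "protein_coding":
            yield (cols[0], int(cols[3]), int(cols[4]), cols[6],
                   attrs.get("gene_id"), attrs.get("gene_name"))

# Toy input with made-up identifiers and coordinates (hypothetical, not real GENCODE rows).
example = io.StringIO(
    "##description: toy example\n"
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1.1"; gene_type "protein_coding"; gene_name "GENE_A";\n'
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1.1"; transcript_id "T1.1"; gene_type "protein_coding"; gene_name "GENE_A";\n'
    'chr1\tHAVANA\texon\t100\t300\t.\t+\t.\tgene_id "G1.1"; transcript_id "T1.1"; gene_type "protein_coding"; gene_name "GENE_A";\n'
    'chr2\tHAVANA\tgene\t500\t700\t.\t-\t.\tgene_id "G2.1"; gene_type "lincRNA"; gene_name "GENE_B";\n'
)

genes = list(protein_coding_genes(example))
for gene in genes:
    print(*gene, sep="\t")
```

For the real file, replace the toy `io.StringIO` with `open("gencode.v26.annotation.gtf")` and compare `len(genes)` against the number on the statistics page.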


Thank you!

Just to clarify: you mean I should look for rows where the "type" column has the value "gene" (I assume this is the 3rd column, index 2)?

However, a given gene_id has multiple entries with different values in the "type" column. For example, there are 140 entries for the TERT gene.


So what are the other entries for? My understanding is that "gene" refers to the full length of the gene (including every exon and intron), whereas the first "exon" row gives the coordinates of the first exon within that full length, and so on. Is that right?
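One way to check this reading is to tally the feature types for a single gene and confirm that every other row falls inside the span of the "gene" row. The sketch below uses made-up rows for one hypothetical gene_id, not the real TERT entries. In a GTF file all coordinates are positions on the chromosome (1-based, inclusive), and exon rows are listed once per transcript, which is one reason a single gene can have so many entries.

```python
from collections import Counter

# Made-up rows for one hypothetical gene_id: (feature_type, start, end).
# Two transcripts share exons, so the same exon coordinates appear twice.
rows = [
    ("gene", 100, 900),
    ("transcript", 100, 900),
    ("exon", 100, 300),
    ("CDS", 150, 300),
    ("exon", 600, 900),
    ("CDS", 600, 850),
    ("transcript", 200, 900),
    ("exon", 200, 300),
    ("exon", 600, 900),
]

# How many rows of each feature type belong to this gene?
counts = Counter(ftype for ftype, _, _ in rows)

# The single "gene" row gives the full span of the gene.
gene_start, gene_end = next((s, e) for ftype, s, e in rows if ftype == "gene")

# Every other feature should lie within that span.
all_inside = all(gene_start <= s and e <= gene_end for _, s, e in rows)

print(dict(counts))
print("all features inside gene span:", all_inside)
```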
