Hi all,
The statistics page of GENCODE Release 26 (https://www.gencodegenes.org/human/stats_26.html) states that it contains 19817 protein-coding genes (I think this is about the right number of protein-coding genes in the human genome)
However, the actual file is confusing because it also contains transcript, lncRNA, exon, and other entries.
I want to extract something very simple: all protein-coding genes in the human genome with their start and end coordinates (there should be 19817 of them).
How can I extract such information from the "gencode.v26.annotation.gtf" file?
Thank you so much!
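For reference, here is a minimal sketch of one way to do this in Python. It assumes the standard 9-column, tab-separated GTF layout, where column 3 holds the feature type and column 9 holds `key "value";` attributes including `gene_type` and `gene_name`. The function name `protein_coding_genes`, the `row` helper, and the sample rows are made up for illustration; the rows are toy data, not real GENCODE records.

```python
import re

def protein_coding_genes(lines):
    """Yield (gene_name, chrom, start, end, strand) for protein-coding gene rows."""
    for line in lines:
        if line.startswith("#"):          # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 (index 2) is the feature type; keep only "gene" rows.
        if len(fields) < 9 or fields[2] != "gene":
            continue
        # Parse the quoted key "value" pairs in the attribute column.
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        if attrs.get("gene_type") == "protein_coding":
            yield (attrs.get("gene_name"), fields[0],
                   int(fields[3]), int(fields[4]), fields[6])

def row(chrom, feature, start, end, strand, attrs):
    """Hypothetical helper: build one toy GTF line."""
    return "\t".join([chrom, "TOY", feature, str(start), str(end),
                      ".", strand, ".", attrs]) + "\n"

# Made-up sample rows (not real GENCODE data).
sample = [
    "##description: toy example\n",
    row("chr1", "gene", 100, 900, "+",
        'gene_id "G1"; gene_type "protein_coding"; gene_name "GENE_A"; level 2;'),
    row("chr1", "transcript", 100, 900, "+",
        'gene_id "G1"; transcript_id "T1"; gene_type "protein_coding"; gene_name "GENE_A";'),
    row("chr1", "exon", 100, 300, "+",
        'gene_id "G1"; transcript_id "T1"; gene_type "protein_coding"; gene_name "GENE_A";'),
    row("chr2", "gene", 50, 400, "-",
        'gene_id "G2"; gene_type "lincRNA"; gene_name "GENE_B";'),
]

genes = list(protein_coding_genes(sample))
print(genes)
```

To run it on the real file, something like `with open("gencode.v26.annotation.gtf") as fh: genes = list(protein_coding_genes(fh))` should work. The resulting count may differ slightly from the statistics page, depending on how that page groups gene biotypes.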
Thank you!
I just want to clarify: you mentioned that I should look for rows where the "type" column has the value "gene" (I assume this is the 3rd column, index = 2).
However, for a given gene_id there are multiple entries with different values in the "type" column. For example, there are 140 entries corresponding to the TERT gene.
So what are the other entries for? My understanding is that "gene" refers to the full length of the gene (including all exons and introns), whereas the first "exon" row gives the coordinates of the first exon WITHIN the full length of the gene, and so on. Is that right?
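One way to see what those extra rows are is to tally the feature types for a single gene. In a GTF file, the "gene" row spans the whole locus, and the other rows (transcript, exon, CDS, UTR, and so on) describe sub-features that lie inside that span, repeated once per transcript they belong to. The sketch below assumes the standard tab-separated GTF layout with a `gene_name` attribute; `feature_counts`, `row`, and the sample rows are hypothetical names and toy data, not real GENCODE records.

```python
import re
from collections import Counter

def feature_counts(lines, gene_name):
    """Count rows per feature type (column 3) for one gene_name."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):          # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        match = re.search(r'gene_name "([^"]*)"', fields[8])
        if match and match.group(1) == gene_name:
            counts[fields[2]] += 1
    return counts

def row(feature, start, end, attrs):
    """Hypothetical helper: build one toy GTF line on chr1, + strand."""
    return "\t".join(["chr1", "TOY", feature, str(start), str(end),
                      ".", "+", ".", attrs]) + "\n"

base = 'gene_id "G1"; gene_type "protein_coding"; gene_name "GENE_A";'

# Made-up rows: one gene, one transcript, two exons, one CDS.
sample = [
    row("gene", 100, 900, base),
    row("transcript", 100, 900, base + ' transcript_id "T1";'),
    row("exon", 100, 300, base + ' transcript_id "T1";'),
    row("CDS", 150, 300, base + ' transcript_id "T1";'),
    row("exon", 700, 900, base + ' transcript_id "T1";'),
]

counts = feature_counts(sample, "GENE_A")
print(dict(counts))
```

Running the same function on the real file with `"TERT"` should break the 140 entries down by type, with exactly one of them being the "gene" row.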