Counting how many transcripts per gene in gtf file
2
0
Entering edit mode
4.7 years ago
sallyey2 • 0

Hi I want to count the number of transcripts per gene in GTF file, using python 2 on Ubuntu 18.04. Could you help me how to count how many transcripts per gene in GTF file? Your help is much appreciated!

I am using a public gtf.gz file from ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/gencode.v33.annotation.gtf.gz and the output gtf file briefly looks like below, but it's truncated when I copied it from Ubuntu. Each row has more data including gene_type, gene_name, transcript_type, etc.

##description: evidence-based annotation of the human genome (GRCh38), version 33 (Ensembl 99)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2019-12-13
chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; lev
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unproc
chr1    HAVANA  exon    11869   12227   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  exon    12613   12721   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  exon    13221   14409   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  transcript      12010   13670   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unproc
chr1    HAVANA  exon    12010   12057   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  exon    12179   12227   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  exon    12613   12697   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  exon    12975   13052   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  exon    13221   13374   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  exon    13453   13670   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  gene    14404   29570   .       -       .       gene_id "ENSG00000227232.5"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; level 2; hgnc_id
chr1    HAVANA  transcript      14404   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudo
chr1    HAVANA  exon    29534   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; g
rna-seq gene python gtf • 3.5k views
ADD COMMENT
0
Entering edit mode

Please do not delete posts once they have received a comment or an answer. If an answer has solved your problem please accept it by using the green check mark to provide closure to this thread.

ADD REPLY
1
Entering edit mode
4.7 years ago
liorglic ★ 1.4k

Here is a quick option with awk:

awk -F "\t" '$3 == "transcript" {split($9,a,";"); print a[1]}' | sort | uniq -c

but you may also want to check out a library that parses gtf/gff in your favorite programming language.

ADD COMMENT
0
Entering edit mode
4.7 years ago
wm ▴ 570

First, you did not paste the entire lines above. What you see contains gene_type, ... , is the complete version GTF record.

Second, number of transcripts per gene. the 3-column is transcript, and parse the "gene_id" and "transcript_id", use dict to count transcripts for each gene.

#!/usr/bin/env python3
# filename: aaa.py
# python3

import os
import sys
import gzip

gtf_file = 'input.gtf.gz'

d = {}
with gzip.open(gtf_file, 'rt') as r:
    for line in r:
        if line.startswith('#'):
            continue
        fields = line.strip().split('\t')
        if not fields[2] == 'transcript':
            continue
        gene = tx = ''
        for spec in fields[8].split(';'):
            if spec.startswith('gene_id'):
                gene = spec.replace('gene_id ', '')
            elif spec.startswith('transcript_id'):
                tx = spec.replace('transcript_id ', '')
            else:
                continue
        d[gene] = d.get(gene, 0) + 1

i = 1
for k, v in d.items():
   if i > 3:
       break
   print(k, v)
   i += 1

Here are the result:

$ python aaa.py gencode.v33.annotation.gtf.gz
"ENSG00000223972.5" 2
"ENSG00000227232.5" 1
"ENSG00000278267.1" 1
ADD COMMENT

Login before adding your answer.

Traffic: 2937 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6