extract only geneID and gene symbol from GTF file
7
7
Entering edit mode
9.6 years ago
catherine ▴ 250

Hi Everyone,

I want to get geneID(gene_id) and its associated gene symbol(gene_name) from a gtf file into a file that can be read into R.

My gtf file looks like this:

chr1    HAVANA  gene    3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "RP23-271O17.1"; transcript_type "TEC"; transcript_status "KNOWN"; transcript_name "RP23-271O17.1"; level 2; havana_gene "OTTMUSG00000049935.1";
chr1    HAVANA  transcript      3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "RP23-271O17.1"; transcript_type "TEC"; transcript_status "KNOWN"; transcript_name "RP23-271O17.1-001"; level 2; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";

Can anyone tell me how to write a little script (preferably python) to do it?

Any idea would be appreciated!

gtf python genome • 36k views
ADD COMMENT
6
Entering edit mode
9.6 years ago

One quick and dirty way to extract attributes:

#!/usr/bin/env python

import sys

for line in sys.stdin:
    attr = dict(item.strip().split(' ') for item in line.split('\t')[8].strip('\n').split('') if item)
    sys.stdout.write("%s\n" % (attr['gene_id'].strip('\"') + '\t' + attr['gene_name'].strip('\"')))

To use:

$ ./extract_gtf_attributes_of_interest.py < foo.gtf > answer.txt

The result with your test input:

$ more answer.txt
​ENSMUSG00000102693.1    RP23-271O17.1
ENSMUSG00000102693.1    RP23-271O17.1

As mentioned, this is quick and dirty and does no error checking or input validation. Consider adding a try..except block to handle the case where an attribute field is missing a value (or the attribute field is missing altogether).

ADD COMMENT
16
Entering edit mode
7.4 years ago
jaro.slamecka ▴ 270

For those who want to use R directly, there is indeed a fast and efficient way to do it, using package "refGenome", first create and empty refGenome object, then read the GTF file in and finally subset the data.frame to only keep gene IDs and symbols

library(refGenome)
gtf = ensemblGenome()
read.gtf(gtf, filename="Mus_musculus.GRCm38.89.gtf")
genes = gtf@ev$genes[ ,c("gene_id","gene_name")]

Edit: At this point, it is true that refGenome is no longer available, here is a fast alternative:

gtf.file = "/path/to/Homo_sapiens.GRCh38.100.gtf"
gtf.gr = rtracklayer::import(gtf.file) # creates a GRanges object
gtf.df = as.data.frame(gtf.gr)
genes = unique(gtf.df[ ,c("gene_id","gene_name")])
library(data.table)
fwrite(genes, file="gene_ID.gene_name.txt", sep="\t")

(line 3 above looks like it's missing the opening bracket on gtf.gr, not sure how to fix it)

ADD COMMENT
0
Entering edit mode

This package was removed from CRAN on the 4th March 2019 (https://cran.r-project.org/web/packages/refGenome/index.html)

ADD REPLY
0
Entering edit mode

The link shows that this package is still available. I can install this package today (07/24/2019)

ADD REPLY
0
Entering edit mode

How can I extract the "genes" with the subset of data in csv/tab format? thanks

ADD REPLY
0
Entering edit mode

In R

write.csv(genes, file="genes.csv", sep=",", quote=FALSE)

Use sep="\t" for tab delimited

ADD REPLY
15
Entering edit mode
9.6 years ago
EagleEye 7.6k

This should work,

Ensembl GTF:

zcat Homo_sapiens.GRCh38.94.gtf.gz | awk 'BEGIN{FS="\t"}{split($9,a,";"); if($3~"gene") print a[1]"\t"a[3]"\t"$1":"$4"-"$5"\t"a[5]"\t"$7}' | sed 's/gene_id "//' | sed 's/gene_id "//' | sed 's/gene_biotype "//'| sed 's/gene_name "//' | sed 's/gene_biotype "//' | sed 's/"//g' | sed 's/ //g' | sed '1igene_id\tGeneSymbol\tChromosome\tClass\tStrand' > Homo_sapiens.GRCh38.94_gene_annotation_table.txt

Output:

gene_id GeneSymbol  Chromosome  Class   Strand
ENSG00000223972 DDX11L1 1:11869-14409   transcribed_unprocessed_pseudogene  +
ENSG00000227232 WASH7P  1:14404-29570   unprocessed_pseudogene  -
ENSG00000278267 MIR6859-1   1:17369-17436   miRNA   -
ENSG00000243485 MIR1302-2HG 1:29554-31109   lincRNA +
ENSG00000284332 MIR1302-2   1:30366-30503   miRNA   +
ENSG00000237613 FAM138A 1:34554-36081   lincRNA -
ENSG00000268020 OR4G4P  1:52473-53312   unprocessed_pseudogene  +
ENSG00000240361 OR4G11P 1:57598-64116   transcribed_unprocessed_pseudogene  +
ENSG00000186092 OR4F5   1:65419-71585   protein_coding  +

Gencode GTF:

zcat gencode.v28.annotation.gtf.gz | awk 'BEGIN{FS="\t"}{split($9,a,";"); if($3~"gene") print a[1]"\t"a[3]"\t"$1":"$4"-"$5"\t"a[2]"\t"$7}' |sed 's/gene_id "//' | sed 's/gene_id "//' | sed 's/gene_type "//'| sed 's/gene_name "//' | sed 's/"//g' | awk 'BEGIN{FS="\t"}{split($3,a,"[:-]"); print $1"\t"$2"\t"a[1]"\t"a[2]"\t"a[3]"\t"$4"\t"$5"\t"a[3]-a[2];}' | sed "1i\Geneid\tGeneSymbol\tChromosome\tStart\tEnd\tClass\tStrand\tLength"  > gencode.v28_gene_annotation_table.txt

Output:

Geneid  GeneSymbol  Chromosome  Start   End Class   Strand  Length
ENSG00000223972.5    DDX11L1    chr1    11869   14409    transcribed_unprocessed_pseudogene +   2540
ENSG00000227232.5    WASH7P chr1    14404   29570    unprocessed_pseudogene -   15166
ENSG00000278267.1    MIR6859-1  chr1    17369   17436    miRNA  -   67
ENSG00000243485.5    MIR1302-2HG    chr1    29554   31109    lincRNA    +   1555
ENSG00000284332.1    MIR1302-2  chr1    30366   30503    miRNA  +   137
ENSG00000237613.2    FAM138A    chr1    34554   36081    lincRNA    -   1527
ENSG00000268020.3    OR4G4P chr1    52473   53312    unprocessed_pseudogene +   839
ENSG00000240361.2    OR4G11P    chr1    57598   64116    transcribed_unprocessed_pseudogene +   6518
ENSG00000186092.6    OR4F5  chr1    65419   71585    protein_coding +   6166
ADD COMMENT
0
Entering edit mode

What should have worked??

ADD REPLY
0
Entering edit mode

Adapting your code to just give the first two columns without a header (usage:
zcat my.gtf.gz | ./script_below > ids_and_names.txt
)

#!/bin/bash
#WARNING: This assumes that for "gene" features, the third attribute is the gene name (if present)
# If the third attribute is not the gene name, then the gene id (first column) is used.
cat $@ | \ 
    awk 'BEGIN{FS="\t"} \
         { \
             split($9,a,";"); \
             b=a[3]~"gene_name"?a[3]:a[1]; \
             if($3~"gene") print a[1]"\t"b \
         }' | \ 
    sed 's/gene_id "//g' | \ 
    sed 's/gene_name "//' | \ 
    sed 's/"//g' | \ 
    sed 's/ //g' | \ 
    sort

EXPLANATION:

  • cat $@ | will send either an input filename or incoming standard input into the pipe
  • \ line continuation
  • awk 'code' executes the awk code inside the single quotes
    • BEGIN{FS="\t"} sets the field separator to be tab characters
    • { start of code block
    • split($9,a,";") take the 9th field and using the delimiter ;, split its contents into the array a (i.e. the attributes)
    • ; separator between awk statements
    • b=a[3]~"gene_name"?a[3]:a[1]; assign b the third attribute if it contains the text gene_name, otherwise the first attribute (for more about the ternary construct x?y:z, see Wikipedia page)
    • if($3~"gene") print a[1]"\t"b if the third field matches gene, then print the first attribute, a tab, and then b
    • } end of code block
  • sed 's/gene_id "//g' replace all literal instances of gene_id " with nothing
  • sed 's/gene_name "//' replace first instance of gene_name " with nothing
  • sed 's/"//g' replace all double quotes with nothing
  • sed 's/ //g' replace all spaces with nothing
  • sort sort the results

and of course there are pipes connecting everything

(edited)

And here it is again, but this time it captures attributes by name, so it handles gene_id and gene_name being in any order:

#!/usr/bin/bash -l 
cat $@ | \ 
    sed 's/"//g' | `# remove quotes` \
    awk 'function notempty_or(x,y) {return x?x:y} \
         BEGIN{FS="\t"} \
         { \
            if($3!~"gene") { next }; \
            split($9, a, "; "); \
            for (i in a) { \
                split(a[i], j, " "); \
                attr[ j[1] ] = j[2]; \
            };\
            gene_id = notempty_or(attr["gene_id"],"NA"); \
            gene_name = notempty_or(attr["gene_name"],gene_id); \
            print gene_id "\t" gene_name ; \
            attr["gene_id"]=""; \
            attr["gene_name"]=""; \
         }' | \ 
    sed 's/;$//' |  `# remove terminal semicolons` \
    sed --regexp-extended 's/;(\s)/\1/' | `# remove semicolon before space character, but keep the space` \
    sort
ADD REPLY
0
Entering edit mode

Hello, thank you for your answer. I used your code with a little modification:

awk 'BEGIN{FS="\t"; OFS="\t"}
{split($9,x,";");
y=x[3]~"gene_name"?x[3]:x[1];
if($3=="gene") print x[1]"\t"y
}' GRCh38.gtf |
sed 's/gene_id//g' |
sed 's/gene_name//' |
sed 's/"//g' |
sed 's/ //g' |
sort |
uniq > GRCh38_gene_id_name.tsv

Basically, it works, but there is a problem.

ENSG00000224192.2 SEZ6L-AS1
ENSG00000224194.1 ENSG00000224194

The version number is missed if the gene doesn't has name. I am confused what is the problem. Do you have any ideas?

ADD REPLY
0
Entering edit mode

Do yourself a favor and load this into R with data.frame(rtracklayer::import(gtf)) and then use standard subsetting operations (same in python or any other higher language) rather than messing with these multiline awk codes.

u is your intended output. Example:

x <- data.frame(rtracklayer::import("https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/gencode.v46.annotation.gtf.gz"))
u <- unique(x[,c("gene_id", "gene_name")])
ADD REPLY
0
Entering edit mode

This is convinent. Thank you.

ADD REPLY
0
Entering edit mode

I have figured it out. Because if the gene doesn't have a name, the gene_name attribute is gene_id without version number in the gtf file.

ADD REPLY
2
Entering edit mode
9.6 years ago
Tariq Daouda ▴ 220

Hi,

I have written a GTF parser, it is part of pyGeno but you can also download it separately from github. It should make the extraction easier.

from pyGeno.tools.parsers.GTFTools import GTFFile

gtf = GTFFile(gtfFilePath)
for line in gtf :
   print line["gene_id"], line["gene_name"]

Cheers

ADD COMMENT
1
Entering edit mode
ADD COMMENT
1
Entering edit mode
6.3 years ago
Min Dai ▴ 160
awk 'BEGIN{FS="\t";OFS="\t"}{
  if($3=="exon"){
    print $5-$4+1,$9;
  }}' genes.gtf | tr -d ";\"" | awk 'BEGIN{FS=" ";OFS="\t"}{
    if($6=="p_id"){
      L[$9]+=$1;
    }else{
      L[$7]+=$1;
    }
    }END{
      for(i in L){
        print i"\t"L[i];
      }
      }' >genes_transcript_length.txt

I calculated length for both coding and noncoding transcripts.

Refer to: https://gist.github.com/sp00nman/10372555

ADD COMMENT
0
Entering edit mode
9.6 years ago

And why not using R directly?

The gtf file use to be either tabulated or space separated

You can run a read.table to get the gtf into R, and then, you can get rid of the undesired columns

ADD COMMENT
4
Entering edit mode

R is not an easy or fast parsing tool, and GTF attributes are contained within a column and require three levels of parsing per line. A command-line approach is probably better suited to this task.

ADD REPLY
0
Entering edit mode

I know perhaps I should make a separate question for this, but please bare with me. What command-line approach can be implemented to for example select only gene_id column in gtf file? I am aware of awk command, but as a newbie I have no clue how I would put this command together. Preferably a one-liner. Thanks!

ADD REPLY
1
Entering edit mode
$ gtf2bed < foo.gtf | cut -f4 > foo.gene_id_column.txt
ADD REPLY
1
Entering edit mode

Use above mentioned code from github. You will get GTF file in table format starting with geneid, chr_location (gene wise)...And also another file with transcriptid, chr_location (transcript wise)...

Or use this

A: Converting gtf format to bed format

This will give genewise table and a BED file. Both opproaches will work for you.

ADD REPLY

Login before adding your answer.

Traffic: 1359 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6