Is there any tool/API available to convert GTF/GFF to JSON format?
4.0 years ago

Hi all

Before I start writing my own code to convert a GTF file into JSON, has anyone come across an API or tool that either converts the file or lets me download it in JSON?

I am looking for a format like this - gene -> transcript -> exon :

"MOS": {
    "NM_005372.1": [
        {
            "exon_number": "1",
            "start": 57025501,
            "end": 57026541
        },
        {
            "exon_number": "1",
            "start": 57025504,
            "end": 57026541
gtf gff json ncbi ensembl • 2.8k views
2.9 years ago
davmlaw ▴ 130

Hi, I have written a Python library called PyReference which does exactly that!

It comes with a command pyreference_gff_to_json.py to turn GTF or GFF (RefSeq and Ensembl) into a gzipped JSON file.

There's also a Python wrapper around the JSON, which allows you to write genomics code more naturally.

Python isn't known for being fast but the library function for reading a JSON file is highly optimised - the following takes less than 4 seconds on my laptop:

import numpy as np
import pyreference

reference = pyreference.Reference()

my_gene_ids = ["MSN", "GATA2", "ZEB1"]
for gene in reference[my_gene_ids]:
    average_length = np.mean([t.length for t in gene.transcripts])
    print("%s average length = %.2f" % (gene, average_length))
    print(gene.iv)
    for transcript in gene.transcripts:
        if transcript.is_coding:
            threep_utr = transcript.get_3putr_sequence()
            print("%s end of 3putr: %s" % (transcript.get_id(), threep_utr[-20:]))
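
For comparison, here is a minimal sketch of a direct conversion into the gene -> transcript -> exon layout from the question, using only the standard library. It is not PyReference's code or its JSON schema. It assumes GTF-style attributes with quoted values (`gene_id "..."; transcript_id "..."; exon_number "1";`), so GFF3 `key=value` attributes and unquoted exon numbers would need a different parser. The function name `gtf_to_nested` and the sample line (chromosome, source and strand are made up; only the gene, transcript and coordinates come from the question) are illustrative.

```python
import json
import re
from collections import defaultdict

# Matches GTF attribute pairs such as: gene_id "MOS"; exon_number "1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gtf_to_nested(lines):
    """Build {gene: {transcript: [exon, ...]}} from an iterable of GTF lines."""
    genes = defaultdict(lambda: defaultdict(list))
    for line in lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # A GTF record has 9 tab-separated columns; keep exon features only.
        if len(fields) != 9 or fields[2] != "exon":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        # Prefer a gene symbol if present, otherwise fall back to gene_id.
        gene = attrs.get("gene_name") or attrs.get("gene_id")
        transcript = attrs.get("transcript_id")
        if gene is None or transcript is None:
            continue
        genes[gene][transcript].append({
            "exon_number": attrs.get("exon_number"),
            "start": int(fields[3]),
            "end": int(fields[4]),
        })
    # Convert the nested defaultdicts to plain dicts for json.dumps.
    return {g: dict(t) for g, t in genes.items()}


# Illustrative record: columns 1, 2 and 7 are placeholders.
sample = [
    'chr8\tRefSeq\texon\t57025501\t57026541\t.\t-\t.\t'
    'gene_id "MOS"; transcript_id "NM_005372.1"; exon_number "1";\n',
]
nested = gtf_to_nested(sample)
print(json.dumps(nested, indent=4))
```

For a real file, pass an open file handle instead of `sample`, e.g. `gtf_to_nested(open("annotation.gtf"))`, where the path is a placeholder.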
4.0 years ago

I wrote a tool, gtf2xml: http://lindenb.github.io/jvarkit/Gtf2Xml.html

The XML can then be converted to JSON with an XSLT stylesheet and xsltproc (see https://github.com/lindenb/xslt-sandbox/blob/master/stylesheets/bio/ncbi/pubmed2json.xsl for an example).

4.0 years ago

Perhaps this answer might be of general use:

"Is there a JSON-based genomic feature format?" https://bioinformatics.stackexchange.com/questions/10386/is-there-a-json-based-genomic-feature-format/10387#10387

Using an existing format with a stable schema may be a better approach, especially if you will share these files with others.

4.0 years ago
vkkodali_ncbi ★ 3.8k

NCBI Datasets produces a data report in JSON format that may contain all of the information you seek. You can download the command-line tool and try it out as follows:

datasets download gene gene-id 5768 --filename test.zip

One of the output files, ncbi_datasets/data/data_report.jsonl, contains the following:

"transcripts": [
    {
    "accessionVersion": "NM_002826.5",
    "cds": {
        "accessionVersion": "NM_002826.5",
        "range": [
        {
            "begin": "40",
            "end": "2283"
        }
        ]
    },
    "ensemblTranscript": "ENST00000367602.8",
    "exons": {
        "accessionVersion": "NC_000001.11",
        "range": [
        {
            "begin": "180154869",
            "end": "180155172",
            "order": 1
        },
        {
            "begin": "180166491",
            "end": "180166591",
            "order": 2
        },

This is not quite a tool to convert an existing GTF/GFF3 file to JSON, but if you are dealing with NCBI Gene annotation, it can be an option.
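
If the report layout matches the excerpt above, reshaping it into the gene -> transcript -> exon structure from the question could look roughly like this sketch. The keys `transcripts`, `accessionVersion`, `exons`, `range`, `begin`, `end` and `order` are taken from the excerpt. The top-level `symbol` key and the function name `report_to_nested` are assumptions, as is the placeholder symbol `GENE_X` in the sample record, so check them against your own data_report.jsonl.

```python
import json


def report_to_nested(jsonl_lines):
    """Reshape data_report.jsonl records into {gene: {transcript: [exon, ...]}}."""
    nested = {}
    for line in jsonl_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        # Assumption: the gene symbol is stored under a top-level "symbol" key.
        symbol = record.get("symbol", "unknown")
        transcripts = {}
        for tx in record.get("transcripts", []):
            exon_ranges = tx.get("exons", {}).get("range", [])
            transcripts[tx["accessionVersion"]] = [
                {
                    "exon_number": str(exon["order"]),
                    # begin/end are strings in the excerpt, so cast to int.
                    "start": int(exon["begin"]),
                    "end": int(exon["end"]),
                }
                for exon in exon_ranges
            ]
        nested[symbol] = transcripts
    return nested


# One record built from the excerpt above; "GENE_X" is a placeholder symbol.
sample_line = json.dumps({
    "symbol": "GENE_X",
    "transcripts": [{
        "accessionVersion": "NM_002826.5",
        "exons": {
            "accessionVersion": "NC_000001.11",
            "range": [
                {"begin": "180154869", "end": "180155172", "order": 1},
                {"begin": "180166491", "end": "180166591", "order": 2},
            ],
        },
    }],
})
result = report_to_nested([sample_line])
print(json.dumps(result, indent=4))
```

Note that the exon coordinates in the report are given on the genomic accession listed under `exons` (NC_000001.11 in the excerpt), while the `cds` range is given on the transcript itself.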
