Question

BEDOPS gtf2bed conversion error with Ensembl GTF

4

Entering edit mode

9.1 years ago

bioinformatics.cancer ▴ 260

Hi,

I am trying to convert a Canine gene annotation (GTF) file downloaded from Ensembl to BED file using the gtf2bed tool within the BEDOPS application. Using this command gives an error:

$ gtf2bed < Canis_familiaris.CanFam3.1.85_noheader.gtf > Canis_familiaris.CanFam3.1.85_noheader.bed
Error: Potentially missing gene or transcript ID from GTF attributes (malformed GTF at line [1]?)

I checked the first few lines of the GTF file and it seems to match up with the required format:

$ head Canis_familiaris.CanFam3.1.85_noheader.gtf
X       ensembl gene    1575    5716    .       +       .       gene_id "ENSCAFG00000010935"; gene_version "3"; gene_source "ensembl"; gene_biotype "protein_coding";
X       ensembl transcript      1575    5716    .       +       .       gene_id "ENSCAFG00000010935"; gene_version "3"; transcript_id "ENSCAFT00000017396"; transcript_version "3"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding";
..
..

I looked at the source code on github for this tool and can see that is check for gene or transcript id and if not present gives this error. But the gene_id is present here in the first line, so not sure how it is reaching the error condition.

I would appreciate any help with troubleshooting this error.

Thank you,
Pankaj

gtf2bed ensembl bedops gtf • 21k views

ADD COMMENT • link updated 19 months ago by Alex Reynolds 36k • written 9.1 years ago by bioinformatics.cancer ▴ 260

4

Entering edit mode

2.2 years ago

DareDevil ★ 4.4k

You can generate BED files (from e.g. GTF file of the Ensembl release) by executing the following command in Linux Shell:

# For genes
grep -P "\tgene\t" your_ensembl.gtf | \
  cut -f1,4,5,7,9 | \
  sed 's/[[:space:]]/\t/g' | sed 's/[;|"]//g' | \
  awk -F $'\t' 'BEGIN { OFS=FS } { print $1,$2-1,$3,$6,".",$4,$10,$12,$14 }' | \
  sort -k1,1 -k2,2n > your_ensembl.gene.bed

# For transcripts
grep -P "\ttranscript\t" your_ensembl.gtf | \
  cut -f1,4,5,7,9 | \
  sed 's/[[:space:]]/\t/g' | sed 's/[;|"]//g' | \
  awk -F $'\t' 'BEGIN { OFS=FS } { print $1,$2-1,$3,$10,".",$4,$14,$16,$18 }' | \
  sort -k1,1 -k2,2n > your_ensembl.transcript.bed

ADD COMMENT • link updated 24 months ago by Ram 45k • written 2.2 years ago by DareDevil ★ 4.4k

0

Entering edit mode

In case you're using a version of grep without the -P argument, which enables Perl-compatible regular expressions, you can replace its invocation with a simple call to awk to filter for lines containing, e.g., "gene" or "transcript."

For example,

#!/bin/bash

cat /path/to/file.gtf \
    | awk '$3 == "gene"' \
    | cut -f1,4,5,7,9 \
    | sed 's/[[:space:]]/\t/g' \
    | sed 's/[;|"]//g' \
    | awk -F $'\t' 'BEGIN { OFS=FS } { print $1,$2-1,$3,$6,".",$4,$10,$12,$14 }' \
    | sort -k1,1 -k2,2n \
        > /path/to/file.bed

Here, I find the useless use of cat to be useful for quickly seeing what's going on—but YMMV.

ADD REPLY • link 23 months ago by kalavattam ▴ 380

2

Entering edit mode

9.1 years ago

lairdm ▴ 20

Good morning,

We've actually developed a tool at Ensembl to address issues like this quirk of GTF, File Chameleon [1], new in Ensembl 85.

Unfortunately a transcript_id is required even in gene records, as Alex says you can leave this blank, or the other solution I've commonly seen is to simply pad the record duplicating the gene_id in to the transcript_id.

File Chameleon will take any of the Ensembl flat files on our FTP site (GTF, GFF3, Fasta only so far) and transcribe it to correct these quirks, or other adjustments such as remapping to UCSC style chromosome names. If there's additional tweaking of our files that would be useful as part of retrieving them, we'd also love to hear suggestions. The tool is also available for offline use as well [2].

[1] http://www.ensembl.org/Homo_sapiens/Tools/FileChameleon?db=core

[2] https://github.com/FAANG/faang-format-transcriber

ADD COMMENT • link 9.1 years ago by lairdm ▴ 20

0

Entering edit mode

I've gotten a few reports about the GTF formatting error. Do you think Ensembl will adjust what it releases to meet spec, or was this done for some other reason?

ADD REPLY • link 7.9 years ago by Alex Reynolds 36k

score 22 · Accepted Answer · 2016-08-10

22

Entering edit mode

9.1 years ago

Alex Reynolds 36k

Note: This should no longer be an error with gtf2bed v2.4.40 and on.

I added more stringent GTF format validation to BEDOPS v2.4.20.

The error suggests that the first line is missing the transcript_id field. It has a gene_id field, as you note, but no transcript_id field. The GTF 2.2 specification indicates that this field is mandatory, though its value can be an empty string.

There are a couple solutions:

Use an older version of gtf2bed that doesn't apply this validation check (e.g., 2.4.19 or earlier)
Or, modify the GTF and add a placeholder field where none exists

I suggest the second solution. You could do the following:

$ awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' input.gtf | gtf2bed - > output.bed

This adds transcript_id ""; to lines in the GTF that do not contain that field, and leaves other lines unchanged.

The GTF that comes out of this awk statement is more valid, enough to get through the conversion step, and so it can be piped to gtf2bed to get BED as output.

ADD COMMENT • link 4.2 years ago by Alex Reynolds 36k

4

Entering edit mode

This method did not work for me. gtf2bed only outputs the gtf untouched .

ADD REPLY • link 8.9 years ago by tiago211287 ★ 1.5k

0

Entering edit mode

I don't have enough information to debug, but one suggestion is that you put a tee statement in between awk and gtf2bed so that you can examine what comes out of awk: https://en.wikipedia.org/wiki/Tee_(command)

ADD REPLY • link 8.9 years ago by Alex Reynolds 36k

1

Entering edit mode

The first solution (awk) worked. Thanks!

ADD REPLY • link 9.1 years ago by bioinformatics.cancer ▴ 260

0

Entering edit mode

Why is that the lines for "gene" only cannot be used and requires lines to have "transcript_id"? I was thinking how to get a BED file with gene IDs.

ADD REPLY • link 4.2 years ago by Arindam Ghosh ▴ 550

1

Entering edit mode

The next version of the kit will remove this requirement, and it will also allow the user to select a different key for retrieval into the ID field. This is available now in the development branch, available via make and make install etc.

ADD REPLY • link 4.2 years ago by Alex Reynolds 36k

0

Entering edit mode

I still received the error as of 2.4.41, but your second solution above worked.

ADD REPLY • link 19 months ago by Nat.Nataren ▴ 100

0

Entering edit mode

You might now get a warning, but it should not quit with an error. Ensembl really need to fix their output to match specification, but so it goes.

ADD REPLY • link 19 months ago by Alex Reynolds 36k