Parsing GENCODE GTF to simpler BED files: am I reinventing the wheel?
8.9 years ago
Charles Plessy ★ 2.9k

Dear Biostars,

I am using a shell script to transform a GENCODE GTF file into smaller BED files, which I use to annotate transcriptome (CAGE) data with information such as promoter/intron/exon classification or gene name.

As a reminder, a GENCODE GTF file looks like this:

$ zcat gencode.v23.annotation.gtf.gz | cut -c -80 | head
##description: evidence-based annotation of the human genome (GRCh38), version 2
##provider: GENCODE
##contact: gencode-help@sanger.ac.uk
##format: gtf
##date: 2015-07-15
chr1    HAVANA    gene    11869    14409    .    +    .    gene_id "ENSG00000223972.5"; gene_type "trans
chr1    HAVANA    transcript    11869    14409    .    +    .    gene_id "ENSG00000223972.5"; transcript
chr1    HAVANA    exon    11869    12227    .    +    .    gene_id "ENSG00000223972.5"; transcript_id "E
chr1    HAVANA    exon    12613    12721    .    +    .    gene_id "ENSG00000223972.5"; transcript_id "E
chr1    HAVANA    exon    13221    14409    .    +    .    gene_id "ENSG00000223972.5"; transcript_id "E

The BED files I produce look like this:

$ head gencode.v23.annotation.bed
chr1    11368    12369    promoter    0    +
chr1    11858    11879    boundary    0    +
chr1    11868    12227    exon    0    +
chr1    11868    14409    gene    0    +
chr1    11868    14409    transcribed_unprocessed_pseudogene_DDX11L1    0    +
chr1    11999    12020    boundary    0    +
chr1    12009    12057    exon    0    +
chr1    12046    12067    boundary    0    +
chr1    12168    12189    boundary    0    +
chr1    12178    12227    exon    0    +
$ head gencode.v23.annotation.genes.bed
chr1    11868    14409    DDX11L1    0    +
chr1    14403    29570    WASH7P    0    -
chr1    17368    17436    MIR6859-1    0    -
chr1    29553    31109    RP11-34P13.3    0    +
chr1    30365    30503    MIR1302-2    0    +
chr1    34553    36081    FAM138A    0    -
chr1    52472    53312    OR4G4P    0    +
chr1    62947    63887    OR4G11P    0    +
chr1    69090    70008    OR4F5    0    +
chr1    89294    133723    RP11-34P13.7    0    -
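
For readers comparing the two formats, the coordinate shift between the GTF and the gene-level BED above can be sketched as follows. This is a minimal illustration, not the script from the question: the function name `gtf_genes_to_bed` is hypothetical, and it assumes each `gene` record carries a `gene_name "..."` attribute as GENCODE files do. GTF intervals are 1-based and closed while BED intervals are 0-based and half-open, so only the start is decremented.

```shell
# Hypothetical helper: turn GTF "gene" records into BED6 lines named by gene symbol.
gtf_genes_to_bed() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
    /^#/ { next }                      # skip the "##" header lines
    $3 == "gene" {
      name = "."                       # fallback when no gene_name is present
      # extract the value of gene_name "..." from the attribute column
      if (match($9, /gene_name "[^"]+"/))
        name = substr($9, RSTART + 11, RLENGTH - 12)
      # 1-based closed (GTF) -> 0-based half-open (BED): start - 1, end unchanged
      print $1, $4 - 1, $5, name, 0, $7
    }'
}
```

It could be used as `zcat gencode.v23.annotation.gtf.gz | gtf_genes_to_bed > genes.bed`.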

Rather than maintaining a script on my own, I would prefer to use a widely used, well-tested, well-maintained tool. Is there one you would recommend?

Thanks!

GTF BED GENCODE • 3.5k views

I just came across this old post and would like to note that I no longer use this script. Instead, I load the GENCODE file in R and parse it with Bioconductor, using functions from the CAGEr package such as ranges2annot.

8.9 years ago

You could use the GTF option in BEDOPS convert2bed, or the equivalent wrapper script gtf2bed:

$ convert2bed -i gtf -o bed < foo.gtf > foo.bed
$ gtf2bed < foo.gtf > foo.bed

If you need the columns in a particular order, or only a subset of the BED columns, you can pipe the result to common Unix tools like cut and awk.

$ gtf2bed < foo.gtf | cut -f1-6 > foo.bed6
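
As a sketch of the awk route, the function below keeps only gene records and names them by gene symbol, similar to the genes.bed file in the question. It rests on an assumption worth checking against your BEDOPS version: that gtf2bed writes the feature type in column 8 and the GTF attributes in column 10. The function name `genes_bed6` is hypothetical.

```shell
# Hypothetical helper: reduce gtf2bed output to BED6 gene records.
# Assumes column 8 = feature type and column 10 = GTF attributes.
genes_bed6() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
    $8 == "gene" {
      name = $4                        # fall back to the existing ID column
      # prefer the gene symbol when a gene_name "..." attribute is present
      if (match($10, /gene_name "[^"]+"/))
        name = substr($10, RSTART + 11, RLENGTH - 12)
      print $1, $2, $3, name, 0, $6    # coordinates are already BED-style here
    }'
}
```

Usage would then be `gtf2bed < foo.gtf | genes_bed6 > foo.genes.bed`.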