Dear Biostars,
I am using a shell script to transform a GENCODE GTF file into smaller BED files, which I use to annotate transcriptome (CAGE) data with information such as promoter/intron/exon classification or gene name.
As a reminder, a GENCODE GTF file looks like this:
$ zcat gencode.v23.annotation.gtf.gz | cut -c -80 | head
##description: evidence-based annotation of the human genome (GRCh38), version 2
##provider: GENCODE
##contact: gencode-help@sanger.ac.uk
##format: gtf
##date: 2015-07-15
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "trans
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5"; transcript
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.5"; transcript_id "E
chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000223972.5"; transcript_id "E
chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "E
The BED files I produce look like this:
$ head gencode.v23.annotation.bed
chr1 11368 12369 promoter 0 +
chr1 11858 11879 boundary 0 +
chr1 11868 12227 exon 0 +
chr1 11868 14409 gene 0 +
chr1 11868 14409 transcribed_unprocessed_pseudogene_DDX11L1 0 +
chr1 11999 12020 boundary 0 +
chr1 12009 12057 exon 0 +
chr1 12046 12067 boundary 0 +
chr1 12168 12189 boundary 0 +
chr1 12178 12227 exon 0 +
$ head gencode.v23.annotation.genes.bed
chr1 11868 14409 DDX11L1 0 +
chr1 14403 29570 WASH7P 0 -
chr1 17368 17436 MIR6859-1 0 -
chr1 29553 31109 RP11-34P13.3 0 +
chr1 30365 30503 MIR1302-2 0 +
chr1 34553 36081 FAM138A 0 -
chr1 52472 53312 OR4G4P 0 +
chr1 62947 63887 OR4G11P 0 +
chr1 69090 70008 OR4F5 0 +
chr1 89294 133723 RP11-34P13.7 0 -
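For illustration, a gene-level file like the one above could be derived roughly as follows. This is only a minimal sketch, not the script I actually use: the function name gtf_genes_to_bed is hypothetical, and it assumes a tab-separated GTF whose gene lines carry a gene_name attribute (the listing above is cut at 80 characters, so that attribute is not visible there). The start coordinate is shifted by one because GTF is 1-based and inclusive, while BED is 0-based and half-open.

```shell
# Hypothetical helper: read GTF on stdin, write BED6 gene records on stdout.
gtf_genes_to_bed() {
  awk -F '\t' -v OFS='\t' '
    /^#/ { next }                      # skip the ## header lines
    $3 == "gene" {
      name = "."                       # fallback when gene_name is absent
      if (match($9, /gene_name "[^"]+"/)) {
        # drop the 11-character prefix (gene_name, space, quote) and the closing quote
        name = substr($9, RSTART + 11, RLENGTH - 12)
      }
      # GTF start is 1-based; BED start is 0-based, end is unchanged
      print $1, $4 - 1, $5, name, 0, $7
    }'
}
```

It would be called as zcat gencode.v23.annotation.gtf.gz | gtf_genes_to_bed, with the output redirected to a BED file. Promoters, exons and boundaries need additional rules that this sketch does not cover.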
Instead of maintaining a script by myself, I would love to use a widely used, well-tested, well-maintained tool. Do you have something to recommend?
Thanks!
Having just come across this old post, I would like to add that I no longer use this script. Instead, I load the GENCODE file in R and parse it with Bioconductor, using functions of the CAGEr package such as ranges2annot.