I have tried this script but did not work.
Do you guys have any working method to convert gtf in to bed format ?
Thanks
I have tried this script but did not work.
Do you guys have any working method to convert gtf in to bed format ?
Thanks
BEDOPS includes a gtf2bed
conversion utlity, which is lossless in that it permits reconversion back to GTF after, for example, applying set and statistical operations with bedops
, bedmap
, etc.:
$ gtf2bed < foo.gtf > foo.bed
Apply some operations, perhaps to build a subset of elements that overlap some ad-hoc regions-of-interest, e.g.:
$ bedops --element-of 1 foo.bed regions_of_interest.bed > foo_subset.bed
To reconvert, a simple awk
statement puts columns back into GTF-ordering, along with the correct, 1-based coordinate index adjustment:
$ awk `{ print $1"\t"$7"\t"$8"\t"($2+1)"\t"$3"\t"$5"\t"$6"\t"$9"\t"(substr($0, index($0,$10))) }' foo_subset.bed > foo_subset.gtf
My solution, based on Ian's answer:
zcat ../../../data/annotations/gencode.v24.annotation.gtf.gz | awk 'OFS="\t" {if ($3=="gene") {print $1,$4-1,$5,$10,$16,$7}}' | tr -d '";' | head
chr1 11868 14408 ENSG00000223972.5 . +
chr1 14403 29569 ENSG00000227232.5 . -
chr1 17368 17435 ENSG00000278267.1 . -
chr1 29553 31108 ENSG00000243485.3 . +
chr1 30365 30502 ENSG00000274890.1 . +
chr1 34553 36080 ENSG00000237613.2 . -
chr1 52472 53311 ENSG00000268020.3 . +
chr1 62947 63886 ENSG00000240361.1 . +
chr1 69090 70007 ENSG00000186092.4 . +
chr1 89294 133722 ENSG00000238009.6 . -
Gives you all the genes, with their name, in bed format.
You can use the score field to store other info you are interested in, like the common gene name:
zcat ../../../data/annotations/gencode.v24.annotation.gtf.gz | awk 'OFS="\t" {if ($3=="gene") {print $1,$4-1,$5,$10,$16,$7}}' | tr -d '";' | head
chr1 11868 14408 ENSG00000223972.5 DDX11L1 +
chr1 14403 29569 ENSG00000227232.5 WASH7P -
chr1 17368 17435 ENSG00000278267.1 MIR6859-1 -
chr1 29553 31108 ENSG00000243485.3 RP11-34P13.3 +
chr1 30365 30502 ENSG00000274890.1 MIR1302-2 +
chr1 34553 36080 ENSG00000237613.2 FAM138A -
chr1 52472 53311 ENSG00000268020.3 OR4G4P +
chr1 62947 63886 ENSG00000240361.1 OR4G11P +
chr1 69090 70007 ENSG00000186092.4 OR4F5 +
chr1 89294 133722 ENSG00000238009.6 RP11-34P13.7 -
You could use a simple AWK one-liner (Linux):
$ cat file.gtf | awk '{print $1,$4,$5,"name",$6,$7}'
$1
is the first column of your TAB delimited GTF file, $2
is the second column, $3
is the third, etc. Not sure what you would use a name, I guess you could use $3
.
EDIT:
If you don't like the command line then Galaxy has a tool "ConvertFormats > GFF-to-BED". The tool does use $3
as the name.
gtf format is 1-based start: http://www.ensembl.org/info/website/upload/gff.html
bed format is 0-based start: https://genome.ucsc.edu/FAQ/FAQformat.html#format1
So this solution will get all coordinates wrong by one base.
GTF annotates transcript and exon information in separate rows. If you use awk to print just columns what you get is start end of exon or transcript separately but not together as in BED12 format http://genome.ucsc.edu/FAQ/FAQformat.html.
I found this handy link for bed files: https://github.com/stevekm/reference-annotations
Does not work anymore.
$ make ensembl-hg38
wget ftp://ftp.ensembl.org/pub/release-91/gtf/homo_sapiens/Homo_sapiens.GRCh38.91.chr.gtf.gz
--2021-03-02 12:19:45-- ftp://ftp.ensembl.org/pub/release-91/gtf/homo_sapiens/Homo_sapiens.GRCh38.91.chr.gtf.gz
=> 'Homo_sapiens.GRCh38.91.chr.gtf.gz'
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.197.76
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.197.76|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/release-91/gtf/homo_sapiens ... done.
==> SIZE Homo_sapiens.GRCh38.91.chr.gtf.gz ... 41854359
==> PASV ... done. ==> RETR Homo_sapiens.GRCh38.91.chr.gtf.gz ... done.
Length: 41854359 (40M) (unauthoritative)
100%[=========================================================================>] 41,854,359 1.25MB/s in 23s
2021-03-02 12:20:10 (1.75 MB/s) - 'Homo_sapiens.GRCh38.91.chr.gtf.gz' saved [41854359]
zcat Homo_sapiens.GRCh38.91.chr.gtf.gz | grep -Ev '^#' | grep -w 'gene' | sed -e 's/^/chr/' -e 's/^chrMT/chrM/' > Homo_sapiens.GRCh38.91.chr.gtf
gtf2bed < Homo_sapiens.GRCh38.91.chr.gtf > Homo_sapiens.GRCh38.91.chr.bed
Error: Potentially missing gene or transcript ID from GTF attributes (malformed GTF at line [1]?)
make: *** [Makefile:75: Homo_sapiens.GRCh38.91.chr.bed] Error 61
rm Homo_sapiens.GRCh38.91.chr.gtf Homo_sapiens.GRCh38.91.chr.gtf.gz
Hi, Alternatively use pygtftk (here using CLI):
gtftk get_example | gtftk convert -f bed -n feature,gene_id,transcript_id
There are additional arguments that may be helpful:
gtftk get_example | gtftk convert -f bed -n feature,gene_id,transcript_id -s '^' -m 'a_test'
Best
Disclosure I'm the pygtftk developer.
needed to google it myself but like the approach! https://github.com/dputhier/pygtftk
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
gtf2bed from bedops do not work for GENCODE comprehensive gtf file if there are features without transcript ID in the attributes:
This is a long-standing problem with research groups putting out malformed GTF for some still-unexplained reason. See A: BEDOPS gtf2bed conversion error with Ensembl GTF for a potential solution.
Had this today. The solution that worked for me was:
cat input.gtf.gz | gunzip - | grep transcript_id | grep gene_id | convert2bed --do-not-sort --input=gtf - > output.bed
The gawk command to make BED to GTF is fine. However, it would be a complete round trip of convenience if there is
bed2gtf
command in BEDOPS :)It would require some assumptions about how conversion was done. So long as the BED data were created with gtf2bed, it would be easier to make those assumptions, however.