I have a file in GFF format and I need to convert it to BED format. What do I do?
Both formats are tab-delimited text files used to represent DNA features in genomes. The order of the columns differs between the two, and each format has columns for attributes that the other lacks. Nonetheless, the most important difference between them is the coordinate system each assumes.
The BED format, developed at UCSC, uses zero-based indexing and a half-open interval (the end coordinate is excluded), whereas the GFF format, developed at Sanger, assumes a 1-based coordinate system that includes both the start and end coordinates. Therefore the [0,100) interval in BED format corresponds to [1,100] in GFF format, and both are 100 bases long. That is, the first element in BED format has an index of 0 and the last (100th) element has an index of 99, whereas in GFF the first element has an index of 1 and the last element has an index of 100.
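The off-by-one relationship described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names `gff_to_bed_interval` and `bed_to_gff_interval` are made up for this example and do not come from any library.

```python
def gff_to_bed_interval(start, end):
    """Convert a 1-based, closed GFF interval to a 0-based, half-open BED interval."""
    if start < 1 or end < start:
        raise ValueError(f"not a valid GFF interval: [{start},{end}]")
    # Only the start shifts; the inclusive GFF end equals the exclusive BED end.
    return start - 1, end


def bed_to_gff_interval(start, end):
    """Convert a 0-based, half-open BED interval to a 1-based, closed GFF interval."""
    # Assumption for this sketch: zero-length BED features (start == end) are rejected.
    if start < 0 or end <= start:
        raise ValueError(f"not a valid non-empty BED interval: [{start},{end})")
    return start + 1, end


bed_start, bed_end = gff_to_bed_interval(1, 100)
# Length in BED is end - start; length in GFF is end - start + 1.
bed_length = bed_end - bed_start
```

Because only the start coordinate changes, a round trip through both functions returns the original interval.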
To convert between the two you may use Galaxy: select the section called Convert Formats, which lists various transformation options.
You can also convert it in Galaxy: go to 'Convert Formats' and you will find a 'BED-to-GFF converter'.
Here's a Perl script I wrote, in case you want to do something locally.
There's some code in there for translating yeast chromosome names that can be removed if not needed. I also used a Site feature in the GFF file as the region ID, which might also need tweaking, depending on which features you're interested in.
#!/usr/bin/perl -w
use strict;
use Bio::Tools::GFF;
use feature qw(say switch);
my $gffio = Bio::Tools::GFF->new(-fh => \*STDIN, -gff_version => 2);
my $feature;
while ($feature = $gffio->next_feature()) {
    # print $gffio->gff_string($feature)."\n";
    # cf. <http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml>
    my $seq_id = $feature->seq_id();
    my $start = $feature->start() - 1;
    my $end = $feature->end();
    my $strand = $feature->strand();
    my @sites = $feature->get_tag_values('Site');
    # translate strand
    given ( $strand ) {
        when ($_ == 1) { $strand = "+"; }
        when ($_ == -1) { $strand = "-"; }
    }
    # translate yeast chromosome to UCSC browser-readable chromosome
    # cf. <http://www.yeastgenome.org/sgdpub/Saccharomyces_cerevisiae.pdf>
    given ( $seq_id ) {
        when ( $_ eq "I" ) { $seq_id = "chr1"; }
        when ( $_ eq "II" ) { $seq_id = "chr2"; }
        when ( $_ eq "III" ) { $seq_id = "chr3"; }
        when ( $_ eq "IV" ) { $seq_id = "chr4"; }
        when ( $_ eq "V" ) { $seq_id = "chr5"; }
        when ( $_ eq "VI" ) { $seq_id = "chr6"; }
        when ( $_ eq "VII" ) { $seq_id = "chr7"; }
        when ( $_ eq "VIII" ) { $seq_id = "chr8"; }
        when ( $_ eq "IX" ) { $seq_id = "chr9"; }
        when ( $_ eq "X" ) { $seq_id = "chr10"; }
        when ( $_ eq "XI" ) { $seq_id = "chr11"; }
        when ( $_ eq "XII" ) { $seq_id = "chr12"; }
        when ( $_ eq "XIII" ) { $seq_id = "chr13"; }
        when ( $_ eq "XIV" ) { $seq_id = "chr14"; }
        when ( $_ eq "XV" ) { $seq_id = "chr15"; }
        when ( $_ eq "XVI" ) { $seq_id = "chr16"; }
        default { }
    }
    # output
    print "$seq_id\t$start\t$end\t$sites[0]\t0.0\t$strand\n";
}
$gffio->close();
To use it:
gff2bed.pl < data.gff > data.bed
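For anyone without BioPerl installed, the same column mapping can be sketched in plain Python. This is a rough equivalent under stated assumptions, not a port of the script: the names `gff_line_to_bed`, `convert` and `YEAST_TO_UCSC` are invented for this example, attributes are assumed to be GFF2-style (`Tag "value" ; Tag "value"`) with no semicolons inside quoted values, and the score column is written as the integer 0 rather than 0.0.

```python
# Roman-numeral yeast chromosome names mapped to UCSC-style names.
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII",
         "IX", "X", "XI", "XII", "XIII", "XIV", "XV", "XVI"]
YEAST_TO_UCSC = {name: f"chr{i}" for i, name in enumerate(ROMAN, start=1)}


def gff_line_to_bed(line, name_tag="Site"):
    """Return a six-column BED line for one GFF2 feature line.

    Comment and blank lines return None.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-separated columns: {line!r}")
    seq_id, start, end, strand = fields[0], fields[3], fields[4], fields[6]
    attributes = fields[8] if len(fields) > 8 else ""

    # Unknown sequence names pass through unchanged.
    chrom = YEAST_TO_UCSC.get(seq_id, seq_id)

    # Use the value of the first matching attribute tag as the BED name.
    name = "."
    for attr in attributes.split(";"):
        parts = attr.strip().split(None, 1)
        if len(parts) == 2 and parts[0] == name_tag:
            name = parts[1].strip().strip('"')
            break

    bed_strand = strand if strand in ("+", "-") else "."
    # GFF start is 1-based inclusive; BED start is 0-based, so subtract 1.
    return "\t".join([chrom, str(int(start) - 1), end, name, "0", bed_strand])


def convert(lines):
    """Yield BED lines for an iterable of GFF lines, skipping comments."""
    for line in lines:
        bed = gff_line_to_bed(line)
        if bed is not None:
            yield bed
```

One way to use it would be to loop over `convert(sys.stdin)` and print each result, which mirrors the shell redirection shown above.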
There is also Galaxy, which offers the solution highlighted below. You can also take a look at this link for a Python script that performs the same trick. Also take a look at this link.
I recently developed bed2gff, a tool written in Rust that quickly converts .bed files to GFF3 format. It could be of help here!
I'll answer my own question here as it is a demo for now
Someone should write a well-documented, well-tested Python module or script to do this! Many current converters either discard CDS information or put every CDS or mRNA on its own line. It would be nice if this script had an option to include CDSs on the same line using the extended BED format.
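To make the "same line" idea concrete, here is a minimal sketch of how one transcript's exons and coding region could be packed into a single 12-column BED line. It is not an existing module: the function name `bed12_line` and its parameters are hypothetical, the input intervals are assumed to be GFF-style (1-based, inclusive) and non-overlapping, and grouping GFF features by transcript is left out.

```python
def bed12_line(chrom, name, strand, exons, cds=None, score=0):
    """Build one BED12 line from GFF-style (1-based, inclusive) intervals.

    exons: list of (start, end) tuples, assumed non-overlapping.
    cds:   optional (start, end) tuple spanning the coding region.
    """
    if not exons:
        raise ValueError("at least one exon is required")

    # Convert to 0-based, half-open coordinates and sort by start.
    blocks = sorted((start - 1, end) for start, end in exons)
    chrom_start = blocks[0][0]
    chrom_end = blocks[-1][1]

    if cds is None:
        # With no coding region, thickStart and thickEnd are both set to chromStart.
        thick_start = thick_end = chrom_start
    else:
        thick_start, thick_end = cds[0] - 1, cds[1]

    # Block sizes are lengths; block starts are offsets relative to chromStart.
    block_sizes = ",".join(str(end - start) for start, end in blocks)
    block_starts = ",".join(str(start - chrom_start) for start, _ in blocks)

    fields = [chrom, chrom_start, chrom_end, name, score, strand,
              thick_start, thick_end, "0", len(blocks),
              block_sizes, block_starts]
    return "\t".join(str(field) for field in fields)


example = bed12_line("chr1", "tx1", "+", [(101, 200), (301, 400)], cds=(151, 350))
```

Here the two exons become two blocks and the CDS becomes the thick region, so the whole transcript fits on one line instead of one line per feature.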
https://bitbucket.org/galaxy/galaxy-central/src/61b09dc1dff2/tools/filters/bed_to_gff_converter.py
I used the Perl script submitted by Alex Reynolds and it worked absolutely fine.
The question here is exactly the opposite of the title of the question