Question

Re-Annotating An Existing Gff File?

0

Entering edit mode

12.3 years ago

Luca Beltrame ▴ 250

Among my data I have a GFF file that was generated by an external service I have no control on. The main... feature of this GFF file is that the annotations are semi-bogus, because they only list the chromosome ID and the coordinates, like

"NC_000014.8:76427018-76427632"

Now, I'd like to map these data back to the genes they were taken from (this GFF was made out of a targeted resequencing kit selection). I've tried using the UCSC but the best I can do is select the whole genes that fall within the coordinates, and I don't want that, rather the opposite: which genes are present in those coordinates.

Is there a straightforward solution to this? I don't mind writing any program for that (Python would be preferred, but R could do).

gff annotation • 5.4k views

ADD COMMENT • link updated 9.5 years ago by Biostar 20 • written 12.3 years ago by Luca Beltrame ▴ 250

score 4 · Answer 1 · 2012-11-13

The BEDOPS suite includes a script for converting GFF to BED. You could generate a file containing your genomic ranges-of-interest with this script. Separately, obtain a list of genes (GENCODE, etc.) and convert them to BED format.

Next, use the bedmap tool in the BEDOPS suite to map genes to ranges, i.e., calculating a result listing those genes located within your genomic ranges of interest.

To demonstrate, let's say your gene annotations are formatted as a BED file called annotations.bed, like so:

chr1    1000    1500    gene-1
chr1    4000    4300    gene-2
chr1    6200    6450    gene-3
chr1    9120    9675    gene-4

We're following UCSC's definition of a BED file, with the first column representing the chromosome name, the second and third columns the genomic coordinates, and the fourth column the ID or name of the annotation. We use a tab character (\t) as a field delimiter.

We do an ad-hoc search through this file to find genes on the chromosome chr1 which are contained within the range [2000, 7000), using bedmap as follows:

$ echo -e "chr1\t2000\t7000" | bedmap --echo --echo-map-id - annotations.bed
chr1    2000    7000|gene-2;gene-3

The result indicates that gene-2 and gene-3 overlap the BED coordinate range [2000, 7000) on chromosome chr1 by one or more bases. This default overlap criterion can be adjusted to be more stringent with the --fraction-{ref,map,both,either} operators, up to --exact for an exact matching of reference and map elements. Refer to the overlap portion of the bedmap documentation for more detail.

The echo -e command passes bedmap that range via standard input, which bedmap consumes with the - hyphen character. The bedmap tool then maps annotations to that input range and reports its answer.

Results are a semi-colon-delimited list of genes for that range, taken from the ID column of annotations.bed. This is because we used the --echo-map-id operator, which outputs mapped IDs. There are four --echo-map-* operators that offer the entire mapped annotation, or various pieces of it (ID, genomic range, scores). Please refer to the documentation for more detail.

The output from a bedmap operation is trivial to parse with Perl scripts or Java programs, through the usual split() or regular expression calls, etc.

Instead of a trivial example that uses echo and standard input, you can instead specify a BED file containing multiple ranges, each on their own line. In this application, this would be the GFF file that you converted to BED.

For example, let's say we have a file called rangesOfInterest.bed that contains two ranges. For simplicity, we only have two ranges, but you can specify as many ranges as you like, each on their own line. Further, ranges can have additional columns, like IDs or scores. But here is a simple input file:

chr1    2000    7000
chr1    5000    10000

You can then use bedmap as follows to get results containing the annotations of interest:

$ bedmap --echo --echo-map-id rangesOfInterest.bed annotations.bed
chr1    2000    7000|gene-2;gene-3
chr1    5000    10000|gene-3;gene-4

Further, if you want BED-formatted output, use the --delim operator in conjunction with --echo, as follows:

$ bedmap --echo --echo-map-id --delim '\t' rangesOfInterest.bed annotations.bed

This replaces the pipe character (|) with a tab (\t), so that the output is a minimal BED-formatted result:

$ bedmap --echo --echo-map-id --delim '\t' rangesOfInterest.bed annotations.bed
chr1    2000    7000    gene-2;gene-3
chr1    5000    10000   gene-3;gene-4

This is very useful if you want to pipe the results to another BEDOPS command or another utility that consumes BED-formatted data, like awk, Perl or Python scripts, etc.

BEDOPS tools are modular and written to accept standard input (such as what was done above with echo -e), so it is easy to perform further set or mapping operations with bedops and bedmap, respectively, or compress the output with starch, etc. in an extended pipeline. Piping standard output from one program to standard input of another program reduces overall I/O load on a computer and can improve the speed of calculations.

The bedmap tool is also useful for quantitative work. While the gene annotations in this example do not contain any numerical data, we might want to quickly count how many genes are binned within our ranges of interest, using the --count operator:

$ bedmap --echo --echo-map-id --count --delim '\t' rangesOfInterest.bed annotations.bed
chr1    2000    7000    gene-2;gene-3    2
chr1    5000    10000   gene-3;gene-4    2

Granted, this is a pretty mundane example, but you could imagine asking how many annotations are contained within a moving window over a genome. This would tell you where your annotations are concentrated, which could have biological significance.

Other numerical operators are available for bedmap, useful particularly if your map elements contain score information, like tag counts, p-values, expression levels, etc.

score 4 · Answer 2 · 2012-11-13

Seems everyone is asking how to annotate genomics data these days. Also see

score 0 · Answer 3 · 2012-11-13

0

Entering edit mode

12.3 years ago

Damian Kao 16k

So you want to find genes that overlap with the regions defined in your .gff file. You can convert both your .gff and the annotations .gff into .bed and find overlaps between the two with bedtools' intersectBed command.

ADD COMMENT • link 12.3 years ago by Damian Kao 16k