Question

How to parse GFF3 file and get start and end coordinates in Python

1

Entering edit mode

9.6 years ago

rimjhim.roy.ch ▴ 80

I am new to Python. I want to create a new GFF3 file from an existing file by filtering out features that are smaller than 1 kb.

Is there a way to parse a GFF3 file containing feature information of semicolon separated tags as well as start and end of the feature, and create a new file altogether?

My gff file looks like below:

seq_chr1    S-MART    match    158337    160567    .    -    .    Superfamily=LINE;Target=RIL-Map20 356 2619;ID=ms1_seq_chr1_RIL-Map20;Order=TE;Class=Unknown;Identity=93.9881;Name=ms1_seq_chr1_RIL-Map20

python parse gff3 • 12k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by rimjhim.roy.ch ▴ 80

Ram · Answer 1 · 2015-04-20

3

Entering edit mode

9.6 years ago

Ryan Dale 5.0k

As your analysis gets more complex, you might find gffutils useful for accessing GFF attributes. It has a fairly robust GFF/GTF parser. Here's an example that only keeps >1kb features that are in the LINE superfamily (as you suggest is your goal in a comment on another answer).

(edit: disclaimer, I'm the primary author)

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by Ryan Dale 5.0k

Ram · Answer 2 · 2015-04-20

This is a simple perl script to filter by size which I had lying around which could do that, but python would be just as simple to learn for the long run. I recommend www.pythonforbiologists.com

#!/usr/bin/perl
use strict;
use warnings;

my $usage = "## USAGE:  filter_csv.pl infile.txt [min value]\n";

my $infile = $ARGV[0] or die $usage;
my $filter = $ARGV[1] or die $usage;

open INFILE, "<", "$infile" or die "$usage | check $infile\n";

my $n=0;

while (my $line = <INFILE>){
    $n++;
    if ($n == 1){
        ## Don't process on header line (Maybe you don't need this bit)
        print $line;
    }else{
        my @l = split(/\t/, $line);
        if ($l[4] - $l[3] >= $filter){
            print $line;
        }
    }
}

You would run it like so:

./filter_csv.pl my_file.gff 1000

You could add in a filter for the right term by using a similar if clause and adding it in the size loop.

EDIT: For example (but not tested):

while (my $line = <INFILE>){
    $n++;
    if ($n == 1){
        ## Don't process on header line (Maybe you don't need this bit)
        print $line;
    }else{
       my @l = split(/\t/, $line);
       if ($l[4] - $l[3] >= $filter) && ($line ~= /$term/`){
            print $line;
        }
    }
}

score 1 · Answer 3 · 2015-04-20

1

Entering edit mode

9.6 years ago

Daniel Standage 4.1k

Since you asked for Python, this simple script should work.

import sys
for line in sys.stdin:
    line = line.rstrip()
    values = line.split("\t")
    start, end = int(values[3]), int(values[4])
    length = end - start + 1
    if length >= 1000:
        print line

It was not tested on a file containing comments or pragmas, but that should be easy to account for.

In your question, you mention semicolon-separated attributes. If you're trying to further process the data (rather than just filtering out lines < 1 kb), you'll need to be more specific about what you need.

ADD COMMENT • link 9.6 years ago by Daniel Standage 4.1k

0

Entering edit mode

Thanks for your reply. I implemented the method and it worked. Yes, I also need to keep everything with attribute SuperFamily=LINES and filter out everything else. I hoped there was an implementation with Biopython, but I guess I can work with re and search.

ADD REPLY • link 9.6 years ago by rimjhim.roy.ch ▴ 80

0

Entering edit mode

No need for regular expressions or the search function there: a simple change will suffice. Change if length >= 1000: to if length >= 1000 and "SuperFamily=LINES" in values[8]:.

ADD REPLY • link 9.6 years ago by Daniel Standage 4.1k