I am new to Python. I want to create a new GFF3 file from an existing file by filtering out features that are smaller than 1 kb.
Is there a way to parse a GFF3 file containing feature information of semicolon separated tags as well as start and end of the feature, and create a new file altogether?
As your analysis gets more complex, you might find gffutils useful for accessing GFF attributes. It has a fairly robust GFF/GTF parser. Here's an example that only keeps >1kb features that are in the LINE superfamily (as you suggest is your goal in a comment on another answer).
This is a simple perl script to filter by size which I had lying around which could do that, but python would be just as simple to learn for the long run. I recommend www.pythonforbiologists.com
#!/usr/bin/perl
use strict;
use warnings;
my $usage = "## USAGE: filter_csv.pl infile.txt [min value]\n";
my $infile = $ARGV[0] or die $usage;
my $filter = $ARGV[1] or die $usage;
open INFILE, "<", "$infile" or die "$usage | check $infile\n";
my $n=0;
while (my $line = <INFILE>){
$n++;
if ($n == 1){
## Don't process on header line (Maybe you don't need this bit)
print $line;
}else{
my @l = split(/\t/, $line);
if ($l[4] - $l[3] >= $filter){
print $line;
}
}
}
You would run it like so:
./filter_csv.pl my_file.gff 1000
You could add in a filter for the right term by using a similar if clause and adding it in the size loop.
EDIT: For example (but not tested):
while (my $line = <INFILE>){
$n++;
if ($n == 1){
## Don't process on header line (Maybe you don't need this bit)
print $line;
}else{
my @l = split(/\t/, $line);
if ($l[4] - $l[3] >= $filter) && ($line ~= /$term/`){
print $line;
}
}
}
ADD COMMENT
• link
updated 2.4 years ago by
Ram
44k
•
written 9.6 years ago by
Daniel
★
4.0k
Since you asked for Python, this simple script should work.
import sys
for line in sys.stdin:
line = line.rstrip()
values = line.split("\t")
start, end = int(values[3]), int(values[4])
length = end - start + 1
if length >= 1000:
print line
It was not tested on a file containing comments or pragmas, but that should be easy to account for.
In your question, you mention semicolon-separated attributes. If you're trying to further process the data (rather than just filtering out lines < 1 kb), you'll need to be more specific about what you need.
Thanks for your reply. I implemented the method and it worked. Yes, I also need to keep everything with attribute SuperFamily=LINES and filter out everything else. I hoped there was an implementation with Biopython, but I guess I can work with re and search.
No need for regular expressions or the search function there: a simple change will suffice. Change if length >= 1000: to if length >= 1000 and "SuperFamily=LINES" in values[8]:.
Thanks for your reply. I implemented the method and it worked. Yes, I also need to keep everything with attribute SuperFamily=LINES and filter out everything else. I hoped there was an implementation with Biopython, but I guess I can work with re and search.
No need for regular expressions or the search function there: a simple change will suffice. Change
if length >= 1000:
toif length >= 1000 and "SuperFamily=LINES" in values[8]:
.