Question

Gff to fasta: Sort and extract sub-feature sequences from fasta in python

2

Entering edit mode

9.6 years ago

rimjhim.roy.ch ▴ 80

I have a GFF annotation for my sequences where many protein sub-features are overlapping. I want to extract the longest of those sub-features (i.e. one sub-feature for each sequence).

For example, the following is how my gff file looks:

chrA01:14912487-14917603(-)_Name=Name1    Source1    repeat    6    5116    .    -    .    ID=repeat_region1
chrA01:14912487-14917603(-)_Name=Name1    Source1    inverted    6    7    .    -    .    Parent=repeat_region1
chrA01:14912487-14917603(-)_Name=Name1    Source1    LTR    6    5116    .    -    .    ID=LTR_retrotransposon1;Parent=repeat_region1;similarity=95.70;seq_number=666
chrA01:14912487-14917603(-)_Name=Name1    Source1    ltrs    6    302    .    -    .    Parent=LTR_retrotransposon1
chrA01:14912487-14917603(-)_Name=Name1    Source2    protein    535    835    1.7e-16    -    .    Parent=LTR_retrotransposon1;reading_frame=0;name=INT_crm
chrA01:14912487-14917603(-)_Name=Name1    Source2    protein    535    874    0    -    .    Parent=LTR_retrotransposon1;reading_frame=0;name=INT_reina
chrA01:14912487-14917603(-)_Name=Name1    Source2    protein    547    772    1.9e-28    -    .    Parent=LTR_retrotransposon1;reading_frame=0;name=INT_del
chrA01:14912487-14917603(-)_Name=Name1    Source2    protein    1812    2124    8.3e-41    -    .    Parent=LTR_retrotransposon1;reading_frame=1;name=RNaseH_galadriel
chrA01:14912487-14917603(-)_Name=Name1    Source2    protein    1812    2127    4.6e-15    -    .    Parent=LTR_retrotransposon1;reading_frame=1;name=RNaseH_v_clade
chrA01:14912487-14917603(-)_Name=Name1    Source2    protein    1812    2127    1.2e-16    -    .    Parent=LTR_retrotransposon1;reading_frame=1;name=RNaseH_athila
chrA01:14912487-14917603(-)_Name=Name1    Source1    ltrs    4815    5116    .    -    .    Parent=LTR_retrotransposon1
chrA01:14912487-14917603(-)_Name=Name1    Source1    inverted    5115    5116    .    -    .    Parent=repeat_region1
###
chrA01:18337410-18342821(-)_Name=Name2    Source1    repeat    1    5411    .    -    .    ID=repeat_region2
chrA01:18337410-18342821(-)_Name=Name2    Source1    inverted    1    2    .    -    .    Parent=repeat_region2
chrA01:18337410-18342821(-)_Name=Name2    Source1    LTR    1    5411    .    -    .    ID=LTR_retrotransposon2;Parent=repeat_region2;similarity=98.93;seq_number=794
chrA01:18337410-18342821(-)_Name=Name2    Source1    ltrs    1    374    .    -    .    Parent=LTR_retrotransposon2
chrA01:18337410-18342821(-)_Name=Name2    Source2    protein    427    1435    0    -    .    Parent=LTR_retrotransposon2;reading_frame=1;name=INT_crm
chrA01:18337410-18342821(-)_Name=Name2    Source2    protein    439    1390    1.8e-38    -    .    Parent=LTR_retrotransposon2;reading_frame=1;name=INT_TF
chrA01:18337410-18342821(-)_Name=Name2    Source2    protein    481    1435    0    -    .    Parent=LTR_retrotransposon2;reading_frame=1;name=INT_v_clade
chrA01:18337410-18342821(-)_Name=Name2    Source2    protein    1657    1990    3e-18    -    .    Parent=LTR_retrotransposon2;reading_frame=1;name=RNaseH_v_clade
chrA01:18337410-18342821(-)_Name=Name2    Source2    protein    1657    1990    6.8e-17    -    .    Parent=LTR_retrotransposon2;reading_frame=1;name=RNaseH_csrn1
chrA01:18337410-18342821(-)_Name=Name2    Source2    protein    1657    1990    2.7e-35    -    .    Parent=LTR_retrotransposon2;reading_frame=1;name=RNaseH_reina
chrA01:18337410-18342821(-)_Name=Name2    Source1    ltrs    5038    5411    .    -    .    Parent=LTR_retrotransposon2
chrA01:18337410-18342821(-)_Name=Name2    Source1    inverted    5410    5411    .    -    .    Parent=repeat_region2
###

I want to extract the longest or the best RNaseH regions from each sequence. I am able to parse the gff to get RNase annotations. But, I am not sure how to proceed next:

from gffutils.iterators import DataIterator
from Bio import SeqIO

input_filename = 'mygff.gff'
infasta = list(SeqIO.parse('myseqs.fasta', 'fasta'))
features=DataIterator(input_filename)

for feature in features:
    if "name=RNaseH" in str(feature):
        print str(feature)
        #check lengths
        #check e-values

python gff biopython gffutils gff2fasta • 5.1k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by rimjhim.roy.ch ▴ 80

Ram · Accepted Answer · 2015-05-01

Your question is a little confusing, but I'll try my best, hopefully we'll get somewhere. Firstly I wonder (just of curiosity) how did you come across this gffutils tool. GFF is a flat text file. One would usually just write a parser to get relevant info from the file. But this gffutil thing is actually pretty cool and since you want to get actual sequences as well, this tool should do it all for you.

Firstly, are you familiar with python dir function? It returns a list of valid attributes for that object. e.g dir("string") will return all methods available to a "string". You can also use ipython to tab expand methods.This is a good exploratory exercise of new methods, particular for new tools like this one, since little info available online. You can further call help("string".endswith) to get info on how to apply particular method.

Do this for your own knowledge:

import sys                                                                         
from gffutils.iterators import DataIterator                                        

myGFF = sys.argv[1]                                                             
myFasta = sys.argv[2]                                                           

for feature in DataIterator(myGFF):                                                                                                 
    print dir(feature)                                                          
    break

This should return a list like this:

['__class__', '__delattr__', '__dict__', '__doc__', '__eq__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__len__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', 'astuple', 'attributes', 'bin', 'calc_bin', 'chrom', 'dialect', 'end', 'extra', 'featuretype', 'file_order', 'frame', 'id', 'keep_order', 'score', 'seqid', 'sequence', 'sort_attribute_values', 'source', 'start', 'stop', 'strand]

Don't worry about things with double underscore! All other things you should experiment with.

I believe this is along the line what you are asking for

import sys                                                                                                              
from gffutils.iterators import DataIterator                                                                             

myGFF = sys.argv[1]                                                                                                     
myFasta = sys.argv[2]                                                                                                   

for feature in DataIterator(myGFF):                                                                                     
    if feature.featuretype == "RNaseH":                                                                                             
        print feature.seqid, feature.featuretype, len(feature), feature.sequence(myFasta)

This will return first column value (usually chromosome name), the feature type (in you case RNaseH), the length of that feature and the most awesome thing in my opinion is this last method feature.sequence() which returns the sequence fragment of you requested feature, using coordinates from the gff file. That is really cool I think.

I'm not sure how to help you with "the longest/ the best RHaseH reagion" thing since it something very specific to you, but you should be able to impose some sort of test on len(feature) to tease out your sequence of interest.

p.s also have a look at this post How to parse GFF3 file and get start and end coordinates in Python