Extracting CDS using Python
0
0
Entering edit mode
4.0 years ago

I have a large dataset of *.gb files, and I'm using Bio.SeqIO.parse() to extract the CDS features and write the sequences in multi-FASTA format, but it takes too much time. Is there a faster way to get that information using Python? I need the CDS locus tag, sequence, and annotation. It took almost an hour to get this information from 20 files.

genome Python Gbk • 3.7k views
ADD COMMENT
0
Entering edit mode

Can you share your code?

How big are your files?

From my experience, in some gb files the protein sequence of each CDS is stored directly in the feature as a translation qualifier. If that is your case, then you can just read the qualifier, which should be blazing fast once the SeqRecord is parsed. If not, then the translation must be done for each CDS.
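
For example, something along these lines reads the protein sequence straight from the translation qualifier without re-translating (a minimal sketch; the file name is a placeholder):

    from Bio import SeqIO

    # Sketch: pull CDS protein sequences directly from the translation qualifier.
    for record in SeqIO.parse("example.gbk", "genbank"):
        for feature in record.features:
            if feature.type == "CDS" and "translation" in feature.qualifiers:
                locus_tag = feature.qualifiers["locus_tag"][0]
                protein = feature.qualifiers["translation"][0]
                print(f">{locus_tag}\n{protein}")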

Either way, we are not clairvoyants here, so please, if you can, share your code and example data (one sequence suffices); otherwise, this is just guesswork.

ADD REPLY
0
Entering edit mode

Yes, sure. My files are 9-12 MB each. Code:

from Bio import SeqIO

def open_gbk(seq_file):
    sequence_list = []
    for seq_record in SeqIO.parse(seq_file, 'genbank'):
        for feature in seq_record.features:
            if feature.type == "CDS":
                feature_id = feature.qualifiers['locus_tag'][0]
                feature_product = feature.qualifiers['product'][0]
                feature_translation = feature.qualifiers['translation'][0]
                # Extract the CDS nucleotide sequence from the record
                feature_sequence = feature.location.extract(seq_record).seq
                # Default to "-" so a missing or non-COG db_xref does not
                # carry over the value from the previous CDS
                feature_cog = "-"
                for xref in feature.qualifiers.get('db_xref', []):
                    if xref.split(":")[0] == "COG":
                        feature_cog = xref.split(":")[1]
                sequence_list.append(sequence(feature_id, feature_sequence,
                                              feature_product, feature_translation,
                                              feature_cog))
    return sequence_list

example data: Download Link

ADD REPLY
0
Entering edit mode

Before this code, I'm using a class to store the values, as follows:

class sequence:
    def __init__(self, seq_id, seq, seq_product, seq_translation, cog):
        self.id = seq_id
        self.seq = seq
        self.product = seq_product
        self.translation = seq_translation
        self.cog = cog
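
A minimal sketch of how a function like this could be wired up to write the multi-FASTA output (the glob pattern and output file name below are placeholders, not from the original code):

    import glob

    # Sketch: run the parser over all GenBank files and write one multi-FASTA.
    records = []
    for gbk in glob.glob("*.gb"):
        records.extend(open_gbk(gbk))

    with open("all_cds.fasta", "w") as out:
        for rec in records:
            out.write(f">{rec.id} {rec.product} COG:{rec.cog}\n{rec.seq}\n")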
ADD REPLY
1
Entering edit mode

OK, I've looked into your code, and the single most expensive line is this one:

feature_sequence = feature.location.extract(seq_record).seq

You don't specify whether you want the DNA or the protein sequence for your CDS, and originally I thought you wanted the protein sequence. If that is the case, then just remove this line and you should be fine.

From what I know, the creation of Seq and SeqRecord objects is expensive in Biopython (they are, however, powerful), and what you do in the extract is create a new object for each gene.

If you really need to make this fast, then move from these objects to strings: create a string from the genome sequence and extract from that. You will, however, need to handle the reverse complement yourself, and maybe introns, if you need to worry about them.
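
A rough illustration of that idea (a minimal sketch assuming simple single-exon CDS locations; the file name is a placeholder, and compound/join locations would need extra handling):

    from Bio import SeqIO

    # Sketch: extract CDS nucleotide sequences by slicing a plain string
    # instead of calling feature.location.extract() on the SeqRecord.
    COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

    for record in SeqIO.parse("example.gbk", "genbank"):
        genome = str(record.seq)               # one string per record
        for feature in record.features:
            if feature.type != "CDS":
                continue
            start, end = int(feature.location.start), int(feature.location.end)
            cds = genome[start:end]             # plain slice, no new Seq objects
            if feature.location.strand == -1:   # handle the reverse strand ourselves
                cds = cds.translate(COMPLEMENT)[::-1]
            locus_tag = feature.qualifiers["locus_tag"][0]
            print(f">{locus_tag}\n{cds}")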

ADD REPLY
0
Entering edit mode

I've solved this problem by writing my own parser using Python's native string manipulation, and my run time is now around 2 minutes. If you want to check it out, you can access it here: https://github.com/Mxrcon/Biopytools/blob/master/pepnucfunction.py Do you think the first parse is slow, or just the extract? Thank you for the responses!

ADD REPLY
0
Entering edit mode

Hi, good for you, and good job creating a gbk parser. Although the Biopython parsers are not very fast, they are not the worst either. As I wrote, the extract and the creation of the Seq and SeqRecord objects is what is slow here.
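
If you want to verify this on your own files, one quick way is to time the two stages separately (a minimal sketch; the file name is a placeholder):

    import time
    from Bio import SeqIO

    t0 = time.perf_counter()
    records = list(SeqIO.parse("example.gbk", "genbank"))    # stage 1: parsing
    t1 = time.perf_counter()
    for record in records:                                   # stage 2: extract
        for feature in record.features:
            if feature.type == "CDS":
                _ = feature.location.extract(record).seq
    t2 = time.perf_counter()
    print(f"parse: {t1 - t0:.2f}s  extract: {t2 - t1:.2f}s")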

ADD REPLY
