Hi, I'm trying to extract gene location information for certain genes across multiple bacteria. Currently I'm using this set up to retrieve the gene information:
from Bio import SeqIO, Entrez
Entrez.email = 'my@email.com'
# example is E. coli K-12 reference sequence
handle = Entrez.efetch(db="nuccore", id='NC_000913', rettype='fasta_cds_na', retmode="text")
record = SeqIO.parse(handle, "fasta")
for gene in record:
parse_location(gene.description)
with parse_location()
being a function I wrote myself in order to extract the start and end point of a gene on the sequence.
It is my understanding that with the use of SeqIO.parse
each gene
in the record
comes with features, including features for the start and end point of the gene. Below is the output of a random gene in NC_000913
:
ID: lcl|NC_000913.3_cds_NP_418812.1_4306
Name: lcl|NC_000913.3_cds_NP_418812.1_4306
Description: lcl|NC_000913.3_cds_NP_418812.1_4306 [gene=ytjC] [locus_tag=b4395] [db_xref=UniProtKB/Swiss-Prot:P0A7A2] [protein=putative phosphatase] [protein_id=NP_418812.1] [location=4633797..4634444] [gbkey=CDS]
Number of features: 0
Seq('ATGTTACAGGTATACCTAGTCCGCCACGGTGAAACGCAGTGGAACGCCGAGCGA...TAA')
My parse_location()
function works okay, but I'm worried about edge cases. Is there anyway to have biopython do it for me?
Within biopython's object model, features have a specific location start and end attribute.
If you take a look at around line 40 in the github gist here: Slicing Genbank File by 'gene_id' range
you can see the syntax at work.
Regarding edge cases, biopython should catch most mangled files or dodgy features, so you probably don't have to worry too much as long as you're using standard formats. Errors in the input files would be the most likely issue I think.
Depending on the feature type (e.g. CDS etc) you may find the locations are 'split' if they happen to span the end of a sequence, so that's a possible edge case to be on the lookout for, but biopython should still handle it fairly gracefully.
Thanks, I knew this was done by biopython somewhere somehow, but it turns out I was using the wrong rettype. I tried
'fasta_cds_na'
and this does get you all information on the genes on the sequence, but does not get parsed like in your example, but instead gives you what I got: zero features. I should have usedrettype='gbwithparts'
in order to get all information on for each gene on the requested sequence. Rettype'gb'
only gets you some information about the sequence, but not the genes.