How to pull gene identifier from genbank file in Biopython
1
1
Entering edit mode
9.3 years ago
mdelow ▴ 20

Hello!

I am very new to the world of python (or any coding), but have installed biopython which I believe will help me do what I am trying to accomplish. I am working with RNAseq data for my project, and wanting to run my genelists through DAVID to look at enrichment. The output of the RNAseq pipeline thru bowtie/tophat/cuffdiff, gives me log fold increases for each gene, with the chromosome identifier of SMaxxxxx. That would be fine, but DAVID does not seem to be able to handle the fact that my organism has 3 replicons (1 chromosome and 2 megaplasmids), that are SMa, SMb and SMc. When I try to use these identifiers as the gene list, DAVID is only able to find 1/3 (ie. one replicon) and identify them as genes. I was playing around and it seems like it can identify genes from all replicons if I use the gene identifier such as lacI, manB, etc. In the genbank file the gene notation I want is labelled: gene="xxxX"

I want to do two things:

  1. Go through each replicons genbank file and pull out the gene identifier for each gene in the 4 letter format
  2. Feed in a list of 150 genes in the SMaxxxx format, and have it output the 4 letter gene identifier for each gene

So far I have been able to read in a genbank file, but not pull out the information I want.

#to read in list of genes I want to pull out
with open('synthgenelistfordavid.txt') as f:
    content = f.readlines()

wanted = [line.rstrip('\n') for line in open('synthgenelistfordavid.txt')]

#then I've gotten this far, import file and set up a for loop

from Bio import SeqIO
genbank_file = "sma.gbk"

for record in SeqIO.parse("sma.gbk", "genbank"):
genbank parsing biopython • 9.2k views
ADD COMMENT
0
Entering edit mode

Hi mdelow,

For the first question : You can use BioPython to parse a genBank file. You have to know that, with BioPython if you want to have an access to a specific part of the file, you have to read the CookBook to find exactly what are you looking for and its name in the documentation.

An alternative can be, if you have a "FLAG" to localize your information in the file, it is to develop you own Python Script. In you case, what are you interested in is the flag \gene="XXX". So, with the if my_pattern in current_line of python, you can try it.

Warning : If you have various genes in the file, you need to take it in account! ;)

For instance, in the CookBook, part 4.2.3, you have an example of the records objects attributes and what they are containing. I think not that exists an attribute "gene_name" with the "XXX".

For the second one, can you please clarify the point?

ADD REPLY
0
Entering edit mode

Doing some more reading, what I want is the gene="XXX", the identifier I have is the locus_tag. For my organism, the locus_tag is in the format SMaXXXX - but the name of that identifier is locus_tag - sorry I hadn't figured that out earlier.

So what I am trying to do in both cases is use a list of locus tags and loop thru every coding sequence in the genbank file.

The list for #2 is roughly 150 locus tags, the list for #1 is more like 6,000 locus tags (every gene in the genome).

So for #2, if the locus_tag in each CDS matches a locus_tag in my list, I want it to print that locus_tag and the corresponding gene identifier.

Same thing for #1, but the reality is in that case, that it is just all the genes. I was thinking I could use a list of all the genes to keep the same "list" approach to the problem, but another approach would be to ask biopython to print the locus_tag and corresponding gene ID for every CDS in the genbank file.

My genbank file only has one LOCUS, but many coding sequences (CDS).

Here is an example of some CDS side by side in the file. All CDS have a locus_tag, not all CDS have a gene name, because many genes in the genome are not identified genes, they just code for "hypothetical proteins". For the downstram process of using the DAVID tool, I can't use hypothetical proteins anyways, so only the locus_tags with gene names are useful.

CDS             complement(12259..12546)

                     /locus_tag="SMa5000"
                     /codon_start=1
                     /protein_id="ACF07957.1"
                     /translation="MSSNVSIISPWNMEFIRSVLGEAGYNAKLLVDDPRSFETAELLV
                     TKLFLAGEDSRPGLAAKLECQLGKAGGYRRPPGSSLARYAIQGLPSELQFS"
                     /product="hypothetical protein"
                     /transl_table=11
     repeat_region   12658..13716
                     /locus_tag="SMa3000"
                     /mobile_element="insertion:ISRm2011-2/ISRm11"
                     /note=" predicted by homology"
     gene            12754..13161
                     /locus_tag="SMa0018"
     CDS             12754..13161
                     /locus_tag="SMa0018"
                     /codon_start=1
                     /protein_id="AAK64667.1"
                     /translation="MARPFSNDLRERVVDAVTGEGLSCRAAAKRFGIGISTAIDWVRR
                     FRETGSAAPGQMGGHKPRKLSGPHRAWLLCRCRERDFTLHGLVAELSERGLKVDYRAV
                     WTFVHEEGLSYKKRRWSPANGSGPTSPATGHDG"
                     /product="TRm2011-2a transposase"
                     /transl_table=11

gene            9252..11252
                     /locus_tag="SMa0015"
                     /gene="selB"
     CDS             9252..11252
                     /locus_tag="SMa0015"
                     /protein_id="AAK64665.1"
                     /gene="selB"
                     /transl_table=11
                     /codon_start=1
                     /product="Selenocysteine-specific elongation factor"
                     /translation="MIVGTAGHIDHGKTTLVKALTGVDTDRLKEEKARGITIDLGFAY
                     ARFAKDAVTGFVDVPGHERFIHTMLAGAGGIDYAMLVVAADDGIKPQTLEHLAILDLL
                     GVSRGLVAITKADLADPARLENLTDEIGAVLSSTSLRDAEILPVSVAAGQGIELLKER
                     LAAAECATAASAVGGRFRLAVDRSFTLSGAGTVVTGTVLSGSVGVGDQVTVSPAGRSA
                     RVRSIHAQNQRAERGFAGQRCALNLAGEGISKDAITRGDMVVDPHLHAPSDRLDADLS
                     VLESETKPIGEWFSARFHHASAETGIRIVPLEGPLLPGERRRVQLVLDRPIAAAVGDR
                     FILRDVSARRTIGGGRLLDLRAPARKRRSPERLSYLQAASLSHAGEALAALLDVPPFL
                     VDLDVFARDRALSEAELQNAIRSASAEVIEGSAVRHALSKRQRAAFSDEVQRVLSAFH
                     VENPDLQGIGRERLRLQVTPRLPPPAFLVALRAEQTAGRLVLEGAFVRLPGHEVRLSE
                     KEEELYARILPHLEGEERFRPPRVRDFAEALGVDEREIRRILKLCARLGRVDQIRHDH
                     FFTRQTTAEMVAIIRQVAANAERGEFSAGLFRDRVNNGRKVAIEILEFFDRQGVTIRH
                     GDVRRVNPHRLDLYEGPVPEADEGRGSSPVERPDFKFYRAGS"

Is this making more sense?

ADD REPLY
0
Entering edit mode

I solved it by doing this:

with open('synthgenelistfordavid.txt') as f:
    content = f.readlines()

wanted = [line.rstrip('\n') for line in open('synthgenelistfordavid.txt')]

file=open("smcgenes.csv", "w")

from Bio import SeqIO
genbank_file = "smc.gbk" 
for record in SeqIO.parse("smc.gbk", "genbank"):
    for f in record.features:
    if f.type == "gene" and "locus_tag" in f.qualifiers:
        locus_tag = f.qualifiers["locus_tag"][0]
        if locus_tag in wanted:
        try:
            out1 = f.qualifiers["locus_tag"][0] + "," + f.qualifiers["gene"][0] + "\n"
            file.write(out1)
        except:
            out2 = f.qualifiers["locus_tag"][0] + ", no gene name" + "\n"
            file.write(out2)
ADD REPLY
0
Entering edit mode
9.3 years ago
Peter 6.0k

I think you would find this helpful, specifically the last section http://www.warwick.ac.uk/go/peter_cock/python/genbank/

Try something like this for map from the locus tags to the gene qualifiers:

from Bio import SeqIO
filename = "NC_005816.gb"
locus_to_gene = dict()
for record in SeqIO.parse(filename, "genbank"):
    for f in record.features:
        if f.type == "CDS":
            if "gene" in f.qualifiers:
                if "locus_tag" in f.qualifiers:
                    genes = f.qualifiers["gene"]
                    locus_tags = f.qualifiers["locus_tag"]
                    assert len(genes) == 1, genes
                    assert len(locus_tags) == 1, locus_tags
                    locus_to_gene[locus_tags[0]] = genes[0]
print("Mapped %i locus tags to genes" % len(locus_to_gene))
ADD COMMENT
0
Entering edit mode

I had already read this. I see that it's somewhat related, but do not know how to use that to accomplish what I am looking for.

ADD REPLY
0
Entering edit mode

I've expanded on my answer based on the discussion above.

ADD REPLY
0
Entering edit mode

If I am understanding this correctly, it would be printing a list of all of the locus tags and corresponding gene names in the genbank file?

ADD REPLY
0
Entering edit mode

This returns: Mapped 0 locus tags to genes

When I run this:

from Bio import SeqIO
filename = "sma.gbk"
locus_to_gene = dict()
for record in SeqIO.parse(filename, "genbank"):
    for f in record.features:
        if f.type == "CDS":
            if "gene" in f.qualifiers:
                if "locus_tag" in f.qualifiers:
                    genes = f.qualifiers["gene"]
                    locus_tags = f.qualifiers["locus_tag"]
                    assert len(genes) == 1, genes
                    assert len(locus_tags) == 1, locus_tags
                    locus_to_gene[locus_tags[0]] = genes[0]

print("Mapped %i locus tags to genes" % len(locus_to_gene))
ADD REPLY
0
Entering edit mode

Strange. Can you email me your GenBank file? My GoogleMail address in on my BioStars profile.

ADD REPLY
0
Entering edit mode

Thanks for your reply! It helps me a lot. Just a small question: my genbank file has "locus_tag" section but why biopython can not get access to it? A example:

     gene            complement(3015679..3037424)
                     /gene=ENSDARG00000043854
                     /locus_tag="ppil4"
                     /note="peptidylprolyl isomerase (cyclophilin)-like 4
                     [Source:ZFIN;Acc:ZDB-GENE-030131-6251]"
     mRNA            join(complement(3037328..3037424),
                     complement(3036319..3036386),complement(3035701..3035765),
                     complement(3035066..3035183),complement(3031417..3031559),
                     complement(3029838..3029934),complement(3028920..3029036),
                     complement(3028691..3028815),complement(3025931..3025997),
                     complement(3023570..3023681),complement(3020984..3021080),
                     complement(3020081..3020198),complement(3015679..3016986))
                     /gene="ENSDARG00000043854"
                     /note="transcript_id=ENSDART00000064390"
     CDS             join(complement(3037328..3037397),
                     complement(3036319..3036386),complement(3035701..3035765),
                     complement(3035066..3035183),complement(3031417..3031559),
                     complement(3029838..3029934),complement(3028920..3029036),
                     complement(3028691..3028815),complement(3025931..3025997),

When I try to use f.qualifiers["locus_tag"], there is KeyError. How can I get /locus_tag="ppil4" by using biopython? The GenBank file I use: ftp://ftp.ensembl.org/pub/release-79/genbank/danio_rerio/Danio_rerio.Zv9.79.chromosome.20.dat.gz

ADD REPLY
0
Entering edit mode

The if "locus_tag" in f.qualifiers: condition should avoid the KeyError, see also Biopython: work with loops and qualifiers from multiple records

ADD REPLY

Login before adding your answer.

Traffic: 1810 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6