Question

How to extract all gene nucleotide sequences separately from multiple Genbank files, with the form ">gene_name organism_name?

0

Entering edit mode

3.8 years ago

jbt38 • 0

I have multiple Genbank files in one file, and wish to extract all the gene sequences for alignment.

I.e. if one Genbank file (for genus species1) contains gene1 & gene2, and another file (for genus species2) contains gene1 and gene8, I'm looking for an output of:

>gene1 genus species1
agtc...
>gene2 genus species1
agtc...
>gene1 genus species2
agtc...
>gene8 genus species2
agtc...

I have found many solutions which are alomost what I want but not quite e.g. the python code here https://www.researchgate.net/post/How_can_I_parse_a_GenBank_file_to_retrieve_specific_gene_sequences_with_IDs. But this returns the protein sequence, and you have to specify the gene exactly.

sequence alignment • 1.5k views

ADD COMMENT • link 3.8 years ago by jbt38 • 0

0

Entering edit mode

Please post example input and expected output from example input. In the mean time, try using following code this is the code for both nucleotide and aa. Edit the code as per your convenience:

#! /usr/bin/env python3
#import os
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq


_NTSequences = []
_AASequences = []

# For writing nucleotide sequences
for rec in SeqIO.parse("test.gb", "genbank"):
    id = "{}:{}{}".formatrec.id, rec.features[2].location, rec.description)
    sequence = SeqRecord(rec.seq, id=id, description="")
    _NTSequences.append(sequence)
    for feature in rec.features:
        if feature.type == "CDS":
            aaseq = Seq(*feature.qualifiers["translation"])
            aaid = "{}:{} {}".format"("rec.id, feature.location, rec.description)
            aasequence = SeqRecord(aaseq, id=aaid, description="")
            _AASequences.append(aasequence)


SeqIO.write(_NTSequences, "testNT.fa", 'fasta')
SeqIO.write(_AASequences, "testAA.fa", 'fasta')

ADD REPLY • link 3.8 years ago by cpad0112 21k

0

Entering edit mode

@cpad0112, Please help me to use your script. I mean I have used the command to execute your code as shown below (I have changed my genbank file name as test.gb as you mentioned in your code and named your python script as scri.py),

python3 scri.py

but end up with error as follows,

ile "scri.py", line 13
    id = "{}:{}{}".formatrec.id, rec.features[2].location, rec.description)
                                                                          ^
SyntaxError: invalid syntax

ADD REPLY • link 3.8 years ago by Kumar ▴ 120

0

Entering edit mode

After format, please add (. It is missing. @ Kumar

ADD REPLY • link 3.8 years ago by cpad0112 21k

0

Entering edit mode

Thank you @ cpad0112 I have revised (added ( after format) your script as you mentioned, however I am getting another error as follows

ga@ga214:~/Documents/data/xap98/gff_gbk_dna_protein/test$ python3 scri.py 
Traceback (most recent call last):
  File "scri.py", line 13, in <module>
    id = "{}:{}{}".formatrec.id, rec.features[2].location, rec.description)
IndexError: list index out of range

ADD REPLY • link 3.8 years ago by Kumar ▴ 120

0

Entering edit mode

without example input file, it is difficult to trouble shoot the issue. I tested the script locally with https://www.ncbi.nlm.nih.gov/nuccore/MK153192.1 gb file and both aa and nt sequences were written to hard disk.

See if following simple code works out:

#! /usr/bin/env python3
#import os
from Bio import SeqIO
import sys

for rec in SeqIO.parse("sequence.gb", "genbank"):
    print(">{0} {1}\n{2}".formatrec.id,rec.description,rec.seq))
    SeqIO.write(rec, sys.stdout, 'fasta')

Please add ( after format.

Print should print the NT sequence in flattened format and seqIO write should print out fasta in system paging format. Both print and seqIO would print the same information.

ADD REPLY • link 3.8 years ago by cpad0112 21k

0

Entering edit mode

Input example of only one Genbank files. Imagine there are loads. These are in a single file:

LOCUS       MK153192                1734 bp    DNA     linear   PLN 16-OCT-2019
DEFINITION  Tacca palmata voucher SAR:TA56 tRNA-Lys (trnK) gene, intron; and
            maturase K (matK) gene, complete cds; chloroplast.
ACCESSION   MK153192
VERSION     MK153192.1
KEYWORDS    .
SOURCE      chloroplast Tacca palmata
  ORGANISM  Tacca palmata
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliopsida; Liliopsida; Dioscoreales; Taccaceae;
            Tacca.
REFERENCE   1  (bases 1 to 1734)
  AUTHORS   Wong,S.Y. and Chua,K.S.
  TITLE     Phylogeny of Tacca (Taccaceae) and traits in reproductive
            structures, with description of a new Bornean species
  JOURNAL   Biodiversitas 20 (11), 3096-3118 (2019)
REFERENCE   2  (bases 1 to 1734)
  AUTHORS   Chua,K.S.
  TITLE     Direct Submission
  JOURNAL   Submitted (08-NOV-2018) Department of Plant Science and
            Environmental Ecology, Faculty of Resource Science and Technology,
            UNIMAS, Jalan Datuk Mohammad Musa, Kota Samarahan, Sarawak 94300,
            Malaysia
FEATURES             Location/Qualifiers
     source          1..1734
                     /organism="Tacca palmata"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /specimen_voucher="SAR:TA56"
                     /db_xref="taxon:167568"
                     /PCR_primers="fwd_name: matk_19f, fwd_seq:
                     cgttctgaccatattgcactatg, fwd_name: matk_390f, fwd_seq:
                     cgatctattcattcaatatttc, rev_name: matk_1326r, rev_seq:
                     tctagcacacgaaagtcgaagt, rev_name: matk_2r, rev_seq:
                     aactagtcggatggagtag"
     gene            <1..>1734
                     /gene="trnK"
                     /note="tRNA-Lys"
     intron          <1..>1734
                     /gene="trnK"
     gene            1..1539
                     /gene="matK"
     CDS             1..1539
                     /gene="matK"
                     /codon_start=1
                     /transl_table=11
                     /product="maturase K"
                     /protein_id="QFP40046.1"
                     /translation="MEEFRGYLEKDGSRQRHFLFLYPLLFQEYIYALAYDHGLNGSIF
                     AEIFYYDNKSSSVLVKHLITRIYQQNYLIYSVDNSSQNRIVGHSNYFYSQMIMMSEGF
                     VLIMEIPFLLRLISSLEEKEIPKSQNLRSIHSIFPFLEDKLTHLTYVSDIVIPHPIHL
                     EILVQILQCWTQDVPSLHLLRFFLHEYPNWNSSITPKKSIYAISKENKRFFRFLYNSY
                     VFECEFFLVFFRKQSSYLRLTSSVAFLERTYFYGKIEHFILACLNYFQENLWFFKEPF
                     MHYVRYQGKVILSSKGTHLRMRKWRSYLVNFWQYYFQYWSQPHRIHINLLSNNSFCFW
                     GYLSSVLINHSVVRNQMLENFFLMDTVTKKFDTIVSVIPLIRSLSKAKFCTVCGHPIS
                     KPIWAHLSDWDIIYRFGRIYRNFSHYHSGSSKKKILYRIKYILRISCARTLARKHKST
                     SRAFLQRLGSGLLEEFFTEEEQILSLILPQVPFRSHTGRIWYLDIICININDLWNPSF
                     HDRA"
ORIGIN      
        1 atggaagaat tccgaggata tttagaaaaa gatggatcta ggcaaaggca cttcctattc...  (character limit)
//

Output. I added another species to show what I need but had to observe the character limit:

>trnK Musa rosea
acgt...
>matK Musa rosea
acgt...
>trnK Tacca palmata
acgt...
>matK Tacca palmata
acgt...

I will sort these later and align them for phylogenetic analysis

ADD REPLY • link 3.8 years ago by jbt38 • 0