Question

Extracting Specific Regions In Genbank File

0

Entering edit mode

11.9 years ago

pixie@bioinfo ★ 1.5k

I have the following file format: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord#FeaturesA

and I want to extract the information contained under the CDS regions (there can be many CDS regions and they need not be consecutive)

CDS               <1..206
                 /codon_start=3
                 /product="TCP1-beta"
                 /protein_id="AAA98665.1"
                 /db_xref="GI:1293614"
                 /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                 AEVLLRVDNIIRARPRTANRQHM"
 gene            687..3158
                 /gene="AXL2"
 CDS             687..3158
                 /gene="AXL2"
                 /note="plasma membrane glycoprotein"
                 /codon_start=1
                 /function="required for axial budding pattern of S.
                 cerevisiae"
                 /product="Axl2p"
                 /protein_id="AAA98666.1"
                 /db_xref="GI:1293615"

I have already read the file and have kept each line in a list

l_cds=[]   #list to hold the info under cds
for index in range(len(list_eachline)):
    if list_eachline[index].startswith("CDS"):
        l_cds.append(list_eachline[index])

this is extracting only the first line of the CDS region..how do modify the code ?...thanks (I am not supposed to use BioPython)

python • 5.3k views

ADD COMMENT • link updated 11.9 years ago by Andrzej Zielezinski 11k • written 11.9 years ago by pixie@bioinfo ★ 1.5k

0

Entering edit mode

Can we assume this to be a homework question, since you are "not supposed to use BioPython"?

ADD REPLY • link 11.9 years ago by Neilfws 49k

0

Entering edit mode

Its just a part of a project..and it was suggested not to use BioPython because it would become too simple. I have coded for a similar thing in Perl. And its not that I am asking for a readymade code..I have checked through the StackOverflow as well and everywhre they talk about BioPython. Hence I asked it here !

ADD REPLY • link 11.9 years ago by pixie@bioinfo ★ 1.5k

score 5 · Answer 1 · 2013-01-28

5

Entering edit mode

11.9 years ago

Andrzej Zielezinski 11k

Modify your code a little bit:

l_cds=[]   #list to hold the info under cds
isCDS = False
for index in range(len(list_eachline)):
    if list_eachline[index].startswith("CDS"):
        isCDS = True
    elif not list_eachline[index].startswith(" "):
        isCDS = False
    if isCDS:
        l_cds.append(list_eachline[index])

ADD COMMENT • link 11.9 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

Thanks so much !

ADD REPLY • link 11.9 years ago by pixie@bioinfo ★ 1.5k

score 1 · Answer 2 · 2013-01-27

1

Entering edit mode

11.9 years ago

Martin A Hansen 3.0k

If your Genbank files are straight forward (i.e. not linking to other Genbank entries) you can use Biopieces www.biopieces.org) to create a table:

read_genbank -i test.gb -k AC -f CDS | write_tab -k ACCESSION,S_BEG,S_END,STRAND,SEQ,TRANSLATION -x

ADD COMMENT • link 11.9 years ago by Martin A Hansen 3.0k

0

Entering edit mode

thank you for the suggestion...but I am just curious to know if there is a way to do it with just Python

ADD REPLY • link 11.9 years ago by pixie@bioinfo ★ 1.5k

0

Entering edit mode

I am sure Biopython can do it. Read the docs :o). If you want to roll your own GenBank parser: good luck.

ADD REPLY • link 11.9 years ago by Martin A Hansen 3.0k