Extracting Specific Regions In Genbank File
2
0
Entering edit mode
11.9 years ago
pixie@bioinfo ★ 1.5k

I have the following file format: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord#FeaturesA

and I want to extract the information contained under the CDS regions (there can be many CDS regions and they need not be consecutive)

CDS               <1..206
                 /codon_start=3
                 /product="TCP1-beta"
                 /protein_id="AAA98665.1"
                 /db_xref="GI:1293614"
                 /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                 AEVLLRVDNIIRARPRTANRQHM"
 gene            687..3158
                 /gene="AXL2"
 CDS             687..3158
                 /gene="AXL2"
                 /note="plasma membrane glycoprotein"
                 /codon_start=1
                 /function="required for axial budding pattern of S.
                 cerevisiae"
                 /product="Axl2p"
                 /protein_id="AAA98666.1"
                 /db_xref="GI:1293615"

I have already read the file and have kept each line in a list

l_cds=[]   #list to hold the info under cds
for index in range(len(list_eachline)):
    if list_eachline[index].startswith("CDS"):
        l_cds.append(list_eachline[index])

this is extracting only the first line of the CDS region..how do modify the code ?...thanks (I am not supposed to use BioPython)

python • 5.3k views
ADD COMMENT
0
Entering edit mode

Can we assume this to be a homework question, since you are "not supposed to use BioPython"?

ADD REPLY
0
Entering edit mode

Its just a part of a project..and it was suggested not to use BioPython because it would become too simple. I have coded for a similar thing in Perl. And its not that I am asking for a readymade code..I have checked through the StackOverflow as well and everywhre they talk about BioPython. Hence I asked it here !

ADD REPLY
5
Entering edit mode
11.9 years ago

Modify your code a little bit:

l_cds=[]   #list to hold the info under cds
isCDS = False
for index in range(len(list_eachline)):
    if list_eachline[index].startswith("CDS"):
        isCDS = True
    elif not list_eachline[index].startswith(" "):
        isCDS = False
    if isCDS:
        l_cds.append(list_eachline[index])
ADD COMMENT
0
Entering edit mode

Thanks so much !

ADD REPLY
1
Entering edit mode
11.9 years ago

If your Genbank files are straight forward (i.e. not linking to other Genbank entries) you can use Biopieces www.biopieces.org) to create a table:

read_genbank -i test.gb -k AC -f CDS | write_tab -k ACCESSION,S_BEG,S_END,STRAND,SEQ,TRANSLATION -x
ADD COMMENT
0
Entering edit mode

thank you for the suggestion...but I am just curious to know if there is a way to do it with just Python

ADD REPLY
0
Entering edit mode

I am sure Biopython can do it. Read the docs :o). If you want to roll your own GenBank parser: good luck.

ADD REPLY

Login before adding your answer.

Traffic: 2314 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6