How to parse Genbank features "/coded_by="
4.7 years ago
supertech ▴ 180

How do I parse the /coded_by= qualifier under FEATURES -> CDS in a GenBank entry? In my example it is:

/coded_by="AF180730.1:21..3083"

I need to grab the accession number and manipulate the coordinates for each GenBank record. A solution in Perl, Python, or any other language is fine, though I prefer Python. Thanks.

LOCUS       AF180730_1              1020 aa            linear   INV 07-NOV-1999
DEFINITION  RNA interference promoting factor RDE-1 [Caenorhabditis elegans].
ACCESSION   AAF06159
VERSION     AAF06159.1
DBSOURCE    accession AF180730.1
KEYWORDS    .
SOURCE      Caenorhabditis elegans
  ORGANISM  Caenorhabditis elegans
            Eukaryota; Metazoa; Ecdysozoa; Nematoda; Chromadorea; Rhabditida;
            Rhabditoidea; Rhabditidae; Peloderinae; Caenorhabditis.
REFERENCE   1  (residues 1 to 1020)
  AUTHORS   Tabara,H., Sarkissian,M., Kelly,W.G., Fleenor,J., Grishok,A.,
            Timmons,L., Fire,A. and Mello,C.C.
  TITLE     The rde-1 gene, RNA interference, and transposon silencing in C.
            elegans
  JOURNAL   Cell 99 (2), 123-132 (1999)
   PUBMED   10535731
REFERENCE   2  (residues 1 to 1020)
  AUTHORS   Tabara,H., Sarkissian,M., Kelly,W.G., Grishok,A., Timmons,L.,
            Fire,A. and Mello,C.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-AUG-1999) Medical School, Program in Molecular
            Medicine, University of Massachusetts, 373 Plantation Street,
            Worcester, MA 01605, USA
COMMENT     Method: conceptual translation supplied by author.
FEATURES             Location/Qualifiers
     source          1..1020
                     /organism="Caenorhabditis elegans"
                     /db_xref="taxon:6239"
                     /chromosome="V"
                     /clone="yk296b10"
     Protein         1..1020
                     /product="RNA interference promoting factor RDE-1"
                     /name="similar to Caenorhabditis elegans K08H10.7"
     Region          280..419
                     /region_name="PAZ_argonaute_like"
                     /note="PAZ domain, argonaute_like subfamily. Argonaute is
                     part of the RNA-induced silencing complex (RISC), and is
                     an endonuclease that plays a key role in the RNA
                     interference pathway. The PAZ domain has been named after
                     the proteins Piwi,Argonaut, and Zwille; cd02846"
                     /db_xref="CDD:239212"
     Site            order(349,364,381,385,402,409,411)
                     /site_type="other"
                     /note="nucleic acid-binding interface [nucleotide
                     binding]"
                     /db_xref="CDD:239212"
     Region          467..982
                     /region_name="Piwi_ago-like"
                     /note="Piwi_ago-like: PIWI domain, Argonaute-like
                     subfamily. Argonaute is the central component of the
                     RNA-induced silencing complex (RISC) and related
                     complexes. The PIWI domain is the C-terminal portion of
                     Argonaute and consists of two subdomains, one of...;
                     cd04657"
                     /db_xref="CDD:240015"
     Site            order(632,636,648..651,654,671,674,678,682)
                     /site_type="other"
                     /note="5' RNA guide strand anchoring site"
                     /db_xref="CDD:240015"
     Site            order(718,720,801,974)
                     /site_type="active"
                     /db_xref="CDD:240015"
     CDS             1..1020
                     /gene="rde-1"
                     /coded_by="AF180730.1:21..3083"
ORIGIN      
        1 mssnfpelek gfyrhsldpe mkwlarptgk cdgkfyekkv lllvnwfkfs skiydreyye
       61 yevkmtkevl nrkpgkpfpk kteipipdra klfwqhlrhe kkqtdfiled yvfdekdtvy
      121 svcrlntvts kmlvsekvvk kdsekkdekd lekkilytmi ltyrkkfhln fsrenpekde
      181 eanrsykflk nvmtqkvrya pfvneeikvq faknfvydnn silrvpesfh dpnrfeqsle
      241 vaprieawfg iyigikelfd gepvlnfaiv dklfynapkm slldyllliv dpqscnddvr
      301 kdlktklmag kmtirqaarp rirqllenlk lkcaevwdne msrlterhlt fldlceensl
      361 vykvtgksdr grnakkydtt lfkiyeenkk fiefphlplv kvksgakeya vpmehlevhe
      421 kpqryknrid lvmqdkflkr atrkphdyke ntlkmlkeld fsseelnfve rfglcsklqm
      481 iecpgkvlke pmlvnsvneq ikmtpvirgf qekqlnvvpe kelccavfvv netagnpcle
      541 endvvkfyte liggckfrgi riganenrga qsimydatkn eyafyknctl ntgigrfeia
      601 ateaknmfer lpdkeqkvlm fiiiskrqln aygfvkhycd htigvanqhi tsetvtkala
      661 slrhekgskr ifyqialkin aklgginqel dwseiaeisp eekerrktmp ltmyvgidvt
      721 hptsysgidy siaavvasin pggtiyrnmi vtqeecrpge ravahgrert dileakfvkl
      781 lrefaenndn rapahivvyr dgvsdsemlr vshdelrslk sevkqfmser dgedpepkyt
      841 fiviqkrhnt rllrrmekdk pvvnkdltpa etdvavaavk qweedmkesk etgivnpssg
      901 ttvdklivsk ykfdfflash hgvlgtsrpg hytvmyddkg msqdevykmt yglaflsarc
      961 rkpislpvpv hyahlsceka kelyrtykeh yigdyaqprt rhemehflqt nvkypgmsfa
//

If /coded_by= only occurs under FEATURES -> CDS, you can simply grep for it and strip everything up to the "=" with sed:

grep "/coded_by=" gp_file | sed 's/.*=//'

"AF180730.1:21..3083"

To keep the qualifier name and only remove the spaces:

grep "/coded_by=" gp_file | sed 's/ //g'

/coded_by="AF180730.1:21..3083"

This only works if the files contain no other lines with /coded_by= that you do not want.

To run this over every .gp file in the current folder and append the output to resultfile:

for s in *.gp ; do grep "/coded_by=" "$s" | sed 's/.*=//' >> resultfile ; done

Note that >> appends to the file, so before re-running, delete the previous resultfile or change its name. To see the output without writing it to a file, remove >> resultfile.

If the file names are listed one per line in a file, you can use this command instead:

while read -r line ; do grep "/coded_by=" "$line" | sed 's/.*=//' >> resultfile ; done < list_file

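You can also take a look at grep's -oP options (GNU grep), which print only the part of the line that matches a Perl-compatible regular expression.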
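Since the question prefers Python, the same text-matching idea can be sketched with the standard re module. This is only a rough sketch: the names CODED_BY and find_coded_by are made up for illustration, and the pattern assumes the simple ACCESSION:START..END form from the question, so values such as join(...) or complement(...) are skipped.

```python
import re

# Assumption: only the simple form /coded_by="ACCESSION:START..END" is handled.
CODED_BY = re.compile(r'/coded_by="([^":]+):(\d+)\.\.(\d+)"')


def find_coded_by(text):
    """Return (accession, start, end) for each simple /coded_by= qualifier in text."""
    return [(acc, int(start), int(end)) for acc, start, end in CODED_BY.findall(text)]


# The CDS feature lines from the record in the question.
sample = '''     CDS             1..1020
                     /gene="rde-1"
                     /coded_by="AF180730.1:21..3083"
'''

print(find_coded_by(sample))  # [('AF180730.1', 21, 3083)]
```

Having the coordinates as integers makes it easy to shift or resize them afterwards, which the sed output does not give you directly.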

You should also be able to do this with Biopython, which parses each record and exposes the qualifiers of every feature.
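A minimal sketch of the Biopython route, assuming Biopython is installed and the files are GenPept flat files like the one in the question. SeqIO.parse with the "genbank" format, feature.type and feature.qualifiers are real Biopython names; split_coded_by, coded_by_of and parse_genpept are hypothetical helpers written for this example, and only the simple ACCESSION:START..END form is split.

```python
def split_coded_by(value):
    """Split 'ACC:START..END' into (accession, start, end); None for any other form."""
    accession, sep, span = value.partition(":")
    start, dots, end = span.partition("..")
    if not (sep and dots and start.isdigit() and end.isdigit()):
        return None  # e.g. join(...) or complement(...), not handled in this sketch
    return accession, int(start), int(end)


def coded_by_of(record):
    """Yield the parsed /coded_by values found on the CDS features of one record."""
    for feature in record.features:
        if feature.type == "CDS":
            # Qualifier values are stored as lists of strings.
            for value in feature.qualifiers.get("coded_by", []):
                yield split_coded_by(value)


def parse_genpept(path):
    """Map each record ID in the file at path to its parsed /coded_by values."""
    # Imported here so the helpers above can be used without Biopython installed.
    from Bio import SeqIO
    return {rec.id: list(coded_by_of(rec)) for rec in SeqIO.parse(path, "genbank")}
```

For the value in the question, split_coded_by("AF180730.1:21..3083") returns ('AF180730.1', 21, 3083), and parse_genpept("gp_file") would collect such tuples per record for the gp_file used in the grep examples.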
