4.7 years ago
supertech
How do I parse the FEATURES -> CDS -> /coded_by="..." qualifier in a GenBank entry? In my example it is:
/coded_by="AF180730.1:21..3083"
I need to grab the accession number and manipulate the coordinates for each GenBank record. A solution in Perl, Python, or any other language is fine, though I prefer Python. Thanks.
LOCUS AF180730_1 1020 aa linear INV 07-NOV-1999
DEFINITION RNA interference promoting factor RDE-1 [Caenorhabditis elegans].
ACCESSION AAF06159
VERSION AAF06159.1
DBSOURCE accession AF180730.1
KEYWORDS .
SOURCE Caenorhabditis elegans
ORGANISM Caenorhabditis elegans
Eukaryota; Metazoa; Ecdysozoa; Nematoda; Chromadorea; Rhabditida;
Rhabditoidea; Rhabditidae; Peloderinae; Caenorhabditis.
REFERENCE 1 (residues 1 to 1020)
AUTHORS Tabara,H., Sarkissian,M., Kelly,W.G., Fleenor,J., Grishok,A.,
Timmons,L., Fire,A. and Mello,C.C.
TITLE The rde-1 gene, RNA interference, and transposon silencing in C.
elegans
JOURNAL Cell 99 (2), 123-132 (1999)
PUBMED 10535731
REFERENCE 2 (residues 1 to 1020)
AUTHORS Tabara,H., Sarkissian,M., Kelly,W.G., Grishok,A., Timmons,L.,
Fire,A. and Mello,C.C.
TITLE Direct Submission
JOURNAL Submitted (25-AUG-1999) Medical School, Program in Molecular
Medicine, University of Massachusetts, 373 Plantation Street,
Worcester, MA 01605, USA
COMMENT Method: conceptual translation supplied by author.
FEATURES Location/Qualifiers
source 1..1020
/organism="Caenorhabditis elegans"
/db_xref="taxon:6239"
/chromosome="V"
/clone="yk296b10"
Protein 1..1020
/product="RNA interference promoting factor RDE-1"
/name="similar to Caenorhabditis elegans K08H10.7"
Region 280..419
/region_name="PAZ_argonaute_like"
/note="PAZ domain, argonaute_like subfamily. Argonaute is
part of the RNA-induced silencing complex (RISC), and is
an endonuclease that plays a key role in the RNA
interference pathway. The PAZ domain has been named after
the proteins Piwi,Argonaut, and Zwille; cd02846"
/db_xref="CDD:239212"
Site order(349,364,381,385,402,409,411)
/site_type="other"
/note="nucleic acid-binding interface [nucleotide
binding]"
/db_xref="CDD:239212"
Region 467..982
/region_name="Piwi_ago-like"
/note="Piwi_ago-like: PIWI domain, Argonaute-like
subfamily. Argonaute is the central component of the
RNA-induced silencing complex (RISC) and related
complexes. The PIWI domain is the C-terminal portion of
Argonaute and consists of two subdomains, one of...;
cd04657"
/db_xref="CDD:240015"
Site order(632,636,648..651,654,671,674,678,682)
/site_type="other"
/note="5' RNA guide strand anchoring site"
/db_xref="CDD:240015"
Site order(718,720,801,974)
/site_type="active"
/db_xref="CDD:240015"
CDS 1..1020
/gene="rde-1"
/coded_by="AF180730.1:21..3083"
ORIGIN
1 mssnfpelek gfyrhsldpe mkwlarptgk cdgkfyekkv lllvnwfkfs skiydreyye
61 yevkmtkevl nrkpgkpfpk kteipipdra klfwqhlrhe kkqtdfiled yvfdekdtvy
121 svcrlntvts kmlvsekvvk kdsekkdekd lekkilytmi ltyrkkfhln fsrenpekde
181 eanrsykflk nvmtqkvrya pfvneeikvq faknfvydnn silrvpesfh dpnrfeqsle
241 vaprieawfg iyigikelfd gepvlnfaiv dklfynapkm slldyllliv dpqscnddvr
301 kdlktklmag kmtirqaarp rirqllenlk lkcaevwdne msrlterhlt fldlceensl
361 vykvtgksdr grnakkydtt lfkiyeenkk fiefphlplv kvksgakeya vpmehlevhe
421 kpqryknrid lvmqdkflkr atrkphdyke ntlkmlkeld fsseelnfve rfglcsklqm
481 iecpgkvlke pmlvnsvneq ikmtpvirgf qekqlnvvpe kelccavfvv netagnpcle
541 endvvkfyte liggckfrgi riganenrga qsimydatkn eyafyknctl ntgigrfeia
601 ateaknmfer lpdkeqkvlm fiiiskrqln aygfvkhycd htigvanqhi tsetvtkala
661 slrhekgskr ifyqialkin aklgginqel dwseiaeisp eekerrktmp ltmyvgidvt
721 hptsysgidy siaavvasin pggtiyrnmi vtqeecrpge ravahgrert dileakfvkl
781 lrefaenndn rapahivvyr dgvsdsemlr vshdelrslk sevkqfmser dgedpepkyt
841 fiviqkrhnt rllrrmekdk pvvnkdltpa etdvavaavk qweedmkesk etgivnpssg
901 ttvdklivsk ykfdfflash hgvlgtsrpg hytvmyddkg msqdevykmt yglaflsarc
961 rkpislpvpv hyahlsceka kelyrtykeh yigdyaqprt rhemehflqt nvkypgmsfa
//
If /coded_by= is only found under FEATURES -> CDS, you can easily grep for it:
"AF180730.1:21..3083"
/coded_by="AF180730.1:21..3083"
This only works if no other lines that start with /coded_by= appear in the file, since grep would pick those up as well.
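Since the question asks for Python, here is a minimal sketch of the same idea using only the standard library. It assumes the simple form ACCESSION:START..END shown in the example; qualifiers wrapped in complement(...) or join(...) would need extra handling. The names CODED_BY and parse_coded_by are made up for this illustration.

```python
import re

# Matches the simple form /coded_by="ACCESSION.VERSION:START..END".
# Assumption: no complement(...) or join(...) wrappers around the location.
# Optional < and > partial-location markers and stray spaces are tolerated.
CODED_BY = re.compile(r'/coded_by="\s*([^":\s]+):<?(\d+)\.\.>?(\d+)\s*"')


def parse_coded_by(text):
    """Return a list of (accession, start, end) tuples found in text."""
    return [(acc, int(start), int(end))
            for acc, start, end in CODED_BY.findall(text)]


record = '''     CDS             1..1020
                     /gene="rde-1"
                     /coded_by="AF180730.1:21..3083"
'''

for accession, start, end in parse_coded_by(record):
    # GenBank coordinates are 1-based and inclusive,
    # so the span length is end - start + 1.
    print(accession, start, end, end - start + 1)
```

Because the coordinates come back as integers, shifting or resizing the interval is plain arithmetic on start and end.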
This code greps all the gp files in the folder and writes the output to resultfile.
Note that >> appends to the file, so if you want to re-run it, delete the previous resultfile or change the name. If you want to see how the output looks without sending it to a file, you can remove the >> resultfile part.
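For comparison, a rough Python counterpart of the folder-wide search might look like the sketch below. The function name collect_coded_by, the "*.gp" file pattern, and the tab-separated output layout are assumptions made for this example, not part of the original commands. Opening the result file in "w" mode overwrites it on each run, while "a" mode appends the way >> does.

```python
import re
from pathlib import Path

# Any /coded_by="..." qualifier, whatever the location inside the quotes.
CODED_BY_LINE = re.compile(r'/coded_by="[^"]*"')


def collect_coded_by(folder, result_path, pattern="*.gp", append=False):
    """Write every /coded_by qualifier from matching files to result_path.

    Returns the number of qualifiers written.
    """
    mode = "a" if append else "w"  # "a" behaves like >>, "w" like >
    count = 0
    with open(result_path, mode) as out:
        for path in sorted(Path(folder).glob(pattern)):
            for match in CODED_BY_LINE.findall(path.read_text()):
                # One line per hit: source file name, tab, qualifier.
                out.write(f"{path.name}\t{match}\n")
                count += 1
    return count


# Example call (hypothetical paths):
# collect_coded_by(".", "resultfile")
```

With the default append=False, re-running the function does not require deleting the old result file first.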
If you have your list of files in a file, you can use this command:
You can also take a look at the grep -oP option.
You should also be able to use Biopython to do this task: