I have a whole-genome text file from which I want to extract only a particular product and the locus tag of that product. I have to write a Perl script for this. For example, I have a genome text file that looks like this:
LOCUS NC_000962 4411532 bp DNA linear CON 14-DEC-2017
DEFINITION Mycobacterium tuberculosis H37Rv, complete genome.
ACCESSION NC_000962
VERSION NC_000962.3
DBLINK BioProject: PRJNA57777 Assembly: GCF_000195955.2
KEYWORDS RefSeq; complete genome.
SOURCE Mycobacterium tuberculosis H37Rv
ORGANISM Mycobacterium tuberculosis H37Rv Bacteria; Actinobacteria; Corynebacteriales; Mycobacteriaceae; Mycobacterium; Mycobacterium tuberculosis complex.
REFERENCE 1
AUTHORS Lew,J.M., Kapopoulou,A., Jones,L.M. and Cole,S.T.
TITLE TubercuList--10 years after
JOURNAL Tuberculosis (Edinb) 91 (1), 1-7 (2011)
PUBMED 20980199
REFERENCE 2
AUTHORS Camus,J.C., Pryor,M.J., Medigue,C. and Cole,S.T.
TITLE Re-annotation of the genome sequence of Mycobacterium tuberculosis H37Rv
JOURNAL Microbiology (Reading, Engl.) 148 (PT 10), 2967-2973 (2002)
PUBMED 12368430
REFERENCE 3
AUTHORS Cole,S.T., Brosch,R., Parkhill,J., Garnier,T., Churcher,C., Harris,D., Gordon,S.V., Eiglmeier,K., Gas,S., Barry,C.E. III, Tekaia,F., Badcock,K., Basham,D., Brown,D., Chillingworth,T., Connor,R., Davies,R., Devlin,K., Feltwell,T., Gentles,S., Hamlin,N., Holroyd,S., Hornsby,T., Jagels,K., Krogh,A., McLean,J., Moule,S., Murphy,L., Oliver,K., Osborne,J., Quail,M.A., Rajandream,M.A., Rogers,J., Rutter,S., Seeger,K., Skelton,J., Squares,R., Squares,S., Sulston,J.E., Taylor,K., Whitehead,S. and Barrell,B.G.
TITLE Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence
JOURNAL Nature 393 (6685), 537-544 (1998)
PUBMED 9634230
REMARK Erratum:[Nature 1998 Nov 12;396(6707):190]
REFERENCE 4 (bases 1 to 4411532)
CONSRTM NCBI Genome Project
TITLE Direct Submission
JOURNAL Submitted (06-FEB-2013) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA
REFERENCE 5 (bases 1 to 4411532)
AUTHORS Lew,J.M.
JOURNAL Submitted (18-DEC-2012) Lew J., Ecole Polytechnique Federale de Lausanne, CH-1015, Lausanne, Switzerland, and the Swiss Institute of Bioinformatics, CMU - Rue Michel-Servet 1, 1211 Geneva 4, SWITZERLAND
COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence is identical to AL123456. On Feb 6, 2013 this sequence version replaced NC_000962.2. RefSeq Category: Reference Genome FGS: First Genome sequenced MOD: Model Organism TYS: Designated Type Strain UPR: UniProt Genome Note: This annotation is from the TubercuList website, Release 26, Dec 2012 (URL: http://tuberculist.epfl.ch) (email: tuberculist@epfl.ch). COMPLETENESS: full length.
FEATURES Location/Qualifiers
source 1..4411532
/organism="Mycobacterium tuberculosis H37Rv"
/mol_type="genomic DNA"
/strain="H37Rv"
/db_xref="taxon:83332"
gene 1..1524
/gene="dnaA"
/locus_tag="Rv0001"
/experiment="DESCRIPTION:Mutation analysis, gene expression[PMID: 10375628]"
/db_xref="GeneID:885041"
CDS 1..1524
/experiment="EXISTENCE:Mass spectrometry[PMID:15525680]"
/experiment="EXISTENCE:Mass spectrometry[PMID:21085642]"
/experiment="EXISTENCE:Mass spectrometry[PMID:21920479]"
/inference="protein motif:PROSITE:PS00017"
/inference="protein motif:PROSITE:PS01008"
/codon_start=1
/transl_table=11
/product="chromosomal replication initiator protein DnaA"
/protein_id="NP_214515.1"
/translation="MTDDPGSGFTTVWNAVVSELNGDPKVDDGPSSDANLSAPLTPQQ
RAWLNLVQPLTIVEGFALLSVPSSFVQNEIERHLRAPITDALSRRLGHQIQLGVRIAP
PATDEADDTTVPPSENPATTSPDTTTDNDEIDDSAAARGDNQHSWPSYFTERPHNTDS
ATAGVTSLNRRYTFDTFVIGASNRFAHAAALAIAEAPARAYNPLFIWGESGLGKTHLL
HAAGNYAQRLFPGMRVKYVSTEEFTNDFINSLRDDRKVAFKRSYRDVDVLLVDDIQFI
EGKEGIQEEFFHTFNTLHNANKQIVISSDRPPKQLATLEDRLRTRFEWGLITDVQPPE
LETRIAILRKKAQMERLAVPDDVLELIASSIERNIRELEGALIRVTAFASLNKTPIDK
ALAEIVLRDLIADANTMQISAATIMAATAEYFDTTVEELRGPGKTRALAQSRQIAMYL
CRELTDLSLPKIGQAFGRDHTTVMYAQRKILSEMAERREVFDHVKELTTRIRQRSKR"
gene complement(77619..78896)
/gene="glyA2"
/locus_tag="Rv0070c"
/db_xref="GeneID:886983"
CDS complement(77619..78896)
characterization[PMID:12913008]"
/inference="protein motif:PROSITE:PS00096"
/note="Belongs to the ShmT family. Cofactor: pyridoxal
phosphate."
/product="serine hydroxymethyltransferase"
/protein_id="NP_214584.1"
/translation="MNTLNDSLTAFDPDIAALIDGELRRQESGLEMIASENYAPLAVM
QAQGSVLTNKYAEGYPGRRYYGGCEFVDGVEQLAIDRVKALFGAEYANVQPHSGATAN
AATMHALLNPGDTILGLSLAHGGHLTHGMRINFSGKLYHATAYEVSKEDYLVDMDAVA
EAARTHRPKMIIAGWSAYPRQLDFARFRAIADEVDAVLMVDMAHFAGLVAAGVHPSPV
PHAHVVTSTTHKTLGGPRGGIILCNDPAIAKKINSAVFPGQQGGPLEHVIAAKATAFK
MAAQPEFAQRQQRCLDGARILAGRLTQPDVAERGIAVLTGGTDVHLVLVDLRDAELDG
QQAEDRLAAVDITVNRNAVPFDPRPPMITSGLRIGTPALAARGFSHNDFRAVADLIAA
ALTATNDDQLGPLRAQVQRLAARYPLYPELHRT"
gene 79486..80193
/locus_tag="Rv0071"
/db_xref="GeneID:886988"
CDS 79486..80193
/note="group II intron maturase family. Contains 5 VDP
repeats at N-terminus."
/product="maturase"
/protein_id="NP_214585.1"
/translation="MSSITVSVDPVDPVDPVDPVDPVDAVVAAGSDGLTVARIESEIG
ALEFLNELRTELKSGQFRPQPVRERKIPKPGGLGKVRRLGIPTVADRVVQAALKLVLE
PIFETDFEPVSYGFRPARRAHDTIAEIHLFGTQEYRWVLDADIKACFDRIDHADLMDR
VRHRIKDKRVLRLVNWQRIRHRWNWTDVRRWLTDPTGRWHPISADGITLFNPAAVPIR
RYRYRGNTIPTPWTQAV"
repeat_region 79507..79551
/note="5 x 9 bp GTGGACCCG repeats"
repeat_region 80236..80550
/note="(MTV030.15), len: 315 nt. Probable REP'-1
pseudogene fragment"
After reading this file in Perl, I can go through all of its content and print every product name and locus tag, but I can't work out how to print the locus tag only for products whose name contains "methyltransferase", like this: (/locus_tag="Rv0070c" - /product="serine hydroxymethyltransferase"). Can anyone help me solve this problem?
Would it be possible to share the Perl script (if not too long)?
Are you using BioPerl to read the EMBL/GenBank files?
Duplicate post? How to find locus tag for a particular product and print both locus tag and product in Perl?
I can probably help you in Python, but not in Perl. Why does it need to be Perl?
That said, you can use BioPerl in (almost) exactly the same way as Biopython.
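In case it helps, here is a minimal Python sketch of the matching logic itself. It uses only the standard library rather than Biopython, and it assumes that each /product qualifier comes after the /locus_tag of the same gene, as in the excerpt above. The function name find_products and the shortened SAMPLE text are made up for illustration; for a real run you would read the whole file into a string instead.

```python
import re

# Matches /locus_tag="..." or /product="..." anywhere in the text.
# [^"]* also spans line breaks, so values wrapped over lines are captured whole.
QUALIFIER = re.compile(r'/(locus_tag|product)="([^"]*)"')


def find_products(text, keyword):
    """Return (locus_tag, product) pairs whose product contains keyword."""
    hits = []
    current_tag = None  # most recent locus tag seen so far
    for name, value in QUALIFIER.findall(text):
        value = " ".join(value.split())  # undo line wrapping inside the value
        if name == "locus_tag":
            current_tag = value
        elif keyword.lower() in value.lower():
            hits.append((current_tag, value))
    return hits


# Shortened stand-in for the feature table in the question (illustration only).
SAMPLE = '''
gene 1..1524
/gene="dnaA"
/locus_tag="Rv0001"
CDS 1..1524
/product="chromosomal replication initiator protein DnaA"
gene complement(77619..78896)
/gene="glyA2"
/locus_tag="Rv0070c"
CDS complement(77619..78896)
/product="serine hydroxymethyltransferase"
gene 79486..80193
/locus_tag="Rv0071"
CDS 79486..80193
/product="maturase"
'''

for tag, product in find_products(SAMPLE, "methyltransferase"):
    print(f'/locus_tag="{tag}" - /product="{product}"')
# prints: /locus_tag="Rv0070c" - /product="serine hydroxymethyltransferase"
```

The idea is just to remember the last locus tag seen and pair it with the next product that matches, so the same loop should carry over to Perl with a global regex match over the file contents. A proper GenBank parser (BioPerl or Biopython) is the more robust route if the layout of your file differs from this excerpt.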