How to fInd locus tag for a particular product and print both locus tag and product in perl?
2
0
Entering edit mode
6.8 years ago
bioinfo&me • 0

I have a whole genome text file from which i want to extract only a particular product and locus tag of that product. I have to write a script in perl for the same.
for example, i have a genome text file which looks like -

LOCUS       NC_000962            4411532 bp    DNA     linear   CON 14-DEC-2017
DEFINITION  Mycobacterium tuberculosis H37Rv, complete genome.
ACCESSION   NC_000962
VERSION     NC_000962.3
DBLINK      BioProject: PRJNA57777
            Assembly: GCF_000195955.2
KEYWORDS    RefSeq; complete genome.
SOURCE      Mycobacterium tuberculosis H37Rv
  ORGANISM  Mycobacterium tuberculosis H37Rv
            Bacteria; Actinobacteria; Corynebacteriales; Mycobacteriaceae;
            Mycobacterium; Mycobacterium tuberculosis complex.
REFERENCE   1
  AUTHORS   Lew,J.M., Kapopoulou,A., Jones,L.M. and Cole,S.T.
  TITLE     TubercuList--10 years after
  JOURNAL   Tuberculosis (Edinb) 91 (1), 1-7 (2011)
   PUBMED   20980199
REFERENCE   2
  AUTHORS   Camus,J.C., Pryor,M.J., Medigue,C. and Cole,S.T.
  TITLE     Re-annotation of the genome sequence of Mycobacterium tuberculosis
            H37Rv
  JOURNAL   Microbiology (Reading, Engl.) 148 (PT 10), 2967-2973 (2002)
   PUBMED   12368430
REFERENCE   3
  AUTHORS   Cole,S.T., Brosch,R., Parkhill,J., Garnier,T., Churcher,C.,
            Harris,D., Gordon,S.V., Eiglmeier,K., Gas,S., Barry,C.E. III,
            Tekaia,F., Badcock,K., Basham,D., Brown,D., Chillingworth,T.,
            Connor,R., Davies,R., Devlin,K., Feltwell,T., Gentles,S.,
            Hamlin,N., Holroyd,S., Hornsby,T., Jagels,K., Krogh,A., McLean,J.,
            Moule,S., Murphy,L., Oliver,K., Osborne,J., Quail,M.A.,
            Rajandream,M.A., Rogers,J., Rutter,S., Seeger,K., Skelton,J.,
            Squares,R., Squares,S., Sulston,J.E., Taylor,K., Whitehead,S. and
            Barrell,B.G.
  TITLE     Deciphering the biology of Mycobacterium tuberculosis from the
            complete genome sequence
  JOURNAL   Nature 393 (6685), 537-544 (1998)
   PUBMED   9634230
  REMARK    Erratum:[Nature 1998 Nov 12;396(6707):190]
REFERENCE   4  (bases 1 to 4411532)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (06-FEB-2013) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA
REFERENCE   5  (bases 1 to 4411532)
  AUTHORS   Lew,J.M.
  JOURNAL   Submitted (18-DEC-2012) Lew J., Ecole Polytechnique Federale de
            Lausanne, CH-1015, Lausanne, Switzerland, and the Swiss Institute
            of Bioinformatics, CMU - Rue Michel-Servet 1, 1211 Geneva 4,
            SWITZERLAND
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence is identical to AL123456.
            On Feb 6, 2013 this sequence version replaced NC_000962.2.
            RefSeq Category: Reference Genome
                        FGS: First Genome sequenced
                        MOD: Model Organism
                        TYS: Designated Type Strain
                        UPR: UniProt Genome
            Note:
            This annotation is from the TubercuList website, Release 26, Dec
            2012 (URL: http://tuberculist.epfl.ch) (email:
            tuberculist@epfl.ch).
            COMPLETENESS: full length.
FEATURES             Location/Qualifiers
     source          1..4411532
                     /organism="Mycobacterium tuberculosis H37Rv"
                     /mol_type="genomic DNA"
                     /strain="H37Rv"
                     /db_xref="taxon:83332"
     gene            1..1524
                     /gene="dnaA"
                     /locus_tag="Rv0001"
                     /experiment="DESCRIPTION:Mutation analysis, gene
                     expression[PMID: 10375628]"
                     /db_xref="GeneID:885041"
     CDS             1..1524
                     /experiment="EXISTENCE:Mass spectrometry[PMID:15525680]"
                     /experiment="EXISTENCE:Mass spectrometry[PMID:21085642]"
                     /experiment="EXISTENCE:Mass spectrometry[PMID:21920479]"
                     /inference="protein motif:PROSITE:PS00017"
                     /inference="protein motif:PROSITE:PS01008"
                     /codon_start=1
                     /transl_table=11
                     /product="chromosomal replication initiator protein DnaA"
                     /protein_id="NP_214515.1"
                     /translation="MTDDPGSGFTTVWNAVVSELNGDPKVDDGPSSDANLSAPLTPQQ
                     RAWLNLVQPLTIVEGFALLSVPSSFVQNEIERHLRAPITDALSRRLGHQIQLGVRIAP
                     PATDEADDTTVPPSENPATTSPDTTTDNDEIDDSAAARGDNQHSWPSYFTERPHNTDS
                     ATAGVTSLNRRYTFDTFVIGASNRFAHAAALAIAEAPARAYNPLFIWGESGLGKTHLL
                     HAAGNYAQRLFPGMRVKYVSTEEFTNDFINSLRDDRKVAFKRSYRDVDVLLVDDIQFI
                     EGKEGIQEEFFHTFNTLHNANKQIVISSDRPPKQLATLEDRLRTRFEWGLITDVQPPE
                     LETRIAILRKKAQMERLAVPDDVLELIASSIERNIRELEGALIRVTAFASLNKTPIDK
                     ALAEIVLRDLIADANTMQISAATIMAATAEYFDTTVEELRGPGKTRALAQSRQIAMYL
                     CRELTDLSLPKIGQAFGRDHTTVMYAQRKILSEMAERREVFDHVKELTTRIRQRSKR"
  gene            complement(77619..78896)
                     /gene="glyA2"
                     /locus_tag="Rv0070c"
                     /db_xref="GeneID:886983"
     CDS             complement(77619..78896)
                     characterization[PMID:12913008]"
                     /inference="protein motif:PROSITE:PS00096"
                     /note="Belongs to the ShmT family. Cofactor: pyridoxal
                     phosphate."
                     /product="serine hydroxymethyltransferase"
                     /protein_id="NP_214584.1"
                     /translation="MNTLNDSLTAFDPDIAALIDGELRRQESGLEMIASENYAPLAVM
                     QAQGSVLTNKYAEGYPGRRYYGGCEFVDGVEQLAIDRVKALFGAEYANVQPHSGATAN
                     AATMHALLNPGDTILGLSLAHGGHLTHGMRINFSGKLYHATAYEVSKEDYLVDMDAVA
                     EAARTHRPKMIIAGWSAYPRQLDFARFRAIADEVDAVLMVDMAHFAGLVAAGVHPSPV
                     PHAHVVTSTTHKTLGGPRGGIILCNDPAIAKKINSAVFPGQQGGPLEHVIAAKATAFK
                     MAAQPEFAQRQQRCLDGARILAGRLTQPDVAERGIAVLTGGTDVHLVLVDLRDAELDG
                     QQAEDRLAAVDITVNRNAVPFDPRPPMITSGLRIGTPALAARGFSHNDFRAVADLIAA
                     ALTATNDDQLGPLRAQVQRLAARYPLYPELHRT".

By fetching this file in perl i am able to read all the content of file and can print product name and locus tag but after that i can't print locus tag of a particular product which contains methyltransferase like this (/locus_tag="Rv0070c" - /product="serine hydroxymethyltransferase"). Can anyone help me to solve this problem.???

genome • 3.4k views
ADD COMMENT
0
Entering edit mode

I tidied your post. Please make use of the code button (the one with 101 010) when you paste some output. Highlight the text and then click the button.

Thanks!

ADD REPLY
0
Entering edit mode

Do I understand correctly you are able to print out all "locus_tag" "product" pairs? If so then why not simply grep the desired lines from the total output?

ADD REPLY
2
Entering edit mode
6.8 years ago
Sej Modha 5.3k

BioPython solution to print locus and product tags for all products that contain 'methyltransferase':

from Bio import SeqIO
records = SeqIO.parse("sequence.gb", "genbank")
for seq_record in records:
    for feature in seq_record.features:
        if 'CDS' in (feature.type):
            if 'methyltransferase' in str(feature.qualifiers["product"]):
                print(feature.qualifiers["locus_tag"],feature.qualifiers['product'])

BioPython solution to print locus and product tags for CDS that contain Rv0070c as locus_tag:

from Bio import SeqIO
records = SeqIO.parse("sequence.gb", "genbank")
for seq_record in records:
    for feature in seq_record.features:
        if 'CDS' in (feature.type):
            if 'Rv0070c' in str(feature.qualifiers["locus_tag"]):
                print(feature.qualifiers["locus_tag"],feature.qualifiers['product'])
ADD COMMENT
0
Entering edit mode

Thank you Sej Modha Sir for your help but i have to write a perl script for the same. If you know about its perl script please help me.

ADD REPLY
0
Entering edit mode

Sorry, but I can't help with that. PS: I am not sir!

ADD REPLY
0
Entering edit mode
6.8 years ago

very dirty hack but given your file is well formatted and you are really stuck it might do the trick:

grep -P "/locus_tag|/product" <your file> |  paste - - | grep 'serine hydroxymethyltransferase'
ADD COMMENT

Login before adding your answer.

Traffic: 1309 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6