Question

Extract particular data from a text file using Perl

10.2 years ago

amolnarwade1415 • 0

Hello, I am new to Perl. I want to extract particular data from a file. I have a big text file that contains all the information about each protein, and from it I want only the gene names and copy numbers. File:

#BEGIN_ECARDFILE
# Entry_ID:
CC1227A.1

# Accession_No.:
UA0001227

# Gene_Ontology:
>>> Function: Not Available
||
>>> Process: Not Available

# Location:
Cytoplasm

# Blattner_Number:
b3038

# Gene_Sequence:
ATGGA

# Gene_Name:
ygiC

# Preceding_Gene:
ygiB

# Copy Number:
Unknown

# RNA_Copy_No.: Log phase (2max): 0.31 Stationary phase (2max): 0.17 Ref: Nature Biotech., 18, 1262-68, Dec. 2000 (PMID=11101804)
# Genbank_ID_(DNA):
G1789416

Output:
# Gene_Name:
ygiC
# Copy Number:
Unknown

This block repeats for every protein, around 100 times in total, and I want to store the extracted fields in a new file. What should I do? Should I use pattern matching? Thank you.

sequence • 19k views

A possible Unix solution

grep -A1 '^# Gene_Name\|^# Copy Number' input.file > results.file

written 10.2 years ago by rbagnall ★ 1.8k

Please edit + format your question.

written 10.2 years ago by Pierre Lindenbaum 166k

Ram · Answer 1 · 2015-07-13

10.2 years ago

nterhoeven ▴ 120

Since you asked for a solution in Perl:

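#!/usr/bin/perl
use strict;
use warnings;

# The input file is the first command-line argument; output goes to STDOUT.
my $file = shift(@ARGV);
die "Usage: $0 input.file\n" unless defined $file;

open(my $fh, '<', $file) or die "Cannot open $file: $!";
while (my $line = <$fh>)
{
    # Skip everything except the two header lines of interest.
    next unless $line =~ /^# Gene_Name:/ || $line =~ /^# Copy Number:/;

    print $line;
    # The value is expected on the line directly after the header.
    my $entry = <$fh>;
    print $entry if defined $entry;
}
close($fh) or die "Cannot close $file: $!";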
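
The question also asks for the result to be stored in a new file. Redirecting the script's output with `>` is enough for that, but the file can also be written from Perl directly. The following is only a sketch: the sub name `extract_fields` and the two-argument calling convention are made up for illustration, and it assumes each value sits on the single line after its header.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# extract_fields is a hypothetical helper name, chosen for this sketch.
# It copies every "# Gene_Name:" / "# Copy Number:" header plus the line
# after it from $in_path to $out_path and returns the number of headers found.
sub extract_fields {
    my ($in_path, $out_path) = @_;
    open(my $in,  '<', $in_path)  or die "Cannot open $in_path: $!";
    open(my $out, '>', $out_path) or die "Cannot write $out_path: $!";
    my $count = 0;
    while (my $line = <$in>) {
        next unless $line =~ /^# (?:Gene_Name|Copy Number):/;
        my $value = <$in>;              # assumed: value is the next line
        print {$out} $line;
        print {$out} $value if defined $value;
        $count++;
    }
    close($in);
    close($out) or die "Cannot close $out_path: $!";
    return $count;
}

# Run only when called with an input and an output file name.
if (@ARGV == 2) {
    my $n = extract_fields(@ARGV);
    print STDERR "Wrote $n fields to $ARGV[1]\n";
}
```

Called as `perl script.pl input.file results.file`, this leaves the input untouched and overwrites `results.file` if it already exists.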

Thank you Nterhoeven

written 10.2 years ago by amolnarwade1415 • 0

How do we push the extracted values to an Informix database?

I'm new to Perl. Any help is appreciated.

written 9.1 years ago by sboddeboyina • 0

Ram · Answer 2 · 2015-07-13

It seems that your data come from http://ccdb.wishartlab.com/CCDB/cgi-bin/getecards_all.cgi

Your snippet doesn't show that some records can span more than one line. Here is an awk solution:

curl -s "http://ccdb.wishartlab.com/CCDB/cgi-bin/getecards_all.cgi" |\
awk '/^#/ {IN=0;} /^# Gene_Name:$/ {IN=1; printf("\n");next;} /^# Copy Number:$/ {IN=1;printf("\t");next;} {if(IN==1) printf("%s",$0);}'

Beware: some records contain more than one gene name.

ygiC    Unknown
yraM    Unknown
napH    Unknown
yhfX    Unknown
cstA    Unknown
infB or ssyG    1150 Molecules/Cell In: Glucose minimal mediaRef: Neidhardt et al., Encyclopedia of E. coli and Salmonella, 1997
ugd or pmrE or udg    Unknown
nohA    Unknown
yegJ    Not Available
rapA or hepA    Unknown
yiaW    Unknown
yedR    Not Available
maeB    Unknown
narU    Unknown
rpsM    18,700 (rich media)Ref: Goodsell, D.S., Trends Biochem. Sci., 16, 203-206, Jun. 1991 (PMID=1891800)
atpF or papF or uncF    Unknown
arcC    Not Available
basR or pmrA    Unknown
yahM    Unknown
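
Since the original question asked for Perl, the same state-machine idea can be sketched in Perl as well. This is an illustration rather than a tested port: the sub name `parse_ecards` is hypothetical, it assumes `# Gene_Name:` comes before `# Copy Number:` within each record (as in the sample), and it joins multi-line values with a single space, whereas the awk one-liner concatenates them directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_ecards is a hypothetical name used for this sketch.
# It takes the lines of an ecard file and returns one
# "gene<TAB>copy number" string per record.
sub parse_ecards {
    my @lines = @_;
    my (@rows, %rec, $field);
    my $flush = sub {
        push @rows, join("\t", $rec{gene} // '', $rec{copy} // '') if %rec;
        %rec = ();
    };
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^#/) {
            # Any header ends the current field; only two headers start one.
            if    ($line =~ /^# Gene_Name:\s*$/)   { $flush->(); $field = 'gene'; }
            elsif ($line =~ /^# Copy Number:\s*$/) { $field = 'copy'; }
            else                                   { $field = undef; }
            next;
        }
        # Ignore blank lines and lines outside the two fields of interest.
        next unless defined $field && $line =~ /\S/;
        $rec{$field} = defined $rec{$field} ? "$rec{$field} $line" : $line;
    }
    $flush->();    # emit the last record
    return @rows;
}

# When a file name is given, print one row per record.
if (@ARGV) {
    open(my $fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    print "$_\n" for parse_ecards(<$fh>);
    close($fh);
}
```

Starting a new record at each `# Gene_Name:` header mirrors the awk command, which prints a newline at that point; a file whose records order the fields differently would need another record boundary.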