Extract particular data from a text file using Perl
9.5 years ago

Hello, I am new to Perl. I want to extract particular data from a file. I have a big text file containing all the information about each protein; from it I want only the gene name and copy number. File:

#BEGIN_ECARDFILE
# Entry_ID:
CC1227A.1

# Accession_No.:
UA0001227

# Gene_Ontology:
>>> Function: Not Available
||
>>> Process: Not Available

# Location:
Cytoplasm

# Blattner_Number:
b3038

# Gene_Sequence:
ATGGA

# Gene_Name:
ygiC

# Preceding_Gene:
ygiB

# Copy Number:
Unknown

# RNA_Copy_No.: Log phase (2max): 0.31 Stationary phase (2max): 0.17 Ref: Nature Biotech., 18, 1262-68, Dec. 2000 (PMID=11101804)
# Genbank_ID_(DNA):
G1789416

Desired output:
# Gene_Name:
ygiC
# Copy Number:
Unknown

This block of information repeats for every protein (around 100 times), and I want to store all the extracted fields in a new file. What should I do? Should I use pattern matching? Thank you.

A possible Unix solution

grep -A1 '^# Gene_Name\|^# Copy Number' input.file > results.file

Please edit + format your question.

9.5 years ago
nterhoeven ▴ 120

Since you asked for a solution in perl:

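#!/usr/bin/perl
use strict;
use warnings;

# The path to the input file is the first command-line argument.
my $file = shift(@ARGV) or die "Usage: $0 input.file\n";

open(my $fh, '<', $file) or die "Cannot open $file: $!";
while (my $line = <$fh>)
{
    # Print the header line and the value on the line that follows it.
    if ($line =~ /^# Gene_Name:/ || $line =~ /^# Copy Number:/)
    {
        print $line;
        my $entry = <$fh>;
        print $entry if defined $entry;   # guard against a header on the last line
    }
}
close($fh) or die $!;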

Thank you Nterhoeven

How do we push the values we get into an Informix database?

I'm new to perl. Any help is appreciated.

9.5 years ago

It seems that your data come from http://ccdb.wishartlab.com/CCDB/cgi-bin/getecards_all.cgi

Your snippet doesn't show that some records can span more than one line. Here is an awk solution:

curl -s "http://ccdb.wishartlab.com/CCDB/cgi-bin/getecards_all.cgi" |\
awk '/^#/ {IN=0;} /^# Gene_Name:$/ {IN=1; printf("\n");next;} /^# Copy Number:$/ {IN=1;printf("\t");next;} {if(IN==1) printf("%s",$0);}'

Beware: some records contain more than one gene name.

ygiC    Unknown
yraM    Unknown
napH    Unknown
yhfX    Unknown
cstA    Unknown
infB or ssyG    1150 Molecules/Cell In: Glucose minimal mediaRef: Neidhardt et al., Encyclopedia of E. coli and Salmonella, 1997
ugd or pmrE or udg    Unknown
nohA    Unknown
yegJ    Not Available
rapA or hepA    Unknown
yiaW    Unknown
yedR    Not Available
maeB    Unknown
narU    Unknown
rpsM    18,700 (rich media)Ref: Goodsell, D.S., Trends Biochem. Sci., 16, 203-206, Jun. 1991 (PMID=1891800)
atpF or papF or uncF    Unknown
arcC    Not Available
basR or pmrA    Unknown
yahM    Unknown
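
Since the question asked for Perl and for the result to be stored in a new file, the same state-machine idea as the awk one-liner can be sketched in Perl. This is only a minimal sketch under a few assumptions: the sub name `extract_records` and the file names in the usage comment are made up for illustration, headers are assumed to sit alone on their line exactly as in the sample, and, unlike the awk version, continuation lines of a value are joined with a single space rather than concatenated directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect one "gene<TAB>copy number" row per record from an open filehandle.
# (extract_records is a hypothetical helper name, not part of any library.)
sub extract_records {
    my ($fh) = @_;
    my (@rows, $field, %rec);

    # Append the current record, if any, to @rows and reset it.
    my $flush = sub {
        if (defined $rec{gene}) {
            my $copy = defined $rec{copy} ? $rec{copy} : '';
            push @rows, "$rec{gene}\t$copy";
        }
        %rec = ();
    };

    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\r$//;                  # tolerate Windows line endings
        if ($line =~ /^#/) {
            # Every header line ends the field that was being collected.
            if ($line =~ /^# Gene_Name:\s*$/) {
                $flush->();                # a new gene name starts a new record
                $field = 'gene';
            } elsif ($line =~ /^# Copy Number:\s*$/) {
                $field = 'copy';
            } else {
                $field = undef;            # some other header: ignore its value
            }
            next;
        }
        # Skip blank lines and values that belong to other headers.
        next unless defined $field && $line =~ /\S/;
        # Join multi-line values with a space (assumption; awk joins with nothing).
        $rec{$field} = defined $rec{$field} ? "$rec{$field} $line" : $line;
    }
    $flush->();
    return @rows;
}

# Usage (placeholder file names): perl extract.pl input.file results.tsv
if (@ARGV == 2) {
    my ($in, $out) = @ARGV;
    open(my $in_fh,  '<', $in)  or die "Cannot open $in: $!";
    open(my $out_fh, '>', $out) or die "Cannot write $out: $!";
    print {$out_fh} "$_\n" for extract_records($in_fh);
    close($in_fh);
    close($out_fh) or die "Cannot close $out: $!";
}
```

Keeping the parsing in a sub that takes a filehandle means the same code can read from a file, from STDIN, or from an in-memory string, which makes it easy to try on a small sample before running it on the full download.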

Right, Pierre, I am using the CCDB database. Thank you for your help.
