Question

How To Extract Part Of A Line In File ?

4

Entering edit mode

13.7 years ago

Divya ▴ 70

Hi everyone.!!! I want an idea how to get only molecular function and biological process in the following line I tried a lot but unable to do so. Thanks for your kind help.

INPUT

chr11    RAP3_rep    mRNA    29460681    29463782    .    + . ID=Os11t0674400-00;Name=Os11t0674400-00;GO=Molecular Function: ATP binding (GO:0005524),Biological Process: apoptosis (GO:0006915);ID_converter=Os11g0674400;InterPro=NB-ARC (IPR002182);Link_to=NB-ARC (Plant Gene Family Database);Locus_id=Os11g0674400;Note=Similar to NB-ARC domain containing protein%2C expressed.;ORF_evidence=Q2QZS0 (UniProt);Transcript_evidence=ab initio prediction;Sequence_download=Os11t0674400-00

parsing gene • 6.7k views

ADD COMMENT • link updated 13.7 years ago by Niallhaslam 2.3k • written 13.7 years ago by Divya ▴ 70

2

Entering edit mode

Well, even with Perl there's probably more than one way of doing it, isn't there ;-)

ADD REPLY • link 13.7 years ago by Rvosa ▴ 580

2

Entering edit mode

My point is there's no reason to prefer Perl over [?]

ADD REPLY • link 13.7 years ago by Michael Schubert ★ 7.1k

1

Entering edit mode

Why is this tagged Perl? It's just one way of doing it.

ADD REPLY • link 13.7 years ago by Michael Schubert ★ 7.1k

0

Entering edit mode

My point is there's no reason to prefer Perl over [?]insert your favourite language here[?]

ADD REPLY • link 13.7 years ago by Michael Schubert ★ 7.1k

0

Entering edit mode

My point is there's no reason to prefer Perl over >insert your favourite language here>

ADD REPLY • link 13.7 years ago by Michael Schubert ★ 7.1k

Ram · Answer 1 · 2011-03-23

9

Entering edit mode

13.7 years ago

Pierre Lindenbaum 164k

You should learn some basic UNIX commands:

echo '(yourline)' |\
tr ";" "\n" |\
egrep "^GO=" |\
cut -c 4- |\
tr "," "\n"

Molecular Function: ATP binding (GO:0005524)
Biological Process: apoptosis (GO:0006915)

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 13.7 years ago by Pierre Lindenbaum 164k

5

Entering edit mode

Yep - a glance at "Ad Hoc Data Analysis From The Unix Command Line" should be required reading for everyone:

http://en.wikibooks.org/wiki/Ad_Hoc_Data_Analysis_From_The_Unix_Command_Line

ADD REPLY • link 13.7 years ago by User 59 13k

2

Entering edit mode

"There's more than one way to do it"

ADD REPLY • link 13.7 years ago by Pierre Lindenbaum 164k

2

Entering edit mode

lh3: It depends on how you use it. If in combination with another command-line tool, Bash seems to be the more natural way to do it. Perl is more flexible, yes, but in that case also more complicated and an additional dependency (although it should be available everywhere).

ADD REPLY • link 13.7 years ago by Michael Schubert ★ 7.1k

2

Entering edit mode

note than grep is actually more powerful than egrep, which for having to maintain retro-compatibility with older versions of grep, has less options.

ADD REPLY • link 13.7 years ago by Giovanni M Dall'Olio 28k

1

Entering edit mode

Wouldn't just replacing echo '(yourline)' by cat 'yourfile' solve the problem lh3?

ADD REPLY • link 13.7 years ago by Eric Normandeau 11k

0

Entering edit mode

If divya wants to process this line, (s)he must have many more lines like this. The perl solution provided by Chris is definitely much more appropriate.

ADD REPLY • link 13.7 years ago by lh3 33k

0

Entering edit mode

But there are good ways and bad ways.

ADD REPLY • link 13.7 years ago by lh3 33k

0

Entering edit mode

Downvotes?!? (?) :|

ADD REPLY • link 13.7 years ago by Eric Normandeau 11k

0

Entering edit mode

But there are good and bad ways. Perl is more appropriate for the task here.

ADD REPLY • link 13.7 years ago by lh3 33k

0

Entering edit mode

Of course I know such simple tricks. But what if I also want to print the gene name or write the output in one line? What if somewhere down the file the format is a little different? Yes, I can still use unix commands to do that, but perl is much more flexible.

ADD REPLY • link 13.7 years ago by lh3 33k

0

Entering edit mode

Wow, I never knew about the "join" command in linux. Enlightenment +1. Thanks Mr. Swan.

ADD REPLY • link 13.7 years ago by Sequencegeek ▴ 740

Ram · Answer 2 · 2011-03-23

7

Entering edit mode

13.7 years ago

Chris Miller 22k

Or in Perl:

perl -ne 'if ($_ =~/Molecular Function: ([^,]+)/){print $1 . "\n"};if ($_ =~/Biological Process: ([^;]+);/){print $1 . "\n"}' <infile

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 13.7 years ago by Chris Miller 22k

0

Entering edit mode

I would write 'print "$1n" if /Molecular Function: ([^,]+)/' instead, which is simpler.

ADD REPLY • link 13.7 years ago by lh3 33k

Ram · Answer 3 · 2011-03-23

5

Entering edit mode

13.7 years ago

Michael Schubert ★ 7.1k

Another BASH solution (IMHO there's no need a scripting language here), as Molecular Function and Biological Process seem to be grouped (i.e., separated by , where the group is separated by ; from others):

echo $file | egrep -o "Molecular[^;]+"

Which would be csv format already, or, if you want it tab-delimited

echo $file | egrep -o "Molecular[^;]+" | tr "," "\t"

But it would be so much nicer if grep supported lookaheads ;-)

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 13.7 years ago by Michael Schubert ★ 7.1k

1

Entering edit mode

While good regex on a command line saves a lot time, it is perfectly incomprehensible for beginners and often hard to modify by cut and paste method. So what you use depends not only on what is faster to run/write but also on the target audience. Verbose and maybe even slower solution in some easy & popular scripting language are a way to go if you got not a programer IMHO.

ADD REPLY • link 13.7 years ago by Darked89 4.7k

0

Entering edit mode

I agree with you if the scripting solution can indeed be easier to understand than the regex alone, as each solution has its advantages and drawbacks.

ADD REPLY • link 13.7 years ago by Michael Schubert ★ 7.1k

Ram · Answer 4 · 2011-03-23

You may try this one:

#!/usr/bin/env python

"""
./this_script.py your_gff_file
output now: RAP3_rep Name=Os11t0674400-00

"""

import sys

input_fn = sys.argv[1]

for line in open(input_fn).readlines():
  splited_line       = line.split()
  long_record_split  = splited_line[8].split(";")
  print splited_line[1], long_record_split[1]

just replace and put whichever records you wish to have. These will be separated by tabs

Ram · Answer 5 · 2011-03-24

2

Entering edit mode

13.7 years ago

Niallhaslam 2.3k

Looks like a job for awk.

$awk -F'\t' '{print $9}'  awkexample.txt | awk -F';' '{print $3,$4}'

That would print the ninth column of the line - assuming that the spaces in the original file are tabs. That could then be passed to awk again to separate out the Molecular Function and other terms.

I return this after running the command on the above example (random whitespace in above example converted to tabs and named awkexample.txt).

GO=Molecular Function: ATP binding (GO:0005524),Biological Process: apoptosis (GO:0006915) ID_converter=Os11g0674400

You can call awk from perl, python etc.. and therefore work this into a script if you need to.

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 13.7 years ago by Niallhaslam 2.3k

0

Entering edit mode

This assumes that the function and process appear always at the same position. It also does not separate the data from its meta-data annotation "Molecular Function" and "Biological Process".

ADD REPLY • link 13.7 years ago by Joachim ★ 2.9k

0

Entering edit mode

Please explain how you would ever want to call awk from Perl or Python, that makes no sense to me ;-)

ADD REPLY • link 13.7 years ago by Michael Schubert ★ 7.1k

0

Entering edit mode

Well, you could call perl from python, and then, since perl is more integrated with the OS, call awk from there. See? Makes perfect sense! :P The times where I end up doing something similar from python is mostly for creating and quick editing files. Probably through laziness of coding a few more lines. Of course, if awk can do it, so can python and perl...

ADD REPLY • link 13.7 years ago by Eric Normandeau 11k

0

Entering edit mode

but function and process is not in the same position always

ADD REPLY • link 13.7 years ago by Divya ▴ 70

0

Entering edit mode

The python library for awk is called gawk - I think. Alternatively you can make a sys call from either python or perl. If you need to strip the meta-data annotation then pipe the last part into sed and remove 'GO=Molecular Function:' and 'Biological Process:' I'll get back to you on the how to cope with the function and process being in different positions - but a larger example might help.

ADD REPLY • link 13.7 years ago by Niallhaslam 2.3k

score 0 · Answer 6 · 2011-03-24

Module Bio::DB::SeqFeature from BioPerl can extract the GO attributes and other GFF cruft as you desire, like this:

[?]

Which generates: [?] Os11t0674400-00 Molecular Function: ATP binding (GO:0005524) Biological Process: apoptosis (GO:0006915) [?]

Considerations:

This approach requires input to be well-formed GFF. The example input line was not since it had no tabs. But that was probably an artifact of cut&paste into a web form...
It may not be as performant as sed, etc, but may make fewer assumptions about the names and or orders in which the GO terms appear within the GFF
If you don't have BioPerl installed already, it takes more talent to install than learning to write the sed to hack the line in the first place.
BioPerl has lots of other goodies if you're in the goody business