How To Extract Part Of A Line In File ?
6
4
Entering edit mode
13.7 years ago
Divya ▴ 70

Hi everyone.!!! I want an idea how to get only molecular function and biological process in the following line I tried a lot but unable to do so. Thanks for your kind help.

INPUT

chr11    RAP3_rep    mRNA    29460681    29463782    .    + . ID=Os11t0674400-00;Name=Os11t0674400-00;GO=Molecular Function: ATP binding (GO:0005524),Biological Process: apoptosis (GO:0006915);ID_converter=Os11g0674400;InterPro=NB-ARC (IPR002182);Link_to=NB-ARC (Plant Gene Family Database);Locus_id=Os11g0674400;Note=Similar to NB-ARC domain containing protein%2C expressed.;ORF_evidence=Q2QZS0 (UniProt);Transcript_evidence=ab initio prediction;Sequence_download=Os11t0674400-00
parsing gene • 6.7k views
ADD COMMENT
2
Entering edit mode

Well, even with Perl there's probably more than one way of doing it, isn't there ;-)

ADD REPLY
2
Entering edit mode

My point is there's no reason to prefer Perl over [?]

ADD REPLY
1
Entering edit mode

Why is this tagged Perl? It's just one way of doing it.

ADD REPLY
0
Entering edit mode

My point is there's no reason to prefer Perl over [?]insert your favourite language here[?]

ADD REPLY
0
Entering edit mode

My point is there's no reason to prefer Perl over >insert your favourite language here>

ADD REPLY
9
Entering edit mode
13.7 years ago

You should learn some basic UNIX commands:

echo '(yourline)' |\
tr ";" "\n" |\
egrep "^GO=" |\
cut -c 4- |\
tr "," "\n"

Molecular Function: ATP binding (GO:0005524)
Biological Process: apoptosis (GO:0006915)
ADD COMMENT
5
Entering edit mode

Yep - a glance at "Ad Hoc Data Analysis From The Unix Command Line" should be required reading for everyone:

http://en.wikibooks.org/wiki/Ad_Hoc_Data_Analysis_From_The_Unix_Command_Line

ADD REPLY
2
Entering edit mode

"There's more than one way to do it"

ADD REPLY
2
Entering edit mode

lh3: It depends on how you use it. If in combination with another command-line tool, Bash seems to be the more natural way to do it. Perl is more flexible, yes, but in that case also more complicated and an additional dependency (although it should be available everywhere).

ADD REPLY
2
Entering edit mode

note than grep is actually more powerful than egrep, which for having to maintain retro-compatibility with older versions of grep, has less options.

ADD REPLY
1
Entering edit mode

Wouldn't just replacing echo '(yourline)' by cat 'yourfile' solve the problem lh3?

ADD REPLY
0
Entering edit mode

If divya wants to process this line, (s)he must have many more lines like this. The perl solution provided by Chris is definitely much more appropriate.

ADD REPLY
0
Entering edit mode

But there are good ways and bad ways.

ADD REPLY
0
Entering edit mode

Downvotes?!? (?) :|

ADD REPLY
0
Entering edit mode

But there are good and bad ways. Perl is more appropriate for the task here.

ADD REPLY
0
Entering edit mode

Of course I know such simple tricks. But what if I also want to print the gene name or write the output in one line? What if somewhere down the file the format is a little different? Yes, I can still use unix commands to do that, but perl is much more flexible.

ADD REPLY
0
Entering edit mode

Wow, I never knew about the "join" command in linux. Enlightenment +1. Thanks Mr. Swan.

ADD REPLY
7
Entering edit mode
13.7 years ago

Or in Perl:

perl -ne 'if ($_ =~/Molecular Function: ([^,]+)/){print $1 . "\n"};if ($_ =~/Biological Process: ([^;]+);/){print $1 . "\n"}' <infile
ADD COMMENT
0
Entering edit mode

I would write 'print "$1n" if /Molecular Function: ([^,]+)/' instead, which is simpler.

ADD REPLY
5
Entering edit mode
13.7 years ago

Another BASH solution (IMHO there's no need a scripting language here), as Molecular Function and Biological Process seem to be grouped (i.e., separated by , where the group is separated by ; from others):

echo $file | egrep -o "Molecular[^;]+"

Which would be csv format already, or, if you want it tab-delimited

echo $file | egrep -o "Molecular[^;]+" | tr "," "\t"

But it would be so much nicer if grep supported lookaheads ;-)

ADD COMMENT
1
Entering edit mode

While good regex on a command line saves a lot time, it is perfectly incomprehensible for beginners and often hard to modify by cut and paste method. So what you use depends not only on what is faster to run/write but also on the target audience. Verbose and maybe even slower solution in some easy & popular scripting language are a way to go if you got not a programer IMHO.

ADD REPLY
0
Entering edit mode

I agree with you if the scripting solution can indeed be easier to understand than the regex alone, as each solution has its advantages and drawbacks.

ADD REPLY
2
Entering edit mode
13.7 years ago
Darked89 4.7k

You may try this one:

#!/usr/bin/env python

"""
./this_script.py your_gff_file
output now: RAP3_rep Name=Os11t0674400-00

"""

import sys

input_fn = sys.argv[1]

for line in open(input_fn).readlines():
  splited_line       = line.split()
  long_record_split  = splited_line[8].split(";")
  print splited_line[1], long_record_split[1]

just replace and put whichever records you wish to have. These will be separated by tabs

ADD COMMENT
2
Entering edit mode
13.7 years ago

Looks like a job for awk.

$awk -F'\t' '{print $9}'  awkexample.txt | awk -F';' '{print $3,$4}'

That would print the ninth column of the line - assuming that the spaces in the original file are tabs. That could then be passed to awk again to separate out the Molecular Function and other terms.

I return this after running the command on the above example (random whitespace in above example converted to tabs and named awkexample.txt).

GO=Molecular Function: ATP binding (GO:0005524),Biological Process: apoptosis (GO:0006915) ID_converter=Os11g0674400

You can call awk from perl, python etc.. and therefore work this into a script if you need to.

ADD COMMENT
0
Entering edit mode

This assumes that the function and process appear always at the same position. It also does not separate the data from its meta-data annotation "Molecular Function" and "Biological Process".

ADD REPLY
0
Entering edit mode

Please explain how you would ever want to call awk from Perl or Python, that makes no sense to me ;-)

ADD REPLY
0
Entering edit mode

Well, you could call perl from python, and then, since perl is more integrated with the OS, call awk from there. See? Makes perfect sense! :P The times where I end up doing something similar from python is mostly for creating and quick editing files. Probably through laziness of coding a few more lines. Of course, if awk can do it, so can python and perl...

ADD REPLY
0
Entering edit mode

but function and process is not in the same position always

ADD REPLY
0
Entering edit mode

The python library for awk is called gawk - I think. Alternatively you can make a sys call from either python or perl. If you need to strip the meta-data annotation then pipe the last part into sed and remove 'GO=Molecular Function:' and 'Biological Process:' I'll get back to you on the how to cope with the function and process being in different positions - but a larger example might help.

ADD REPLY
0
Entering edit mode
13.7 years ago
Malcolm.Cook ★ 1.5k

Module Bio::DB::SeqFeature from BioPerl can extract the GO attributes and other GFF cruft as you desire, like this:

[?]

Which generates: [?] Os11t0674400-00 Molecular Function: ATP binding (GO:0005524) Biological Process: apoptosis (GO:0006915) [?]

Considerations:

  • This approach requires input to be well-formed GFF. The example input line was not since it had no tabs. But that was probably an artifact of cut&paste into a web form...

  • It may not be as performant as sed, etc, but may make fewer assumptions about the names and or orders in which the GO terms appear within the GFF

  • If you don't have BioPerl installed already, it takes more talent to install than learning to write the sed to hack the line in the first place.

  • BioPerl has lots of other goodies if you're in the goody business

ADD COMMENT

Login before adding your answer.

Traffic: 1226 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6