How to filter evidence modeler gene prediction output
1
0
Entering edit mode
3.9 years ago

Hello, I have annotated one of the plant genome using Evidence modeller pipeline. Now I have raw output from evidence modeller. Can anybody kindly suggest how can I filter the genes which have score less than 1000.

I have EVM output file as below

!! Predictions spanning range 3415 - 137363 [R1]
# EVM prediction: Mode:STANDARD S-ratio: 2.52 11043-11477 orient(-) score(1246.00) noncoding_equivalent(495.34) raw_noncoding(495.34) offset(0.00) 
11477   11043   single- 4   6   {SNAP_model.scaffold6_size143996-snap.2;SNAP}

# EVM prediction: Mode:STANDARD S-ratio: 1.00 20968-21183 orient(+) score(432.00) noncoding_equivalent(432.00) raw_noncoding(432.00) offset(0.00) 
20968   21183   single+ 1   3   {GeneID_mRNA_scaffold6_size143996_6;GeneID}

# EVM prediction: Mode:STANDARD S-ratio: 1.00 21940-22362 orient(-) score(846.00) noncoding_equivalent(846.00) raw_noncoding(846.00) offset(0.00) 
22362   21940   single- 4   6   {GeneID_mRNA_scaffold6_size143996_7;GeneID}

# EVM prediction: Mode:STANDARD S-ratio: 12.32 33363-34677 orient(+) score(21500.00) noncoding_equivalent(1745.00) raw_noncoding(2183.00) offset(438.00) 
33363   33495   initial+    1   1   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus}
33496   33611   INTRON          {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33612   33741   internal+   2   2   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33742   33842   INTRON          {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33843   34677   terminal+   3   3   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus}

# EVM prediction: Mode:STANDARD S-ratio: 14.24 40247-42061 orient(-) score(32439.00) noncoding_equivalent(2277.99) raw_noncoding(2277.99) offset(0.00) 
42061   40247   single- 4   6   {Augustus_model.g40.t1;Augustus},{SNAP_model.scaffold6_size143996-snap.4;SNAP}

# EVM prediction: Mode:STANDARD S-ratio: 2.41 46394-48564 orient(-) score(9677.00) noncoding_equivalent(4012.03) raw_noncoding(7194.39) offset(3182.36) 
46879   46394   terminal-   4   6   {GeneID_mRNA_scaffold6_size143996_13;GeneID}
47512   46880   INTRON          {GeneID_mRNA_scaffold6_size143996_13;GeneID}
48256   47513   internal-   4   6   {GeneID_mRNA_scaffold6_size143996_13;GeneID}
48366   48257   INTRON          {Augustus_model.g41.t1;Augustus}
48429   48367   internal-   4   6   {Augustus_model.g41.t1;Augustus}
48510   48430   INTRON          {Augustus_model.g41.t1;Augustus}
48564   48511   initial-    4   6   {Augustus_model.g41.t1;Augustus}

# EVM prediction: Mode:STANDARD S-ratio: 1.33 59853-60205 orient(+) score(730.00) noncoding_equivalent(549.66) raw_noncoding(865.75) offset(316.09) 
59853   59913   initial+    1   1   {Augustus_model.g43.t1;Augustus}
59914   60011   INTRON          {Augustus_model.g43.t1;Augustus}
60012   60205   terminal+   2   3   {GeneID_mRNA_scaffold6_size143996_14;GeneID}

I want to filter the records with score less than 1000.

Kindly help me to filter the records

Thanks in advance

gene annotation awk python shell • 1.6k views
ADD COMMENT
0
Entering edit mode

out of curiosity, how did you get to the "1000" threshold?

ADD REPLY
0
Entering edit mode

From one of the research paper

ADD REPLY
0
Entering edit mode
3.9 years ago
Joe 21k

Here's an approach you can take in python:

import sys, re
from itertools import groupby


regex = re.compile(r"score\(\d+\.\d+\)")

with open(sys.argv[1], "r") as evm:
    groups = [list(group) for key, group in groupby(evm, lambda line: line.startswith('# EVM prediction:'))]
    for i, j in zip(groups[1::2], groups[2::2]):
        score = re.search(regex, i[0]).group(0)
        score_f = float(re.search("\d+\.\d+", score).group(0))
        if score_f < 1000:
            print(i, j)

If your file above were called test.evm, run this as: python scriptname.py test.evm.

The output I get is:

['# EVM prediction: Mode:STANDARD S-ratio: 1.00 20968-21183 orient(+) score(432.00) noncoding_equivalent(432.00) raw_noncoding(432.00) offset(0.00) \n'] ['20968   21183   single+ 1   3   {GeneID_mRNA_scaffold6_size143996_6;GeneID}\n', '\n']
['# EVM prediction: Mode:STANDARD S-ratio: 1.00 21940-22362 orient(-) score(846.00) noncoding_equivalent(846.00) raw_noncoding(846.00) offset(0.00) \n'] ['22362   21940   single- 4   6   {GeneID_mRNA_scaffold6_size143996_7;GeneID}\n', '\n']
['# EVM prediction: Mode:STANDARD S-ratio: 1.33 59853-60205 orient(+) score(730.00) noncoding_equivalent(549.66) raw_noncoding(865.75) offset(316.09) \n'] ['59853   59913   initial+    1   1   {Augustus_model.g43.t1;Augustus}\n', '59914   60011   INTRON          {Augustus_model.g43.t1;Augustus}\n', '60012   60205   terminal+   2   3   {GeneID_mRNA_scaffold6_size143996_14;GeneID}\n']

You will need to do your own subsequent formatting and tidying up as you haven't specified what output format you require.

ADD COMMENT
0
Entering edit mode

Hii, Thanks. The code is working. I wanted the output format as same as input format

ADD REPLY
0
Entering edit mode

The lists that are output are of exactly the same format as the input lines (including newlines etc) so it should be a very easy task to remake the input format - that will be a good coding exercise for you.

As a hint, the groups[0] object will contain the first line of the file too so you can put that back in as necessary.

ADD REPLY

Login before adding your answer.

Traffic: 2493 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6