Question

How to filter EVidenceModeler gene prediction output

3.9 years ago

santhoshhegde278 • 0

Hello, I have annotated a plant genome using the EVidenceModeler (EVM) pipeline, and I now have the raw EVM output. Can anybody suggest how I can filter the genes that have a score of less than 1000?

My EVM output file looks like this:

!! Predictions spanning range 3415 - 137363 [R1]
# EVM prediction: Mode:STANDARD S-ratio: 2.52 11043-11477 orient(-) score(1246.00) noncoding_equivalent(495.34) raw_noncoding(495.34) offset(0.00) 
11477   11043   single- 4   6   {SNAP_model.scaffold6_size143996-snap.2;SNAP}

# EVM prediction: Mode:STANDARD S-ratio: 1.00 20968-21183 orient(+) score(432.00) noncoding_equivalent(432.00) raw_noncoding(432.00) offset(0.00) 
20968   21183   single+ 1   3   {GeneID_mRNA_scaffold6_size143996_6;GeneID}

# EVM prediction: Mode:STANDARD S-ratio: 1.00 21940-22362 orient(-) score(846.00) noncoding_equivalent(846.00) raw_noncoding(846.00) offset(0.00) 
22362   21940   single- 4   6   {GeneID_mRNA_scaffold6_size143996_7;GeneID}

# EVM prediction: Mode:STANDARD S-ratio: 12.32 33363-34677 orient(+) score(21500.00) noncoding_equivalent(1745.00) raw_noncoding(2183.00) offset(438.00) 
33363   33495   initial+    1   1   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus}
33496   33611   INTRON          {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33612   33741   internal+   2   2   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33742   33842   INTRON          {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33843   34677   terminal+   3   3   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus}

# EVM prediction: Mode:STANDARD S-ratio: 14.24 40247-42061 orient(-) score(32439.00) noncoding_equivalent(2277.99) raw_noncoding(2277.99) offset(0.00) 
42061   40247   single- 4   6   {Augustus_model.g40.t1;Augustus},{SNAP_model.scaffold6_size143996-snap.4;SNAP}

# EVM prediction: Mode:STANDARD S-ratio: 2.41 46394-48564 orient(-) score(9677.00) noncoding_equivalent(4012.03) raw_noncoding(7194.39) offset(3182.36) 
46879   46394   terminal-   4   6   {GeneID_mRNA_scaffold6_size143996_13;GeneID}
47512   46880   INTRON          {GeneID_mRNA_scaffold6_size143996_13;GeneID}
48256   47513   internal-   4   6   {GeneID_mRNA_scaffold6_size143996_13;GeneID}
48366   48257   INTRON          {Augustus_model.g41.t1;Augustus}
48429   48367   internal-   4   6   {Augustus_model.g41.t1;Augustus}
48510   48430   INTRON          {Augustus_model.g41.t1;Augustus}
48564   48511   initial-    4   6   {Augustus_model.g41.t1;Augustus}

# EVM prediction: Mode:STANDARD S-ratio: 1.33 59853-60205 orient(+) score(730.00) noncoding_equivalent(549.66) raw_noncoding(865.75) offset(316.09) 
59853   59913   initial+    1   1   {Augustus_model.g43.t1;Augustus}
59914   60011   INTRON          {Augustus_model.g43.t1;Augustus}
60012   60205   terminal+   2   3   {GeneID_mRNA_scaffold6_size143996_14;GeneID}

I want to filter the records with score less than 1000.

Kindly help me to filter the records

Thanks in advance

gene annotation awk python shell • 1.6k views

updated 3.9 years ago by Joe 21k • written 3.9 years ago by santhoshhegde278 • 0

Out of curiosity, how did you arrive at the "1000" threshold?

3.9 years ago by lieven.sterck 15k

From a research paper.

3.9 years ago by santhoshhegde278 • 0

score 0 · Answer 1 · 2021-01-11

Here's an approach you can take in Python:

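import re
import sys
from itertools import groupby

# Capture the numeric value inside score(...) on an EVM header line.
score_re = re.compile(r"score\((\d+\.\d+)\)")

with open(sys.argv[1], "r") as evm:
    # Split the file into alternating runs of header and non-header lines.
    groups = [list(group) for _, group in groupby(evm, lambda line: line.startswith("# EVM prediction:"))]
    # This assumes the file starts with a non-header run (the "!! Predictions
    # spanning range" line), so headers sit at odd indices and bodies at even ones.
    for header, body in zip(groups[1::2], groups[2::2]):
        score = float(score_re.search(header[0]).group(1))
        if score < 1000:
            print(header, body)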

If your file above were called test.evm, you would run this as: python scriptname.py test.evm

The output I get, which lists the records scoring below 1000, is:

['# EVM prediction: Mode:STANDARD S-ratio: 1.00 20968-21183 orient(+) score(432.00) noncoding_equivalent(432.00) raw_noncoding(432.00) offset(0.00) \n'] ['20968   21183   single+ 1   3   {GeneID_mRNA_scaffold6_size143996_6;GeneID}\n', '\n']
['# EVM prediction: Mode:STANDARD S-ratio: 1.00 21940-22362 orient(-) score(846.00) noncoding_equivalent(846.00) raw_noncoding(846.00) offset(0.00) \n'] ['22362   21940   single- 4   6   {GeneID_mRNA_scaffold6_size143996_7;GeneID}\n', '\n']
['# EVM prediction: Mode:STANDARD S-ratio: 1.33 59853-60205 orient(+) score(730.00) noncoding_equivalent(549.66) raw_noncoding(865.75) offset(316.09) \n'] ['59853   59913   initial+    1   1   {Augustus_model.g43.t1;Augustus}\n', '59914   60011   INTRON          {Augustus_model.g43.t1;Augustus}\n', '60012   60205   terminal+   2   3   {GeneID_mRNA_scaffold6_size143996_14;GeneID}\n']

You will need to do your own subsequent formatting and tidying up as you haven't specified what output format you require.
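As one possible way of doing that tidying up, here is a minimal sketch that writes the selected records back out in the original EVM layout instead of as Python lists. The function names (iter_records, filter_records) and the keep_below switch are my own, not part of EVM. The switch is there because "filter" could mean either keeping or discarding the low-scoring records. The sketch also assumes that the "!! Predictions spanning range" lines can be dropped from the output, which may not suit every downstream step.

```python
import re
import sys

# Hypothetical helper names; only the "# EVM prediction:" prefix and the
# score(...) field come from the EVM output shown in the question.
HEADER_PREFIX = "# EVM prediction:"
SCORE_RE = re.compile(r"score\((-?\d+(?:\.\d+)?)\)")


def iter_records(lines):
    """Yield (score, record_lines) for each EVM prediction block."""
    score = None
    record = []
    for line in lines:
        if line.startswith(HEADER_PREFIX):
            if record:
                yield score, record
            match = SCORE_RE.search(line)
            if match is None:
                raise ValueError("no score(...) field in header: " + line.strip())
            score = float(match.group(1))
            record = [line]
        elif record and line.strip() and not line.startswith("!!"):
            # Exon/intron line belonging to the current prediction.
            record.append(line)
    if record:
        yield score, record


def filter_records(lines, threshold=1000.0, keep_below=True):
    """Return records as text; keep_below=False keeps scores >= threshold."""
    kept = []
    for score, record in iter_records(lines):
        if (score < threshold) == keep_below:
            kept.append("".join(record))
    return kept


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for text in filter_records(handle):
            # Each record already ends with a newline, so print() leaves
            # one blank line between records, as in the input.
            print(text)
```

Saved under the same script name, this would be run the same way (python scriptname.py test.evm). To discard the low-scoring predictions and keep the rest, call filter_records(handle, keep_below=False) instead.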