Subtractive Genomics Analysis
6.2 years ago


This is what my dataset looks like:

  Query= sp|Q835H3|MUTS2_ENTFA Endonuclease MutS2 OS=Enterococcus faecalis
    (strain ATCC 700802 / V583) OX=226185 GN=mutS2 PE=3 SV=1

    Length=788
                                                                          Score     E
    Sequences producing significant alignments:                          (Bits)  Value

      sp|O15457|MSH4_HUMAN MutS protein homolog 4 OS=Homo sapiens OX=...  109     8e-24
      tr|A8K1E1|A8K1E1_HUMAN cDNA FLJ75589, highly similar to Homo sa...  107     4e-23
      sp|P20585|MSH3_HUMAN DNA mismatch repair protein Msh3 OS=Homo s...  107     4e-23
      tr|B4DSB9|B4DSB9_HUMAN cDNA FLJ51069, highly similar to DNA mis...  102     1e-21
      tr|B4DL39|B4DL39_HUMAN cDNA FLJ57316, highly similar to DNA mis...  102     1e-21
      tr|A0A2R8YFH0|A0A2R8YFH0_HUMAN DNA mismatch repair protein OS=H...  101     3e-21
      tr|A0A2R8Y6P0|A0A2R8Y6P0_HUMAN DNA mismatch repair protein OS=H...  101     3e-21
      tr|B4DN49|B4DN49_HUMAN DNA mismatch repair protein OS=Homo sapi...  101     3e-21
      tr|E9PHA6|E9PHA6_HUMAN DNA mismatch repair protein OS=Homo sapi...  101     3e-21
      sp|P43246|MSH2_HUMAN DNA mismatch repair protein Msh2 OS=Homo s...  101     3e-21
      tr|Q53GS1|Q53GS1_HUMAN DNA mismatch repair protein (Fragment) O...  101     3e-21
      tr|A0A2R8YG02|A0A2R8YG02_HUMAN DNA mismatch repair protein OS=H...  101     3e-21
      tr|Q53FK0|Q53FK0_HUMAN DNA mismatch repair protein (Fragment) O...  100     6e-21
      tr|B4DZX3|B4DZX3_HUMAN cDNA FLJ54211, highly similar to MutS pr...  90.1    5e-18
      tr|A0A0G2JJ70|A0A0G2JJ70_HUMAN MSH5-SAPCD1 readthrough (NMD can...  89.7    5e-18
      tr|A2ABF0|A2ABF0_HUMAN cDNA FLJ39914 fis, clone SPLEN2018732, h...  89.7    5e-18
      tr|Q9UFG2|Q9UFG2_HUMAN Uncharacterized protein DKFZp434C1615 (F...  87.0    6e-18
      tr|H0YF11|H0YF11_HUMAN MSH5-SAPCD1 readthrough (NMD candidate) ...  87.0    6e-18

    > sp|O15457|MSH4_HUMAN MutS protein homolog 4 OS=Homo sapiens OX=9606 
    GN=MSH4 PE=1 SV=2
    Length=936

     Score = 109 bits (273),  Expect = 8e-24, Method: Compositional matrix adjust.
     Identities = 71/228 (31%), Positives = 118/228 (52%), Gaps = 8/228 (4%)



    > tr|Q0QEN7|Q0QEN7_HUMAN ATP synthase subunit beta (Fragment) OS=Homo 
    sapiens OX=9606 GN=ATP5B PE=2 SV=1
    Length=445

     Score = 590 bits (1522),  Expect = 0.0, Method: Compositional matrix adjust.
     Identities = 300/448 (67%), Positives = 357/448 (80%), Gaps = 12/448 (3%)
    --
    Query  423  SYVPVAETVRGFKEILEGKHDNLPEEAF  450
                  VP+ ET++GF++IL G++D+LPE+AF
    Sbjct  416  KLVPLKETIKGFQQILAGEYDHLPEQAF  443


    > tr|H0YH81|H0YH81_HUMAN ATP synthase subunit beta (Fragment) OS=Homo 
    sapiens OX=9606 GN=ATP5F1B PE=1 SV=1
    Length=362

     Score = 459 bits (1182),  Expect = 1e-158, Method: Compositional matrix adjust.
     Identities = 228/327 (70%), Positives = 265/327 (81%), Gaps = 7/327 (2%)
    --
    Query  342  DPLASSSSALAPEIVGEEHYEVATEVQ  368
                DPL S+S  + P IVG EHY+VA  VQ
    Sbjct  336  DPLDSTSRIMDPNIVGSEHYDVARGVQ  362


    > tr|F8W0P7|F8W0P7_HUMAN ATP synthase subunit beta, mitochondrial 
    (Fragment) OS=Homo sapiens OX=9606 GN=ATP5F1B PE=1 SV=2
    Length=270

     Score = 281 bits (720),  Expect = 1e-90, Method: Compositional matrix adjust.
     Identities = 137/168 (82%), Positives = 151/168 (90%), Gaps = 6/168 (4%)
    --
    Query  265  LGRMPSAVGYQPTLATEMGQLQERITSTKKGSITSIQAIYVPADDYTD  312
                LGR+PSAVGYQPTLAT+MG +QERIT+TKKGSITS+QAIYVPADD TD
    Sbjct  223  LGRIPSAVGYQPTLATDMGTMQERITTTKKGSITSVQAIYVPADDLTD  270





    The output I want is:

    Query= sp|Q835H3|MUTS2_ENTFA Endonuclease MutS2 OS=Enterococcus faecalis
    (strain ATCC 700802 / V583) OX=226185 GN=mutS2 PE=3 SV=1

    > tr|H0YH81|H0YH81_HUMAN ATP synthase subunit beta (Fragment) OS=Homo 
    sapiens OX=9606 GN=ATP5F1B PE=1 SV=1
    Length=362

     Score = 459 bits (1182),  Expect = 1e-158, Method: Compositional matrix adjust.
     Identities = 228/327 (70%), Positives = 265/327 (81%), Gaps = 7/327 (2%)

    > tr|F8W0P7|F8W0P7_HUMAN ATP synthase subunit beta, mitochondrial 
    (Fragment) OS=Homo sapiens OX=9606 GN=ATP5F1B PE=1 SV=2
    Length=270

     Score = 281 bits (720),  Expect = 1e-90, Method: Compositional matrix adjust.
     Identities = 137/168 (82%), Positives = 151/168 (90%), Gaps = 6/168 (4%)

    I want each Query header, followed only by the hits with Identities of 70% or greater.
alignment shell scripting • 1.9k views

What have you tried so far? Please post your current code so people can provide you feedback on that and help you with this question.


I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in your toolbar whenever you compose or edit a post.


Thanks for your help, but unfortunately it's not working. I actually want to keep the header and then continue with your grep command. Could you help me out with this?


Hi there,

Please do not post additional comments or follow-up questions as answers; use ADD REPLY to comment on the posted solution instead. I have reformatted this one for you.


Hello waqarlodhi93,

See https://www.gnu.org/software/grep/manual/grep.html for more information about context line control, and check out the -A parameter.

2.1.5 Context Line Control
Context lines are non-matching lines that are near a matching line. They are output only if one of the following options are used. Regardless of how these options are set, grep never outputs any given line more than once. If the -o (--only-matching) option is specified, these options have no effect and a warning is given upon their use.

-A num
--after-context=num
Print num lines of trailing context after matching lines.

-B num
--before-context=num
Print num lines of leading context before matching lines.
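
To make the two options concrete, here is a minimal sketch on a made-up five-line file (the file content and the variable names tmp, before and after are only for illustration). It assumes a grep that supports -A and -B, such as GNU grep.

```shell
# Build a tiny sample file (hypothetical content, for illustration only).
tmp=$(mktemp)
printf '%s\n' 'line 1' 'line 2' 'MATCH' 'line 4' 'line 5' > "$tmp"

# -B 1 adds one line of leading context, -A 1 one line of trailing context.
before=$(grep -B 1 'MATCH' "$tmp")
after=$(grep -A 1 'MATCH' "$tmp")

printf 'With -B 1:\n%s\n\nWith -A 1:\n%s\n' "$before" "$after"
rm -f "$tmp"
```

With -B 1 the matching line is preceded by "line 2"; with -A 1 it is followed by "line 4".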
6.2 years ago
sacha ★ 2.4k

Use a regular expression with grep to select lines such as Identities = 228/327 (70%) and print the 5 lines before each match (-B 5). 70% or more can be expressed as: ([7-9]\d|100)

grep -P -B5 'Identities = \d+/\d+\s\(([7-9]\d|100)%' your_file.txt
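
The -P option needs a grep built with PCRE support, which not every system has. As a rough check of the pattern, the sketch below rewrites it as an extended regular expression (-E), with [0-9] in place of \d and a literal space in place of \s, and runs it on four Identities lines copied from the question. The temporary file and the variable names are assumptions for the example only.

```shell
# Four Identities lines copied from the question's BLAST output.
sample=$(mktemp)
cat > "$sample" <<'EOF'
 Identities = 71/228 (31%), Positives = 118/228 (52%), Gaps = 8/228 (4%)
 Identities = 300/448 (67%), Positives = 357/448 (80%), Gaps = 12/448 (3%)
 Identities = 228/327 (70%), Positives = 265/327 (81%), Gaps = 7/327 (2%)
 Identities = 137/168 (82%), Positives = 151/168 (90%), Gaps = 6/168 (4%)
EOF

# ERE version of the -P pattern: two digits starting with 7-9, or exactly 100.
pattern='Identities = [0-9]+/[0-9]+ \(([7-9][0-9]|100)%'

matched=$(grep -E "$pattern" "$sample")
count=$(grep -E -c "$pattern" "$sample")

printf '%s\n' "$matched"
printf 'matching lines: %s\n' "$count"
rm -f "$sample"
```

Only the 70% and 82% lines are selected. The 67% line is skipped even though its Positives value is 80%, because the pattern is tied to the text "Identities = ".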


Thanks @sacha, the command you provided is really helpful, but I want something more. For the same dataset shown in my question above, I also want to keep the Query= header in the output, followed only by the hits with Identities of 70% or greater (the desired output is shown in the question).

So, just remove the header (with awk, for instance) and apply my previous command line.

awk '/^>/ {keep=1} keep' test.txt | grep -P -B5 'Identities = \d+/\d+\s\(([7-9]\d|100)%'
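
If the goal is instead to keep the Query= header, as asked in the follow-up, a single awk pass is one possible approach. The following is only a sketch under several assumptions: the report is not indented beyond spaces or tabs, the Query= header ends at the first blank line, each hit starts with a line beginning with ">", and only the first Identities line of each hit is checked. The threshold variable min, the temporary file and the abridged sample are made up for the example.

```shell
# Abridged copy of the question's BLAST report, written to a temporary file.
report=$(mktemp)
cat > "$report" <<'EOF'
Query= sp|Q835H3|MUTS2_ENTFA Endonuclease MutS2 OS=Enterococcus faecalis
(strain ATCC 700802 / V583) OX=226185 GN=mutS2 PE=3 SV=1

Length=788

> sp|O15457|MSH4_HUMAN MutS protein homolog 4 OS=Homo sapiens OX=9606
GN=MSH4 PE=1 SV=2
Length=936

 Score = 109 bits (273),  Expect = 8e-24, Method: Compositional matrix adjust.
 Identities = 71/228 (31%), Positives = 118/228 (52%), Gaps = 8/228 (4%)

> tr|Q0QEN7|Q0QEN7_HUMAN ATP synthase subunit beta (Fragment) OS=Homo
sapiens OX=9606 GN=ATP5B PE=2 SV=1
Length=445

 Score = 590 bits (1522),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 300/448 (67%), Positives = 357/448 (80%), Gaps = 12/448 (3%)

> tr|H0YH81|H0YH81_HUMAN ATP synthase subunit beta (Fragment) OS=Homo
sapiens OX=9606 GN=ATP5F1B PE=1 SV=1
Length=362

 Score = 459 bits (1182),  Expect = 1e-158, Method: Compositional matrix adjust.
 Identities = 228/327 (70%), Positives = 265/327 (81%), Gaps = 7/327 (2%)

Query  342  DPLASSSSALAPEIVGEEHYEVATEVQ  368
            DPL S+S  + P IVG EHY+VA  VQ
Sbjct  336  DPLDSTSRIMDPNIVGSEHYDVARGVQ  362

> tr|F8W0P7|F8W0P7_HUMAN ATP synthase subunit beta, mitochondrial
(Fragment) OS=Homo sapiens OX=9606 GN=ATP5F1B PE=1 SV=2
Length=270

 Score = 281 bits (720),  Expect = 1e-90, Method: Compositional matrix adjust.
 Identities = 137/168 (82%), Positives = 151/168 (90%), Gaps = 6/168 (4%)
EOF

result=$(awk -v min=70 '
  # Start of a query: remember its header lines up to the first blank line.
  /^[ \t]*Query=/ { header = $0; in_header = 1; header_done = 0; in_hit = 0; next }
  in_header {
    if ($0 ~ /^[ \t]*$/) in_header = 0
    else header = header "\n" $0
    next
  }
  # Start of a hit: begin buffering its lines.
  /^[ \t]*>/ { block = $0; in_hit = 1; next }
  in_hit {
    block = block "\n" $0
    if ($0 ~ /Identities = /) {
      in_hit = 0
      # Extract the number between the first "(" and the following "%".
      pct = $0
      sub(/^[^(]*\(/, "", pct)
      sub(/%.*/, "", pct)
      if (pct + 0 >= min) {
        # Print the query header once, before its first qualifying hit.
        if (!header_done) { print header; header_done = 1 }
        print ""
        print block
      }
    }
  }
' "$report")

printf '%s\n' "$result"
rm -f "$report"
```

On this sample the output is the Query= header followed by the H0YH81 (70%) and F8W0P7 (82%) hits. A query with no qualifying hit prints nothing, and raising or lowering min changes the cutoff.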