How to grep ^I0.0^I in blast outfmt 6
8.4 years ago
xi100f • 0

Hi,

I am having a hard time grepping for e-value 0.0 in BLAST outfmt 6 output. I can't find the right combination of non-printing characters; cat -A shows the tabs as '^I'.

How do you do it?

Example: 
query1    Contig_0004161        98.69   381     5       0       3       383     233134  233514  0.0      676

cat -A outputs:
query1^IContig_0004161^I98.69^I381^I5^I0^I3^I383^I233134^I233514^I0.0^I 676

Thanks in advance!

blast • 2.7k views

You can do that with a simple awk command:

awk -F "\t" '$(NF-1) == 0 {print}' input_file > outfile

This tests the second-to-last column. If you want to test a specific column instead, refer to it by its column number (for example $11 for the e-value).
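As a minimal sketch of the two ways of addressing the e-value column, assuming the default 12-column outfmt 6 layout: the two sample rows and the variable names below are made up for illustration, not taken from a real BLAST run.

```shell
# Hypothetical sample: two tab-separated rows in the 12-column outfmt 6 layout.
sample=$(printf 'q1\tc1\t98.69\t381\t5\t0\t3\t383\t233134\t233514\t0.0\t676\nq2\tc2\t91.00\t100\t9\t0\t1\t100\t10\t109\t2e-30\t130\n')

# E-value addressed as the second-to-last field, counted from the end of the line.
from_end=$(printf '%s\n' "$sample" | awk -F '\t' '$(NF-1) == 0')

# Same test, addressing the e-value by its column number.
by_number=$(printf '%s\n' "$sample" | awk -F '\t' '$11 == 0')

printf '%s\n' "$from_end"
```

Both commands select the same rows as long as every line has exactly 12 columns. If extra columns were requested when running BLAST, $(NF-1) would point at a different field, so the explicit column number is probably the safer choice.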

8.4 years ago

Don't use grep; use awk instead, for example awk -v FS='\t' '$11 < 1e-6', replacing 1e-6 with a threshold small enough for your needs. I wouldn't test for 0.0 itself, though, as in theory it is not meaningful.

By the way, if you really want to use grep maybe grep -P '\t0\.0\t\d+$'?
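To illustrate the threshold approach, here is a small sketch with the cutoff passed in as a variable. The three sample rows and the names sample, max_e and kept are assumptions for the example; adding 0 to both sides is just a defensive way of forcing a numeric comparison.

```shell
# Hypothetical sample: e-values 0.0, 2e-30 and 0.003 in column 11.
sample=$(printf 'q1\tc1\t98.69\t381\t5\t0\t3\t383\t233134\t233514\t0.0\t676\nq2\tc2\t91.00\t100\t9\t0\t1\t100\t10\t109\t2e-30\t130\nq3\tc3\t85.00\t40\t6\t0\t1\t40\t5\t44\t0.003\t38.2\n')

# Cutoff chosen for illustration only; pick one that suits your data.
max_e=1e-6

# Keep rows whose e-value (column 11) is below the cutoff.
kept=$(printf '%s\n' "$sample" | awk -v FS='\t' -v max_e="$max_e" '$11 + 0 < max_e + 0')

printf '%s\n' "$kept"
```

With this sample the rows for q1 and q2 are kept and q3 is dropped, so hits reported as 0.0 are included without having to match the text "0.0" at all.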


0\.0 is better than 0.0 in your grep command; otherwise the unescaped dot matches any character and the pattern will catch things like 010, etc.
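A quick way to see the difference, sketched with grep -E and a literal tab held in a shell variable rather than grep -P (a substitution made here so that it does not depend on PCRE support). The two input lines are invented for the demonstration.

```shell
# A literal tab character, so the pattern works without grep -P.
tab=$(printf '\t')

# Hypothetical input: one line with 010 and one with 0.0 between tabs.
lines=$(printf 'a\t010\tb\nc\t0.0\td\n')

# Unescaped dot: matches any character, so both lines are counted.
unescaped=$(printf '%s\n' "$lines" | grep -c -E "${tab}0.0${tab}")

# Escaped dot: only the literal 0.0 is counted.
escaped=$(printf '%s\n' "$lines" | grep -c -E "${tab}0\.0${tab}")

echo "unescaped=$unescaped escaped=$escaped"
```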


Thanks, good point, I edited my answer. This reminds me: A programmer has a problem and thinks "I know, I'll use regular expressions". Now he has two problems.

Query1^IContig1^I100.00^I582^I0^I0^I62^I643^I39330^I39911^I0.0^I1075$
Query2^IContig2^I96.22^I582^I22^I0^I62^I643^I67349^I66768^I0.0^I 953$

Rob's grep -P '\t0\.0\t' would catch both lines, but Dariober's grep -P '\t0\.0\t\d+$' catches only the first one: \d+ does not match the leading space in the last column of the second line. In fact, it did not match any line in my file with e-value 0.0 and a bit score lower than 1000.


The second line is not matched by '\t0\.0\t\d+$' because you have a space after the tab character. Is your input actually tab-separated? Note also that '\t0\.0\t' will match any line containing 0.0 surrounded by tabs anywhere, not just in the second-to-last column (hence my suggestion of using '\t0\.0\t\d+$').
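The effect of the space-padded bit score can be reproduced with the two example lines from the comment above. This is only a sketch: it uses grep -E with a literal tab and [0-9] in place of grep -P's \t and \d, and the pattern that tolerates padding (optional spaces before the digits) is a suggestion made here, not something from the original answer.

```shell
# The two example lines, with the bit score of the second one padded by a space.
lines=$(printf 'Query1\tContig1\t100.00\t582\t0\t0\t62\t643\t39330\t39911\t0.0\t1075\nQuery2\tContig2\t96.22\t582\t22\t0\t62\t643\t67349\t66768\t0.0\t 953\n')

# Patterns built with printf so that \t becomes a real tab and \\ a backslash.
strict=$(printf '\t0\\.0\t[0-9]+$')      # digits must follow the tab directly
tolerant=$(printf '\t0\\.0\t *[0-9]+$')  # allows space padding before the digits

strict_hits=$(printf '%s\n' "$lines" | grep -c -E "$strict")
tolerant_hits=$(printf '%s\n' "$lines" | grep -c -E "$tolerant")

# The awk test on column 11 is not affected by padding in column 12.
awk_hits=$(printf '%s\n' "$lines" | awk -F '\t' '$11 == 0' | wc -l | tr -d ' ')

echo "strict=$strict_hits tolerant=$tolerant_hits awk=$awk_hits"
```

Here the strict pattern finds one line, while the tolerant pattern and the awk column test both find two.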

In my opinion, parsing tabular data with regexes is a very bad idea. awk is better but still brittle; Python is better still, as you have much more control over what you are doing.

8.4 years ago
Rob ▴ 150

You can do:

grep $'\t0.0\t'

or:

egrep "\s0\.0\s"

or better:

grep -P '\t0\.0\t'

but dariober's awk solution is much neater.

egrep "\s0\.0\s"

or

grep -P '\t0\.0\t'

both work as expected, thanks! I realise that they may catch every instance of '0.0', not only those in the e-value column.

Indeed, awk seems better suited to searching for 0.0 in the 11th (e-value) column.

7.9 years ago
Vitis ★ 2.6k

I'm a big fan of APIs, so I usually use BioPerl or Biopython to parse BLAST results. Both provide modules that handle the different BLAST output formats without hiccups like this one.

http://bioperl.org/howtos/SearchIO_HOWTO.html http://biopython.org/DIST/docs/tutorial/Tutorial.html
