Question

Extracting pairwise alignment info

0

Entering edit mode

7.8 years ago

Kevin D ▴ 30

Hello everyone,

I am trying to write a bash script to extract information from a pairwise alignment text file output (Emboss Needle). In the txt output, there is a header as follow:

#=======================================
#
# Aligned_sequences: 2
# 1: RC0505_OB1
# 2: RC0505_OB2
# Matrix: EDNAFULL
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 866
# Identity:     804/866 (92.8%)
# Similarity:   804/866 (92.8%)
# Gaps:          53/866 ( 6.1%)
# Score: 3881.5
# 
#
#=======================================

I want to create a table with only one line such as i get the sequence ID, length, identity and gaps info.

So in this example, I basically want to create a table like this:

RC0505 866 804 53

I manage to extract the full line containing the Identity info for instance, using sed:

sed -n '/Identity/p' aln.txt

But then I can't figure out how to extract just the "804" part and put it in a table. And do this for the other parameters. I think that grep sed and awk functions could help but I don't know how to use them efficiently.

Any idea ?

Many thanks

pairwise alignement grep awk sed unix • 2.8k views

ADD COMMENT • link updated 7.8 years ago by 5heikki 11k • written 7.8 years ago by Kevin D ▴ 30

score 2 · Accepted Answer · 2017-02-20

EDIT After more info from the OP, this solution doesn't work in all cases where the fractions digit lengths vary.

This is neither pretty nor elegant, but it works. A more talented/persistent programmer will almost certainly be able to do it in one line without needing to recall cat each time.

#!/bin/bash

# Usage: $ bash parseemboss.sh file

percID=$(cat $1 | grep "Identity" | cut -d ' ' -f 7)
name=$(cat $1 | grep "1:" | cut -d ' ' -f 3)
length=$(cat $1 | grep "Length" | cut -d ' ' -f 3)
gaps=$(cat $1 | grep "Gaps" | cut -d ' ' -f 12)


echo -e "$name\t$length\t${percID%/*}\t${gaps%/*}"

The script prints the content of the file with cat $1 ($1 is a built-in bash variable that corresponds to the first argument on the commandline - $0 is the script name itself). It then pipes ( | ) the file to grep to find the distinguishing feature of a line (the strings "Identity", "1:", "Length" and "Gaps".) This 'filtered' file is then passed in to cut which is told the delimiter in your file is standard whitespace -d ' ' and which field you want -f.

The exception is if you were trying to cut a string that contained whitespace itself e.g. "Hello World" which cut would see as 2 different columns ["Hello", "World"].

Each of these piped commands is wrapped in variable=$(...) which tells Bash to evaluate the contents of the bracket, and save it to a variable, rather than just saving the string. This is called command substitution.

echo -e then simply prints the variables tab-separated with \t (the -e flag of echo is needed so that it interprets special characters as such.

As per your requested output format which contains no denominator in the fraction, one can use parameter expansion. A 'normal' variable recall would simply look like: $variable(usually surrounded by " ", and optionally { }).

If however, we recall the variable and use parameter expansion, we can strip parts of the variable name out e.g.:

$gaps == 53/866

but, ${gaps%/*} == 53 because we tell bash to strip after the first / we encounter (%/) and strip everything *.

score 1 · Accepted Answer · 2017-02-20

1

Entering edit mode

7.8 years ago

5heikki 11k

By no means perfect but could be of help..

awk 'BEGIN{FS=" "} NR==4 || NR==11 || NR==13 {printf "%s\t", $3}' input.txt

ADD COMMENT • link 7.8 years ago by 5heikki 11k

0

Entering edit mode

Thank you, your solution is also nice because it takes the whole fraction "804/866" or "53/866" regardless of the number of digits. I mean, my final goal is to run the script on several Emboss output files and sometimes match numbers or gaps numbers have 2 or 3 or 4 digits so I guess jrj.healey's solution will not work fine in those case since his script seems to count the position of the fraction in this particular case where the match numbers has 3 digits and the gaps number has 2 digits. Am I understanding right?

ADD REPLY • link 7.8 years ago by Kevin D ▴ 30

0

Entering edit mode

If you want the whole fraction (that was not what you asked for in your example output), you can modify my script to easily output the full fraction by removing the %/* parts in the last line. I've edited my answer to explain what it's doing as well.

i.e.:

echo -e "$name\t$length\t${percID%/*}\t${gaps%/*}"

Current output:

RC0505_OB1  866 804 53

becomes

echo -e "$name\t$length\t$percID\t$gaps"

output:

RC0505_OB1  866 804/866 53/866

ADD REPLY • link 7.8 years ago by Joe 21k

0

Entering edit mode

Ok! Actually the "issue" is not of having the whole fraction. Let's say I have a second Emboss output file and the parameters are now:

# Length: 1076
# Identity:    1062/1076 (98.7%)
# Similarity:  1061/1076 (98.6%)
# Gaps:           0/1076 ( 0.0%)
# Score: 5248.0

Then the script will return:

RC0058_OB1      1076    (98.7%)

I think this is because the parseemboss.sh script assumes the Identity score always starts at the 7th position. In fact this is not true for every emboss output file but for sure you couldn't know as I didn't mention this.

ADD REPLY • link 7.8 years ago by Kevin D ▴ 30

0

Entering edit mode

Ah yeah I see the problem. My approach probably isn't going to work overall then in that case. Sorry!

ADD REPLY • link 7.8 years ago by Joe 21k