Calculate percentage of local blast hit
1
0
Entering edit mode
5.5 years ago
fec2 ▴ 50

Hi all,

I need to calculate percentage of blastp hits from a few genome as formula below:

[(C1 + C2)/(T1 + T2)] * 100

C1 and C2 is the number of blastp hits of two genomes against each other, for example genome A against genome B represent C1 and genome B against genome A represent C2. T1 and T2 is the total number of proteins in the two genomes being compared.

Now, I have a .txt file consists total number of protein for all my genome of interest:

A:1234 B:1234 C:1234...

I also have another .txt file consists of blastp result for each other genome:

A_B 123, B_A 123, A_C 123, B_C 123..

Is it possible to have a command that is able to calculate the percentage of A_B, A_C and B_C based on formula above?

I apologise if my question is confusing and please tell me if any part of my question is unclear.

gene genome • 2.7k views
ADD COMMENT
0
Entering edit mode

a single (linux) cmdline will be difficult I guess. Are you familiar with any programming/scripting language? Writing a small script will have that processed quickly.

ADD REPLY
0
Entering edit mode

Hi, I am not familiar with any programming/scripting language. I am actually kind of new in this field.

ADD REPLY
3
Entering edit mode
5.5 years ago
AK ★ 2.2k

Hi fec2,

This should work:

$ cat prot_numbers.txt
A:1234
B:1234
C:1234

$ cat blastp.txt
A_B 123
B_A 124
A_C 123
C_A 125
B_C 123
C_B 126

$ awk -F'[:_ ]' 'FILENAME=="prot_numbers.txt" {prot[$1]=$2; next} {a[1]=$1; a[2]=$2; asort(a); num[a[1]"-"a[2]]+=$3; deno[a[1]"-"a[2]]=prot[a[1]]+prot[a[2]]} END {for (pair in num){print pair, num[pair]*100/deno[pair]}}' prot_numbers.txt blastp.txt
B-C 10.0891
A-B 10.0081
A-C 10.0486

-F'[:_ ]' uses :, _, and (space) as delimiters. It first creates an array named prot to save the number of protein in each genome then adds the hits from A_B and B_A to an array named num with A-B as the key (after sorting the order of each pair: A_B -> A_B, B_A -> A_B). Finally, for each pair, compute (C1 + C2) num[pair] *100/(T1 + T2) deno[pair]. Note that in the results the pairs are unique already.

ADD COMMENT
0
Entering edit mode

Hi SMK,

Thanks for your script. May I know what is the command "cat prot_numbers.txt" and "cat blastp.txt" for?

I got error as below: awk: calling undefined function asort input record number 1, file blastp.txt source line number 1

My actual data are like below:

prot_numbers.txt:

 Chryseomicrobium_excrementi_ET03.fasta:2640
 Filibacter_tadaridae_TB_66.fasta:2970
 Jeotgalibacillus_alimentarius_YKJ_13.fasta:3372

blastp.txt:

 Chryseomicrobium_excrementi_ET03.fasta_Filibacter_tadaridae_TB_66.fasta.tsv        1526
 Chryseomicrobium_excrementi_ET03.fasta_Jeotgalibacillus_alimentarius_YKJ_13.fasta.tsv      1556
 Filibacter_tadaridae_TB_66.fasta_Chryseomicrobium_excrementi_ET03.fasta.tsv        1514
 Filibacter_tadaridae_TB_66.fasta_Jeotgalibacillus_alimentarius_YKJ_13.fasta.tsv        1576
 Jeotgalibacillus_alimentarius_YKJ_13.fasta_Chryseomicrobium_excrementi_ET03.fasta.tsv      1548
 Jeotgalibacillus_alimentarius_YKJ_13.fasta_Filibacter_tadaridae_TB_66.fasta.tsv        1577

Thank you very much.

ADD REPLY
0
Entering edit mode

Hi fec2,

cat prot_numbers.txt and cat blastp.txt were just to show you how the example dataset I used looked like. Is your "actual" blastp.txt separated with a tab between the first and the second column?

ADD REPLY
0
Entering edit mode

I got the result using command from previous post (C: To get number of hits from blastp output file) as below:

(for i in *.tsv; do echo -ne "${i}\t" && awk '$3>=40{print $2}' ${i} | sort -u | wc -l; done) > blastp.txt

in which I have multiple blastp output (format 6) in a directory, and I want to calculate the number of hits with sequence identity of more than 40% for each output file.

I am not sure is it separate by tab. I am so sorry for that.

ADD REPLY
0
Entering edit mode

Hi fec2,

From echo -ne "${i}\t" we know that the results will be tab (\t) separated, and we can use man echo to check out the meaning of parameters: -ne. Or, you can use cat -T blastp.txt to check, \t will be revealed as ^I (I really suggest you make it a chance to learn and to know what each component in the commands is for, instead of just copy/paste).

In your original question, it was: A_B 123, while in your "actual" dataset it is something like A_B.tsv 123. I would suggest to first re-format your "actual" data to keep only prefix (species?) and use tab to separate them (because _ is also in your prefix which makes things complex a bit). Finally, you can change _ from the delimiters to \t for awk. This should get you there:

sed "s/.fasta_/\t/g; s/.fasta.tsv//g" blastp.txt > blastp_re.txt
sed "s/.fasta//g" prot_numbers.txt > prot_numbers_re.txt

awk -F'[:\t]' 'FILENAME=="prot_numbers_re.txt" {prot[$1]=$2; next} {a[1]=$1; a[2]=$2; asort(a); num[a[1]"-"a[2]]+=$3; deno[a[1]"-"a[2]]=prot[a[1]]+prot[a[2]]} END {for (pair in num){print pair, num[pair]*100/deno[pair]}}' prot_numbers_re.txt blastp_re.txt
ADD REPLY
0
Entering edit mode

Hi SMK,

Thanks for your advice, I will learn it for the future use. However, I still got error after reformat the .txt file :

awk: calling undefined function asort
input record number 1, file blastp_re.txt
source line number 1

Any suggestion? Really thanks for your help.

ADD REPLY
1
Entering edit mode

Hi fec2,

Are you using the default awk in OSX? The commands (sed, awk) are based on the GNU Project's implementation, which should work in Linux (GNU) environment.

If you're in OSX, you can first do: brew install gnu-sed gawk (check out homebrew), and change the codes to:

gsed "s/.fasta_/\t/g; s/.fasta.tsv//g" blastp.txt > blastp_re.txt
gsed "s/.fasta//g" prot_numbers.txt > prot_numbers_re.txt

gawk -F'[:\t]' 'FILENAME=="prot_numbers_re.txt" {prot[$1]=$2; next} {a[1]=$1; a[2]=$2; asort(a); num[a[1]"-"a[2]]+=$3; deno[a[1]"-"a[2]]=prot[a[1]]+prot[a[2]]} END {for (pair in num){print pair, num[pair]*100/deno[pair]}}' prot_numbers_re.txt blastp_re.txt

Check here for more information about AWK implementations.

ADD REPLY
0
Entering edit mode

Thank you very much! The command is working fine now.

ADD REPLY
0
Entering edit mode

That's great, glad it helps! :-)

ADD REPLY
0
Entering edit mode

Hi,

I wonder is it possible to have the result in matrix format? Thank you.

ADD REPLY
0
Entering edit mode

Can you provide an example?

ADD REPLY
0
Entering edit mode

For example:

              A      B       C    
       A     100  
       B     95     100     74      
       C     58     78      100
ADD REPLY
0
Entering edit mode

If still sticking to what was asked at the beginning:

Is it possible to have a command...

You can try changing the code to:

$ cat blastp.txt
A_B 123
B_A 124
A_C 123
C_A 125
B_C 123
C_B 126
$ cat prot_numbers.txt
A:1234
B:1234
C:1234
$ awk -F'[:_ ]' 'FILENAME=="prot_numbers.txt" {prot[$1]=$2; next} {a[1]=$1; a[2]=$2; asort(a); num[a[1]"-"a[2]]+=$3; deno[a[1]"-"a[2]]=prot[a[1]]+prot[a[2]]} END {for (i in prot) {for (j in prot) {pair=i"-"j; if (pair in num) {printf num[pair]*100/deno[pair] "\t"} else {printf ".\t"}}; printf "\n"}}' prot_numbers.txt blastp.txt
.   10.0081 10.0486
.   .   10.0891
.   .   .

Because each pair is sorted alphabetically (A-B: row-col), here it only shows A-B, A-C, and B-C.

ADD REPLY
0
Entering edit mode

Hi,

I got an error as below:

    cmd. line:1: (FILENAME=blastp_re.txt FNR=6) fatal: division by zero attempted
ADD REPLY
0
Entering edit mode

What was the input?

ADD REPLY
0
Entering edit mode

I m sorry, I missed out something important, the input blastp.txt should be:

   A          B            123
   B         C             321

And is separate by tab. I am sorry that I missed out this part

Is not A_B and B_C.

ADD REPLY
0
Entering edit mode

I see. As I explained before:

-F'[:_ ]' uses ":", "_", and " " (space) as delimiters

If the inputs are:

$ cat blastp.txt
A   B   123
B   A   124
A   C   123
C   A   125
B   C   123
C   B   126
$ cat prot_numbers.txt
A:1234
B:1234
C:1234

Changing the command to this should work:

awk -F'[:\t]' 'FILENAME=="prot_numbers.txt" {prot[$1]=$2; next} {a[1]=$1; a[2]=$2; asort(a); num[a[1]"-"a[2]]+=$3; deno[a[1]"-"a[2]]=prot[a[1]]+prot[a[2]]} END {for (i in prot) {for (j in prot) {pair=i"-"j; if (pair in num) {printf num[pair]*100/deno[pair] "\t"} else {printf ".\t"}}; printf "\n"}}' prot_numbers.txt blastp.txt
ADD REPLY
0
Entering edit mode

Hi,

Thank you. However, the matrix doesn't have any label. Only has number and “.”. Could it be possible to have a label? Thanks again.

ADD REPLY
1
Entering edit mode

Alright, this should give you the labels (glad that it's still "a" command):

awk -F'[:\t]' 'FILENAME=="prot_numbers.txt" {prot[$1]=$2; next} {a[1]=$1; a[2]=$2; asort(a); num[a[1]"-"a[2]]+=$3; deno[a[1]"-"a[2]]=prot[a[1]]+prot[a[2]]} END {for (col in prot) {printf "\t" col}; printf "\n"; for (row in prot) {printf row; for (col in prot) {pair=row"-"col; if (pair in num) {printf "\t" num[pair]*100/deno[pair]} else {printf "\t."}}; printf "\n"}}' prot_numbers.txt blastp.txt

When inputs are:

$ cat blastp.txt
A   B   123
B   A   124
A   C   123
C   A   125
B   C   123
C   B   126
$ cat prot_numbers.txt
A:1234
B:1234
C:1234

It should return...

    A   B   C
A   .   10.0081 10.0486
B   .   .   10.0891
C   .   .   .
ADD REPLY
0
Entering edit mode

Thank you very much!

ADD REPLY
0
Entering edit mode

Could it be possible that the dataset has no self comparison? For example no A-A, B-B and C-C?

ADD REPLY
0
Entering edit mode

When no pair being found it prints “.”.

ADD REPLY

Login before adding your answer.

Traffic: 2604 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6