Question

How To Define And Calculate CIP (A New Parameter For BLAST Analysis In The CIP And CALP Method)?

14.9 years ago

Liuyunlong ▴ 130

In the paper "Improved criteria and comparative genomics tool provide new insights into grass paleogenomics", the authors write:

To increase the significance of inter-specific sequence alignments for inferring evolutionary relationships between genomes, we defined two new parameters for BLAST analysis: CIP for Cumulative Identity Percentage and CALP for Cumulative Alignment Length Percentage. CIP = ∑nb ID by (HSP/AL) x 100 corresponds to the cumulative percent of sequence identity observed for all the HSPs divided by the cumulative aligned length (AL) which corresponds to the sum of all HSP lengths. CALP = AL/query length is the sum of the HSP lengths (AL) for all HSPs divided by the length of the query sequence. With these parameters, BLAST produces the highest cumulative percentage identity over the longest cumulative length thereby increasing stringency in defining conservation between two genome sequences.

In my opinion, CIP is simply the sum of num_identical (the number of identical residues, in BioPerl terms) over all HSPs, divided by AL (the sum of all HSP lengths). Am I right? But the formula CIP = ∑nb ID by (HSP/AL) x 100 puzzles me. Can I interpret it as

CIP=∑nb ID and ID=(HSP/AL) x 100,

If so, what do 'HSP' and 'nb' mean here?

Can you help me?

blast • 9.0k views

Can you link to the paper so we can look at it?

ADD REPLY • link 14.9 years ago by Neilfws 49k

Don't worry - full text link is http://bib.oxfordjournals.org/cgi/content/full/10/6/619.

ADD REPLY • link 14.9 years ago by Neilfws 49k

AFAIK, HSP = high-scoring segment pair. You can obtain the HSPs from BLAST output; see http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml. I'm not sure what nb is, as I don't have access to the paper. You could also ask the authors for the exact details.

ADD REPLY • link 14.9 years ago by Khader Shameer 18k

Ram · Answer 1 · 2010-08-23

They seem quite fond of these statistics. They are also defined in "Structure and expression analysis of rice paleo duplications", where the equation for CIP is written in a slightly different (and clearer) way, with "/" in place of "by":

∑nb ID/HSP/AL x 100

In "Bioperl terms", I would say that:

ID  = $hsp->num_identical
HSP = $hsp->length('total')
AL  = sum of all $hsp->length('total')

So you would first calculate AL from all HSPs. Then you would calculate ID and HSP for each $hsp, divide by AL, multiply by 100 and sum them all together. Basically, it just gives you a percentage identity across the whole sequence by adding together the identities for the HSPs.
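Taking the last sentence literally, one possible way to write the two parameters is shown below. This is only a sketch of that reading, not a formula confirmed by the paper: the symbols $\mathrm{ID}_i$ (num_identical of the i-th HSP), $L_i$ (its total length) and $n$ (the number of HSPs for the hit) are notation introduced here for illustration.

```latex
\[
\mathrm{AL} = \sum_{i=1}^{n} L_i, \qquad
\mathrm{CIP} = \frac{\sum_{i=1}^{n} \mathrm{ID}_i}{\mathrm{AL}} \times 100, \qquad
\mathrm{CALP} = \frac{\mathrm{AL}}{\mathrm{query\ length}}
\]
```

Under this reading, a hit with a single HSP has a CIP equal to that HSP's percent identity.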

Ram · Answer 2 · 2010-08-23

Thanks. The "add comment" box has a length limit, so I am posting my attempt as an answer instead.

Here is my perl script:

use warnings;
use strict;
use Bio::SearchIO;

# BLAST report to parse, given as the first command-line argument
my $infile = shift;

my $parser = Bio::SearchIO->new(-file   => $infile,
                                -format => 'blast');
while( my $result = $parser->next_result ) {
    print "Query: ",$result->query_name,"\tlength: ",$result->query_length,
          "\tnumber of hits: ",$result->num_hits,"\n";
    while( my $hit = $result->next_hit ) {
        print "\tHit: ",$hit->name,"\tlength: ",$hit->hit_length,
              "\te-value: ",$hit->significance,"\n";
        my $al = 0;    # AL: cumulative aligned length over all HSPs of this hit
        my @id = ();   # per-HSP identity fractions (num_identical / hsp_length)
        while( my $hsp = $hit->next_hsp ) {
            print "\t\tHsp length: ",$hsp->hsp_length,"\tID length: ",
                  $hsp->num_identical,"\t idenity: ",
                  $hsp->percent_identity,"\n";
            $al = $al + ($hsp->hsp_length);
            my $ai = ($hsp->num_identical)/($hsp->hsp_length);
            push @id, $ai;
        }
        # CIP as read from the formula: sum over HSPs of (ID/HSP)/AL x 100
        my $cip = 0;
        foreach (@id) {
            $cip = $cip + ($_/$al*100);
        }
        # CALP: cumulative aligned length divided by the query length
        my $calp = $al/($result->query_length);
        print "\tcip: ",$cip,"\tcalp: ",$calp,"\n";
    }
}

I parsed a blastn result (e-value cutoff: 1e-40; all genes from rice chromosome 11 vs. all genes from Brachypodium distachyon chromosome 4).

Sample output for a hit with a single HSP:

Query: LOC_Os11g01010.1 length: 357 number of hits: 1
    Hit: Bradi4g45390.1 length: 366 e-value: 9e-76
        Hsp length: 337 ID length: 288   idenity: 85.459940652819
    cip: 0.253590328346644  calp: 0.943977591036415

And sample output for a hit with multiple HSPs:

Query: LOC_Os11g01380.1 length: 5127    number of hits: 4
    Hit: Bradi4g26880.1 length: 4821    e-value: 0.0
        Hsp length: 2829    ID length: 2576  idenity: 91.0569105691057
        Hsp length: 1437    ID length: 1298  idenity: 90.3270702853166
        Hsp length: 432 ID length: 387   idenity: 89.5833333333333
    cip: 0.0576771635137837 calp: 0.916325336454067

All the CIP values are far too low, and almost no hit exceeds 60% (the cutoff used in the paper). Yet the two chromosomes are orthologous and share a large, highly conserved segment. If my script is right, this result suggests that this interpretation of the formula is not the intended one.
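One way to check this is to recompute CIP from the numbers printed above under the other reading of the formula, the one proposed in the question, where the identical positions are summed over all HSPs before dividing by AL. The following is a minimal sketch and not the paper's confirmed method; the subroutine names are hypothetical, and the HSP values are copied from the Bradi4g26880.1 sample output. It uses only core Perl, so no BLAST report is needed.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# HSPs of hit Bradi4g26880.1, copied from the sample output:
# [ HSP length, number of identical positions ]
my @hsps = ( [2829, 2576], [1437, 1298], [432, 387] );

# AL: cumulative aligned length, i.e. the sum of all HSP lengths
sub cumulative_length {
    return sum( map { $_->[0] } @_ );
}

# Reading used in the script above:
# sum over HSPs of (identical / HSP length), divided by AL, times 100
sub cip_per_hsp_fraction {
    my $al = cumulative_length(@_);
    return 100 * sum( map { $_->[1] / $_->[0] } @_ ) / $al;
}

# Alternative reading (an assumption, not confirmed by the paper):
# sum of identical positions over all HSPs, divided by AL, times 100
sub cip_summed_identities {
    my $al = cumulative_length(@_);
    return 100 * sum( map { $_->[1] } @_ ) / $al;
}

printf "per-HSP fraction reading:  %.4f\n", cip_per_hsp_fraction(@hsps);   # 0.0577
printf "summed identities reading: %.2f\n", cip_summed_identities(@hsps);  # 90.70
```

The first value reproduces the 0.0577 printed by the script, while the second gives about 90.7%, which is on the same scale as the 60% cutoff. For the single-HSP hit the summed-identities reading reduces to 288/337, i.e. the 85.46% identity of that HSP.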