How To Define And Calculate CIP (New Parameters For BLAST Analysis: The CIP And CALP Method)?
14.3 years ago
Liuyunlong ▴ 130

In the paper "Improved criteria and comparative genomics tool provide new insights into grass paleogenomics", the authors write:

To increase the significance of inter-specific sequence alignments for inferring evolutionary relationships between genomes, we defined two new parameters for BLAST analysis: CIP for Cumulative Identity Percentage and CALP for Cumulative Alignment Length Percentage. CIP = ∑nb ID by (HSP/AL) x 100 corresponds to the cumulative percent of sequence identity observed for all the HSPs divided by the cumulative aligned length (AL) which corresponds to the sum of all HSP lengths. CALP = AL/query length is the sum of the HSP lengths (AL) for all HSPs divided by the length of the query sequence. With these parameters, BLAST produces the highest cumulative percentage identity over the longest cumulative length thereby increasing stringency in defining conservation between two genome sequences.

In my opinion, CIP is simply the sum of num_identical (the number of identical residues, in BioPerl terms) over all HSPs, divided by AL (the sum of all HSP lengths). Am I right? But the formula CIP = ∑nb ID by (HSP/AL) x 100 puzzles me. Can I interpret it as

CIP=∑nb ID and ID=(HSP/AL) x 100,

and if so, what do 'HSP' and 'nb' mean here?

Can you help me?


Can you link to the paper so we can look at it?


AFAIK, HSP (or hsp) stands for high-scoring segment pair. You can obtain HSPs from BLAST output; see http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml. I am not sure what nb is, as I don't have access to the paper. You could also ask the authors for the exact details.

14.3 years ago
Neilfws 49k

They seem quite fond of these statistics. They are also defined in: Structure and expression analysis of rice paleo duplications, where the equation for CIP is described in a slightly different (and clearer) way - "by" is the same as "/":

∑nb ID/HSP/AL x 100

In "Bioperl terms", I would say that:

ID  = $hsp->num_identical
HSP = $hsp->length('total')
AL  = sum of all $hsp->length('total')

So you would first calculate AL from all HSPs. Then you would calculate ID and HSP for each $hsp, divide by AL, multiply by 100 and sum them all together. Basically, it just gives you a percentage identity across the whole sequence by adding together the identities for the HSPs.
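Written out explicitly, the reading described above might look like the following. This is only a sketch of that interpretation, not notation taken from either paper; the index i and the HSP count n are introduced here for illustration.

```latex
\mathrm{AL} = \sum_{i=1}^{n} \mathrm{HSP}_i,
\qquad
\mathrm{CIP} = \sum_{i=1}^{n} \frac{\mathrm{ID}_i / \mathrm{HSP}_i}{\mathrm{AL}} \times 100
```

Here ID_i is the number of identical positions in the i-th HSP and HSP_i is its length.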

14.3 years ago
Liuyunlong ▴ 130

Thanks. The "add comment" box has a length limit, so I am posting my attempt as an answer instead.

Here is my Perl script:

use warnings;
use strict;
use Bio::SearchIO;

my $infile  = shift;

my $parser = Bio::SearchIO->new(-file   => $infile,
                                -format => 'blast');
while( my $result = $parser->next_result ) {
    print "Query: ",$result->query_name,"\tlength: ",$result->query_length,
          "\tnumber of hits: ",$result->num_hits,"\n";
    while(my $hit=$result->next_hit){
        print "\tHit: ",$hit->name,"\tlength: ",$hit->hit_length,
                  "\te-value: ",$hit->significance,"\n";
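        my $al=0;      # cumulative aligned length (AL): sum of all HSP lengths
        my @id=();     # per-HSP identity fractions (num_identical / hsp_length)
        while(my $hsp=$hit->next_hsp){
            print "\t\tHsp length: ",$hsp->hsp_length,"\tID length: ",
                          $hsp->num_identical,"\t idenity: ",
                          $hsp->percent_identity,"\n";
            $al=$al+($hsp->hsp_length);
            my $ai=($hsp->num_identical)/($hsp->hsp_length);
            push @id,$ai;
        }
        # CIP as interpreted above: sum of (ID/HSP)/AL x 100 over all HSPs
        my $cip=0;
        foreach (@id){
            $cip=$cip+($_/$al*100);
        }
        # CALP: cumulative aligned length divided by the query length
        my $calp=$al/($result->query_length);
        print "\tcip: ",$cip,"\tcalp: ",$calp,"\n";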
    }
}

I parsed a blastn result (E-value cutoff 1e-40; all genes from rice chromosome 11 vs. all genes from Brachypodium distachyon chromosome 4).

Sample output for a hit with a single HSP:

Query: LOC_Os11g01010.1 length: 357 number of hits: 1
    Hit: Bradi4g45390.1 length: 366 e-value: 9e-76
        Hsp length: 337 ID length: 288   idenity: 85.459940652819
    cip: 0.253590328346644  calp: 0.943977591036415

and sample output for a hit with multiple HSPs:

Query: LOC_Os11g01380.1 length: 5127    number of hits: 4
    Hit: Bradi4g26880.1 length: 4821    e-value: 0.0
        Hsp length: 2829    ID length: 2576  idenity: 91.0569105691057
        Hsp length: 1437    ID length: 1298  idenity: 90.3270702853166
        Hsp length: 432 ID length: 387   idenity: 89.5833333333333
        cip: 0.0576771635137837         calp: 0.916325336454067

All the CIP values are far too low, and almost no hit exceeds 60% (the cutoff value given in the paper). Yet the two chromosomes are orthologous and share a large, highly conserved segment. If my script is right, the result suggests that this explanation of the formula may be unreasonable.
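For comparison, here is a small Python sketch that recomputes the two sample hits from their (HSP length, identical positions) pairs under both readings discussed in this thread: the per-HSP ratio divided by AL, as in the script, and the pooled reading from the original question (total identities divided by AL). The function names are made up for illustration, and nothing here confirms which definition the paper's authors intended.

```python
def cip_as_scripted(hsps):
    """Reading used in the Perl script: sum of (ID / HSP length) / AL * 100."""
    al = sum(length for length, _ in hsps)
    return sum((ident / length) / al * 100 for length, ident in hsps)


def cip_pooled(hsps):
    """Reading from the original question (an assumption, not confirmed
    by the paper): total identical positions / AL * 100."""
    al = sum(length for length, _ in hsps)
    return sum(ident for _, ident in hsps) / al * 100


def calp(hsps, query_length):
    """CALP: cumulative aligned length (AL) divided by the query length."""
    return sum(length for length, _ in hsps) / query_length


# (HSP length, identical positions) pairs copied from the sample output
single = [(337, 288)]
multi = [(2829, 2576), (1437, 1298), (432, 387)]

for name, hsps, qlen in (("single", single, 357), ("multi", multi, 5127)):
    print(name,
          "scripted:", cip_as_scripted(hsps),
          "pooled:", cip_pooled(hsps),
          "calp:", calp(hsps, qlen))
```

On these two hits the pooled reading gives roughly 85.5% and 90.7%, which is at least on the same scale as a 60% cutoff, whereas the scripted reading reproduces the very small values shown above.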


"Incorrect" is the word you want, it's a perfectly reasonable explanation :-) OK, we need to think about this some more. I suspect one problem here is that the authors have used mathematical notation to make their work look more impressive, but they don't really know how to use it.


:) I agree. What about the ∑nb in the formula? Is that a normal mathematical expression?

By the way, the authors use this formula many times across different papers (the two above, and others as well). Isn't that weird?


I think nb can mean "negative binomial", if that helps.
