Question

Biopython - calculating query coverage and Identity

3

Entering edit mode

8.8 years ago

efratefrat120 ▴ 30

Hi, I'm using biopython for the first time to analyze blastp results. I understand how to extract the E-value, but not sure as to how to calculate the "query cover" and "ident" fields that appear in the website gui. I would appreciate any help! Thanks, Efrat

biopython python blast blastp • 16k views

ADD COMMENT • link updated 7.2 years ago by devxpy ▴ 20 • written 8.8 years ago by efratefrat120 ▴ 30

2

Entering edit mode

Last time I checked 2 years ago, BioPython did not have a good built in function to calculate the query coverage from blast output. Bummer! (This is seemingly because BioPython doesn't have a function for tiling HSPs) BioPerl has (and maybe BioPython developers have added this in the last 2 years, or I simply couldn't find it ;). The crux is that the HSP need to be tiled correctly. If you see any solution based on a trivial calculation (query_end - query_start) / query_length, or even including gap-length:These solutions are incorrect because they are not tiling the HSPs correctly. (E.g. this one: https://codereview.stackexchange.com/questions/39879/calculate-query-coverage-from-blast-output)

Instead, the query coverage per subject or per hsp can be output using tabular format either with option qcovhsp or qcovs, e.g.:

-outfmt 6 std qcovs

Will output the coverage per subject as the last column.

see https://www.ncbi.nlm.nih.gov/books/NBK279675/

ADD REPLY • link 8.8 years ago by Michael 55k

0

Entering edit mode

Thank you Michael! I don't mind doing the calculation myself and not using a built in function, but I'm not sure I understand how these values are calculated - are you familiar with a reference that might help me?

ADD REPLY • link 8.8 years ago by efratefrat120 ▴ 30

0

Entering edit mode

There is no correct way of calculating the query coverage without tiling the HSPs first. The safest way would be to either use the BioPerl builtin function or the Blast tabular output.

I am linking the code that does this in BioPerl, note that it depends on the other methods

tile_hsps, logical_length, and length_aln

http://doc.bioperl.org/bioperl-live/Bio/Search/Hit/GenericHit.html#CODE31

ADD REPLY • link 8.8 years ago by Michael 55k

0

Entering edit mode

Thank you! That's very helpful. What about the identity? is there a way to calculate that?

ADD REPLY • link 8.8 years ago by efratefrat120 ▴ 30

0

Entering edit mode

(query_end - query_start + 1)

ADD REPLY • link 4.2 years ago by bjwiley23 ▴ 40

score 2 · Answer 1 · 2017-11-02

2

Entering edit mode

7.2 years ago

devxpy ▴ 20

blast_record = NCBIXML.parse(open('myfile.xml', 'r'))
for query in blast_record:
    for alignment in query.alignments:
        for hsp in alignment.hsps:
            print('coverage', hsp.align_length / query.query_length)
            print('identitiy', hsp.identities/ hsp.align_length)

This is what I used for a project I am working on.

ADD COMMENT • link 7.2 years ago by devxpy ▴ 20

0

Entering edit mode

So, hsp.identities/hsp.align_length is percent identify?

ADD REPLY • link 4.8 years ago by dandanli0365 • 0

0

Entering edit mode

Coverage is NOT hsp.align_length in the numerator. There could be gaps in the query that consume the aligned sequence but not the query (think cigar in SAM). Instead coverage of query is (I am using the XML from Blast but should have similar names in BioPython) calculated by floor((Hsp_query-to - Hsp_query-from + 1)/BlastOutput_query-len). You need to calculate the length of the query that is aligned, not the length of the aligned sequence that is aligned for the numerator. Test HIT #5 for blastp for "P13773". The coverage should be 85% (337 residues consumed of the query/392 query length). The hsp.align_length is 349 from the aligned sequence (so there are 12 gaps in the coverage of query).

https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Get&VIEW_RESULTS=FromRes&RID=SFDKNWMC016&UNIQ_OBJ_NAME=A_SearchResults_1kStBG_1bKj_dqLNz0M48Op_GTXQl_21Um0o&QUERY_INDEX=0

ADD REPLY • link 4.2 years ago by bjwiley23 ▴ 40

score 0 · Answer 2 · 2016-03-08

0

Entering edit mode

8.8 years ago

Bioinformatics_NewComer ▴ 330

If I understand you correctly, you're attempting to read output from BLAST. If you run BLAST locally, with output format 6 you'd be able to parse the e-value, ident field pretty smoothly.

ADD COMMENT • link 8.8 years ago by Bioinformatics_NewComer ▴ 330

0

Entering edit mode

I'm not running the BLAST locally, I'm using the remote option - is there any reason not to run it locally if I have the full nt database on my server?

ADD REPLY • link 8.8 years ago by efratefrat120 ▴ 30

score 0 · Answer 3 · 2016-03-08

Which output format are you using when you run blast? I believe that percent identity is in the list of default output fields. You can specify that blast also store the query coverage in the output.

For example -outfmt "6 std qcovs" would output the default blast fields (qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore) as well as query coverage, in that order. See the output of blastn/p/x -help for more information.

As for calculating it in biopython, it looks like qcovs is roughly equal to (hit_length - gap_open) / query_length. You'd need the lengths of your query sequences.