What minimum percent identity and coverage should BLAST hits have to be considered a gene sequence?
9.1 years ago

Hello group,

I predicted peptide sequences from de novo assembled contigs using an ab initio approach (GENSCAN) and ran a similarity search (BLASTP) on them to identify genes in the assembled sequences. The difficulty I am facing is choosing the minimum percent identity and coverage for the BLAST hits. What should the minimum thresholds for percent identity and coverage be so that I can say with confidence that the gene is present? This is eukaryotic genome data.

blast gene alignment sequence • 12k views
9.1 years ago
Renesh ★ 2.2k

For BLASTP, look for alignments with an e-value < 0.001 (1e-3) to infer that the given gene is present.

More detail: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3820096/

What a nice paper! Not too long, not too short, with a simple summary of recommended parameter settings. It should be required reading in bioinformatics. I will add it to the training list of recommended papers.

Another interesting tidbit: it is by the author of the FASTA suite, hence the FASTA format ...

That's quite a relaxed e-value threshold. I would say 1e-6 is used more commonly, but it isn't going to guarantee anything, especially with multi-domain eukaryotic proteins. Even a much stricter e-value threshold, say 1e-60, isn't going to guarantee much, since such an e-value can be due to a single domain shared between the query and subject sequences. OP is on the right track with applying some kind of coverage threshold. I've listed the relevant specifiers below. I would personally feel relatively confident with something like qlen/slen = 1 ± 0.25 && qlen/length = 1 ± 0.25.

  • qlen - Query sequence length
  • slen - Subject sequence length
  • length - Alignment length
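
As a rough illustration of that filter, here is a minimal Python sketch. It assumes the search was run with BLAST+ and `-outfmt "6 qseqid sseqid pident length evalue bitscore qlen slen"`, so the column order below is an assumption about your own command line. The function names and the example rows are made up for illustration, and the 1e-6 and ±0.25 cutoffs are just the values suggested above, not established thresholds.

```python
import csv
import io

# Assumed column order, matching a BLAST+ run with
# -outfmt "6 qseqid sseqid pident length evalue bitscore qlen slen"
COLUMNS = ["qseqid", "sseqid", "pident", "length",
           "evalue", "bitscore", "qlen", "slen"]


def within(ratio, tolerance=0.25):
    """True if ratio lies in the interval 1 +/- tolerance."""
    return abs(ratio - 1.0) <= tolerance


def passes(hit, max_evalue=1e-6, tolerance=0.25):
    """Apply the e-value cutoff and both length-ratio checks to one hit."""
    qlen = int(hit["qlen"])
    slen = int(hit["slen"])
    alen = int(hit["length"])  # alignment length, gaps included
    return (float(hit["evalue"]) <= max_evalue
            and within(qlen / slen, tolerance)
            and within(qlen / alen, tolerance))


def filter_hits(handle, **kwargs):
    """Read tab-separated BLAST hits and keep those that pass."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    return [row for row in reader if passes(row, **kwargs)]


# Two made-up hits: a near full-length match, and a short local match
# (one shared domain) against a much longer subject.
example = (
    "pep1\tsubjA\t62.5\t400\t1e-120\t350\t410\t420\n"
    "pep2\tsubjB\t95.0\t40\t1e-20\t80\t400\t900\n"
)
kept = filter_hits(io.StringIO(example))
print([hit["qseqid"] for hit in kept])
```

The second hit has a good e-value and high identity but is dropped, because the subject is more than twice the query length and the alignment spans only a small part of the query.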

Thanks for the above paper.

The paper states that e-values and bit scores (bits > 50) are a more sensitive and reliable basis for inferring homology. I filtered the BLAST hits based on these parameters, but the confusion remains: what percent coverage (the percentage of the gene sequence's length covered by the alignment) should the hits have?

Some hits show high identity and a bit score > 50 but cover only 5-10% of the gene sequence. Can we consider such a hit a gene? Is there any defined threshold for alignment coverage?
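
To make "coverage" concrete, here is a small Python sketch of one way it could be computed. It assumes the 1-based, inclusive `qstart` and `qend` coordinates that BLAST+ tabular output can report for each HSP, and it merges overlapping HSPs so that no residue is counted twice. The function name and the example numbers are hypothetical, and the sketch does not say what cutoff to use.

```python
def query_coverage(hsps, qlen):
    """Percent of the query covered by the union of HSP intervals.

    hsps: list of (qstart, qend) pairs, 1-based and inclusive.
    qlen: full length of the query sequence.
    """
    covered = 0
    current_end = 0  # last query position already counted
    # Normalise each pair so start <= end, then sweep left to right.
    for start, end in sorted((min(s, e), max(s, e)) for s, e in hsps):
        start = max(start, current_end + 1)  # skip positions already counted
        if end >= start:
            covered += end - start + 1
            current_end = end
    return 100.0 * covered / qlen


# Made-up example: two overlapping HSPs on a 400-residue predicted peptide.
print(query_coverage([(1, 30), (21, 40)], qlen=400))
```

Here the two HSPs together span positions 1 to 40, so only 10% of the predicted peptide is covered, which is the kind of hit described above.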
