Hello group,
I predicted peptide sequences from de novo assembled contigs using an ab initio approach (GENSCAN) and ran a similarity search (BLASTP) on them to identify genes in the assembled sequences. The difficulty I am facing is choosing the minimum percent identity and coverage for the BLAST hits. What should the minimum thresholds for percent identity and coverage be so that it can be said with confidence that the gene is present? This is eukaryotic genome data.
What a nice paper! Not too long, not too short, with a simple summary of recommended parameter settings. It should be required reading in bioinformatics. I will add it to the training list of recommended papers to read.
Another interesting tidbit: it is from the author of the FASTA suite, hence the FASTA format ...
That's quite a relaxed e-value threshold. I would say that 1e-6 is used more commonly, but it isn't going to guarantee anything, especially with multi-domain eukaryotic proteins. Even a much stricter e-value threshold, say 1e-60, isn't going to guarantee much, since such an e-value can be due to a single domain shared between the query and subject sequences. OP is on the right track in applying some kind of coverage threshold. I've listed the relevant specifiers below. I would personally feel relatively confident with something like
qlen/slen=1±0.25 && qlen/alen=1±0.25.
qlen - Query sequence length
slen - Subject sequence length
length - Alignment length (alen in the expression above)
Thanks for the above paper.
The paper states that e-values and bit scores (bits > 50) are a more sensitive and reliable basis for inferring homology. I filtered the BLAST hits on those parameters, but the confusion remains: what percent coverage (the percentage of the gene sequence's length that is covered by the alignment) should hits have?
Some of the hits show high identity and a bit score > 50 but cover only 5-10% of the gene sequence. Can we consider these as genes? Is there any defined threshold for the coverage of the alignment?
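For what it's worth, here is a rough sketch of how the length-ratio rule suggested earlier in the thread (qlen/slen and qlen/alen both within 1±0.25) could be checked together with the bits > 50 cutoff. It assumes the search was run with tabular output in the column order `-outfmt "6 qseqid sseqid pident length qstart qend qlen slen evalue bitscore"`; the function names `within` and `score_hits` are made up for illustration, and the 0.25 tolerance is just the value proposed above, not an established standard. Query coverage is also reported per hit so short alignments are easy to spot.

```python
import csv
import io

# Assumed column order; it must match the -outfmt string used for the search.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "qstart", "qend",
           "qlen", "slen", "evalue", "bitscore"]


def within(ratio, tol=0.25):
    """Return True if ratio lies in the interval 1 +/- tol."""
    return abs(ratio - 1.0) <= tol


def score_hits(handle, tol=0.25, min_bits=50.0):
    """Yield one dict per hit, with query coverage and a pass/fail flag added."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        qlen = int(hit["qlen"])
        slen = int(hit["slen"])
        alen = int(hit["length"])
        qstart, qend = int(hit["qstart"]), int(hit["qend"])
        # Fraction of the query (predicted peptide) spanned by this alignment.
        hit["qcov"] = (abs(qend - qstart) + 1) / qlen
        # Rule from the thread: similar query/subject lengths, and an
        # alignment roughly as long as the query, plus bit score > min_bits.
        hit["passes"] = (float(hit["bitscore"]) > min_bits
                         and within(qlen / slen, tol)
                         and within(qlen / alen, tol))
        yield hit


if __name__ == "__main__":
    # Two invented rows, purely to show the input format (not real results).
    demo = ("q1\ts1\t85.0\t390\t5\t395\t400\t420\t1e-120\t350\n"
            "q2\ts2\t95.0\t40\t10\t49\t400\t1200\t1e-10\t80\n")
    for h in score_hits(io.StringIO(demo)):
        print(h["qseqid"], h["sseqid"], round(h["qcov"], 3), h["passes"])
```

Under this rule, a hit that covers only about 10% of the query fails regardless of its identity or bit score, because qlen/alen is then around 10 rather than close to 1. Whether such a hit is a shared domain or a fragmented gene model still has to be judged case by case.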