Hey guys
- Self hit I have this actually a bit weird question about blast.
I’ve been doing some work around single copy genome construction using Reciprocal best blast hit (RBBH) method.
As I have something like 100+ annotated genome, I concatenated all annotated CDS into one fasta and makeblastdb with this fasta. Subsequently I blasted this fasta against the db I just made which resulted in a tabular out put with first column: qseqid, second column: sseqid.
With some python scripts I always (I did all-v-all blast multiple times) found that some qseids and sseqids are only present in one of this 2 columns which means that such ids do not have a “self-hit” in the out put of all-vs-all blast.
This is a bit annoying when later I am trying to construct orthologous gene cluster by “reciprocal best” method. So, I really wonder why this happens.
Blast version: 2.12.0+
- Orthogroup clustering and single copy genome The way I did to obtain ortho-group is to fist obtain ortho-pair. e.g., (A, B), (A, C), (B, C),(D, F)…. By using single linkage clustering algorithm, we can get 2 clusters, (A, B, C) and (D, F). However, when I am trying to get single copy core genome, like this: e.g, if we have 3 genomes
normally:
cluster1: strain1-gene1, strain2-gene1, strain3-gene1
cluster2: strain1-gene2, strain2-gene2, strain3-gene2
… Therefore we can construct core genome:
Strain1: strain1-gene1+ strain1-gene2…\
Strain2: strain2-gene1+ strain2-gene2…\
Strain3: strain3-gene1+ strain3-gene2…\
However, sometimes, by single linkage clustering, in one cluster of gene, one strain may have more than one due to linkage extension:
e.g., we have following ortho-pairs:
(strain1-gene111, strain2-gene111)\
(strain2-gene111, strain3-gene111)\
(strain1-gene111, strain3-gene222)\
So you have a cluster: strain1-gene111, strain2-gene111, strain3-gene111, strain3-gene222
In this case, when constructing the the core genome:
Strain1: strain1-gene1+ strain1-gene2…strain1-gene111\
Strain2: strain2-gene1+ strain2-gene2…strain2-gene111\
Strain3: strain3-gene1+ strain3-gene2…strain3-gene111 + strain3-gene222\
OR
Strain1: strain1-gene1+ strain1-gene2…strain1-gene111\
Strain2: strain2-gene1+ strain2-gene2…strain2-gene111\
Strain3: strain3-gene1+ strain3-gene2…strain3-gene222 + strain3-gene111\
And these 2 different concatenation order will put unknown effect when later alignment is performed and when phylogenetic relationship is compared.
So I really would like to consult about where or how should I stop the single linkage extension when such circumstances are met.
Beste Rongyin
Hi Thanks a lot.
So simply speaking, the blast algorithm will not return a self hit sometimes, may I ask what's the reason? randome process? or masking?
Also, do you have any suggestion of other software or algorithm for reciprocal analysis?
Beste
Rongyin
as said before ;) reason: masking
an alternative not directly, the best you can do is turn masking/filtering off and see what you get. Since you're most likely mainly interested in the best/top/... hits this will likely not affect the overall result too much.
Thanks a lot!