I am trying to compute statistics for a genome assembly.
According to some articles, such statistics often include information that can be obtained with BLAST.
For example, "Top BLASTx-hit species" and "Percent of gene with at least one BLASTx hit (E ≤ 1.0-3)", described in table S1 of this paper, fall into this category.
It tells you something about the gene content of your assembly.
For the percentage of genes that have a blastx hit: that percentage tells you how many genes are 'known' (i.e., have been seen in another genome). It tells you a little about the quality of the gene content/gene annotation. Given that a fair number of species have been sequenced by now, you can assume that many of your genes will have been seen in other genomes. If you see that a large percentage of your genes has no hit, it could point to poor annotation quality (you annotated many spurious, false-positive genes).
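As a rough sketch of how that percentage could be computed: this assumes you already ran BLAST with the default tabular output (`-outfmt 6`, where column 1 is the query ID and column 11 is the E-value) and that your gene sequences are in a FASTA file. The function names and the 1e-3 cutoff are illustrative choices, not taken from the paper.

```python
def fasta_ids(fasta_lines):
    """Collect sequence IDs (first word after '>') from FASTA lines."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            if header:
                ids.add(header[0])
    return ids


def queries_with_hit(blast_lines, max_evalue=1e-3):
    """Return query IDs that have at least one hit with E-value <= max_evalue.

    Assumes default BLAST tabular columns (-outfmt 6):
    column 1 = query ID, column 11 = E-value.
    """
    hits = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines (-outfmt 7 style)
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= max_evalue:
            hits.add(fields[0])
    return hits


def percent_with_hit(fasta_lines, blast_lines, max_evalue=1e-3):
    """Percentage of genes in the FASTA that have at least one qualifying hit."""
    genes = fasta_ids(fasta_lines)
    if not genes:
        return 0.0
    hit = queries_with_hit(blast_lines, max_evalue) & genes
    return 100.0 * len(hit) / len(genes)
```

You would call it with two open file handles, for example `percent_with_hit(open("genes.fasta"), open("hits.tsv"))`, where both file names are placeholders. Genes without any line in the BLAST output count as "no hit", which is why the FASTA is needed as the denominator.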
For the top-hit species: if you sequenced a eukaryote and the top hits of many of your genes are, for instance, bacterial, you can take that as an indication of contamination. On top of that, you would expect the best hits to come from phylogenetically closely related species. If that's not the case, it could also point to potential issues.
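The top-hit species tally can be sketched in a similar way. This assumes a custom tabular output with exactly these columns, `-outfmt "6 qseqid sseqid evalue bitscore sscinames"`, which in turn requires a database with taxonomy information and the BLAST taxonomy files installed. Picking the best hit per gene by highest bit score is my choice here, not something the paper specifies, and the function name is made up for illustration.

```python
from collections import Counter


def top_hit_species(blast_lines):
    """Count, per species, how many genes have their best hit in that species.

    Assumes tab-separated columns: qseqid, sseqid, evalue, bitscore, sscinames.
    The best hit of a gene is taken to be the one with the highest bit score.
    """
    best = {}  # query ID -> (bit score, species name)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        qseqid, _sseqid, _evalue, bitscore, species = line.rstrip("\n").split("\t")[:5]
        score = float(bitscore)
        if qseqid not in best or score > best[qseqid][0]:
            best[qseqid] = (score, species)
    return Counter(species for _score, species in best.values())
```

Calling `.most_common(10)` on the result gives the ten most frequent top-hit species, which is usually enough to spot unexpected bacterial hits in a eukaryotic assembly.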
Those stats can also be derived, by the way, from other BLAST analyses (e.g., blastp), so this is nothing specific to blastx.
However, running blastx on the 'raw' genomic sequence is somewhat less intuitive to interpret. If you have annotations (genes), you'd be better off running it on those.
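If it helps, here is a minimal sketch of how such a run on the annotated gene sequences could be set up from Python. The flags are standard BLAST+ options; the file names, the database name, the E-value cutoff, and the thread count are placeholders you would replace with your own.

```python
def build_blastx_command(query_fasta, db, out_path, evalue=1e-3, threads=4):
    """Build the argument list for a blastx run with tabular output.

    query_fasta: nucleotide FASTA of the annotated genes (e.g. CDS sequences)
    db: name or path of a protein BLAST database (placeholder, e.g. a local one)
    """
    return [
        "blastx",
        "-query", query_fasta,
        "-db", db,
        "-evalue", str(evalue),        # only report hits at or below this E-value
        "-outfmt", "6",                # default 12-column tabular output
        "-max_target_seqs", "5",       # keep a handful of hits per gene
        "-num_threads", str(threads),
        "-out", out_path,
    ]


# Example (not executed here; requires BLAST+ and a formatted protein database):
#   import subprocess
#   cmd = build_blastx_command("genes.fasta", "my_protein_db", "hits.tsv")
#   subprocess.run(cmd, check=True)
```

If you already have the predicted protein sequences, the same approach should work with `blastp` instead of `blastx`.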
Thank you very much for your advice!
I see. It's an indicator of the quality of the assembly.
Can this be done by running BLAST on the fasta file? I'm thinking of running "blastx" in a terminal.
Yes, that is very possible indeed.
I will run BLAST as you suggested.
Your advice is very helpful to me. Thank you very much for your cooperation!