Hi, I am in charge of finding specific genes in 600,000 Salmonella genomes. Some people advised me to use COBS for the indexing, so I tried it on a few genomes as a test.
But I don't really understand the output...
I copied a subsequence (55 bp) from one of my genomes and ran COBS to see if it would find it. In the output I got 24 (see below).
**output:**
SRR18349609 24
SRR18349610 24
SRR18349611 24
Does that mean it got 24 hits on my 55 bp query?
And is it possible to get more information in the output, like E-value/p-value, location, etc. (like blastn)?
Thank you for everything!
You may also try KMCP, which uses an index structure similar to COBS but searches faster. However, the speedup is not obvious on the 661K dataset: it contains a huge number of similar genomes, which produces too many hits per query, so writing the results becomes the performance bottleneck.
It looks great! I'll try it!
Thank you very much
Thanks, it is very helpful!
And if there are multiple matches in the same genome (repeated sequences), will that change the output?
No, these kinds of methods can't tell the number of sequence copies or sequence locations. They just tell whether the query is contained in the subject, and there are some false positives (see BIGSI/COBS paper).
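To illustrate why copies and positions are lost, here is a toy sketch. It assumes the index only records which k-mers occur in each genome and scores a query by how many of its k-mers are found. It uses an exact Python set instead of the Bloom filters the real tools use, so it has no false positives, and the value of `K`, the sequences, and the function names are all made up for the example.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq (positions and counts are discarded)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_hits(query, genome_kmers, k):
    """Count how many k-mer positions of the query are present in the genome's k-mer set."""
    return sum(query[i:i + k] in genome_kmers for i in range(len(query) - k + 1))


K = 5  # toy k-mer size chosen for readability, not a tool default
query = "ACGTACGGTCA"

# One genome carries the query once, the other carries it twice.
single = "TTTT" + query + "GGGG"
repeated = "TTTT" + query + "CCCC" + query + "GGGG"

hits_single = kmer_hits(query, kmers(single, K), K)
hits_repeated = kmer_hits(query, kmers(repeated, K), K)

print(hits_single, hits_repeated)
```

Both genomes get the same score, because a set of k-mers only remembers presence. Under the same assumption, a query of length L contains L - k + 1 k-mers, so that is the highest score it can reach.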
God bless you
Thank you for everything!