Hi BioStar,
We would like to announce the open-source release of a new tool to compare huge metagenomic samples: Compareads. The goal of Compareads is to find all the similar reads between two samples and to give a similarity score based on those shared reads.
We consider two reads (one from each sample) similar if they share at least m non-overlapping k-mers. In short, given two read sets A and B, Compareads finds the subset of reads from A that are similar to a read in B, and the subset of reads from B that are similar to a read in A.
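To make the definition concrete, here is a minimal, naive Python sketch of the similarity criterion. It is not how Compareads is implemented (it compares every pair of reads, so it will not scale), and the function names are made up for illustration. It also assumes that "non-overlapping" is measured on the read being queried, and it ignores reverse complements.

```python
def shared_nonoverlapping_kmers(read_a, read_b, k):
    """Count non-overlapping k-mers of read_a that also occur in read_b.

    Assumption: non-overlap is measured on read_a only. Since all k-mers
    have the same length, a greedy left-to-right scan gives the maximum count.
    """
    kmers_b = {read_b[i:i + k] for i in range(len(read_b) - k + 1)}
    count = 0
    i = 0
    while i + k <= len(read_a):
        if read_a[i:i + k] in kmers_b:
            count += 1
            i += k  # jump past this k-mer so the next one cannot overlap it
        else:
            i += 1
    return count


def are_similar(read_a, read_b, k, m):
    """Two reads are similar if they share at least m non-overlapping k-mers."""
    return shared_nonoverlapping_kmers(read_a, read_b, k) >= m


def similar_subsets(reads_a, reads_b, k, m):
    """Return (reads of A similar to some read of B, reads of B similar to some read of A)."""
    a_in_b = [a for a in reads_a if any(are_similar(a, b, k, m) for b in reads_b)]
    b_in_a = [b for b in reads_b if any(are_similar(b, a, k, m) for a in reads_a)]
    return a_in_b, b_in_a


# Toy data, invented for this example
reads_a = ["ACGTACGTAA", "TTTTTTTTTT"]
reads_b = ["ACGTACGTCC", "GGGGGGGGGG"]
a_in_b, b_in_a = similar_subsets(reads_a, reads_b, k=4, m=2)
print(a_in_b, b_in_a)
```

With k=4 and m=2, only the first read of each toy set is reported, because they share the k-mer ACGT at two non-overlapping positions.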
In the publication, we show that Compareads can retrieve biological information while scaling to huge datasets. Its time and memory requirements make it usable on read sets of more than 100 million Illumina reads each, running in a few hours and using 4 GB of memory, and thus on today's personal computers.
Download link and PDF article: http://alcovna.genouest.org/compareads/
Looking forward to hearing your feedback,
Nicolas
OK, you are right, it was not really clear. I updated the post to add a little more information. I also updated the docs to add a toy example!
Thanks