Question

Tool: Compareads: Comparing Huge Metagenomic Experiments

4

12.3 years ago

Nico ▴ 180

Hi BioStar,

We would like to announce the open-source release of a new tool to compare huge metagenomic samples: Compareads. The goal of Compareads is to find all the similar reads between two samples and to give a similarity score based on those shared reads.

We consider two reads (one from each sample) to be similar if they share at least m non-overlapping k-mers. Compareads is designed to find those similar sequences between two samples. In short, given two read sets A and B, the goal of Compareads is to find the subset of reads from A that are similar to a read in B, and the subset of reads from B that are similar to a read in A.
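To make the definition concrete, here is a minimal Python sketch of the pairwise criterion. It is only an illustration of the definition above, not the Compareads implementation: the function names are hypothetical, non-overlap is enforced on the first read only (an assumption, since the post does not say), and the all-pairs loop would not scale to real datasets.

```python
def kmers(seq, k):
    """Return the set of all k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_nonoverlapping_kmers(read_a, read_b, k):
    """Count non-overlapping k-mers of read_a that also occur in read_b.

    Greedy left-to-right scan: after a hit, jump k positions so that the
    counted k-mers never overlap on read_a.
    """
    b_kmers = kmers(read_b, k)
    count = 0
    i = 0
    while i + k <= len(read_a):
        if read_a[i:i + k] in b_kmers:
            count += 1
            i += k  # skip past this k-mer to avoid overlap
        else:
            i += 1
    return count


def are_similar(read_a, read_b, k, m):
    """Two reads are similar if they share at least m non-overlapping k-mers."""
    return shared_nonoverlapping_kmers(read_a, read_b, k) >= m


def similar_subsets(A, B, k, m):
    """Naive all-pairs version: reads of A similar to some read of B, and vice versa."""
    a_in_b = [a for a in A if any(are_similar(a, b, k, m) for b in B)]
    b_in_a = [b for b in B if any(are_similar(b, a, k, m) for a in A)]
    return a_in_b, b_in_a


if __name__ == "__main__":
    A = ["ACGTACGT", "TTTTTTTT"]
    B = ["ACGTACGT", "GGGGGGGG"]
    print(similar_subsets(A, B, k=4, m=2))
```

The two returned lists correspond to the two subsets described above: reads of A with a similar read in B, and reads of B with a similar read in A.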

In the publication, we show that Compareads retrieves biological information while scaling to huge datasets. Its time and memory requirements make Compareads usable on read sets of more than 100 million Illumina reads each, running in a few hours and consuming 4 GB of memory, and thus usable on today's personal computers.

Download link and PDF article: http://alcovna.genouest.org/compareads/

Looking forward to hearing your feedback,

Nicolas

metagenomics denovo next-gen • 3.0k views

ADD COMMENT • link updated 22 months ago by Ram 45k • written 12.3 years ago by Nico ▴ 180

score 3 · Answer 1 · 2013-01-29

One suggestion, if I may: it would help to provide a more explicit usage scenario for the tool. "Comparing" is a somewhat generic concept, and it is clear neither from the description above nor from that of the tool what one should expect as the resulting information, especially since the authors also mention that they don't know of other tools with similar functionality. That pretty much leaves the reader with nothing to compare it to.

Say I have two large metagenomics samples and I run the tool. What is the output? Reads with counts in each sample? Sub-sequences that can be found in both samples? Can I run it by giving it randomly sheared bacterial sequences and thus perform some sort of classification with it? Can I use it to de-duplicate samples?

I would just give a simple example in the docs.

score 2 · Answer 2 · 2013-01-29

2

12.3 years ago

Manu Prestat 4.1k

It looks interesting. It's a way to address a problem: how to cluster samples according to all the sequencing information, and not only the information in the bio databanks? For that purpose, I personally followed the "blast all vs all" process a couple of years ago, which, as stated in the paper, is no longer a good solution in terms of computational time. I've not looked into the paper very closely yet, but I will!

I'm wondering how you guys deal with datasets of different sizes, though that might be in the paper, which I've only skimmed so far.

ADD COMMENT • link 12.3 years ago by Manu Prestat 4.1k

0

Thank you! Sample sizes are not really important for finding which reads from A occur in B and vice versa. But they do matter when you look at the similarity score based on those shared reads.

For the moment, the similarity score is normalized by the size of the samples, but with extremely different sizes it might not be that relevant. So we added a basic option to the software to use only the first X reads of each sample. For example, let A be a sample with 1 million reads and B a sample with 500 million. The software can run with only the first million reads from B and all of A.

To that end, we are currently studying how Compareads can subsample datasets and still be reliable.
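As a rough illustration of why very different sample sizes matter, here is a small Python sketch. The reply does not give the score formula, so the one below (shared reads from both samples divided by the total number of reads) is only an assumption, as are the function names and the made-up counts; the truncation helper simply mimics the "first X reads" idea.

```python
from itertools import islice


def first_reads(reads, x):
    """Keep only the first x reads of an iterable of reads."""
    return list(islice(reads, x))


def similarity_score(shared_a, shared_b, size_a, size_b):
    """Assumed score: fraction of all reads (A and B pooled) that are shared.

    shared_a: number of reads of A similar to a read in B
    shared_b: number of reads of B similar to a read in A
    """
    total = size_a + size_b
    return (shared_a + shared_b) / total if total else 0.0


if __name__ == "__main__":
    # Made-up counts: every read of A (1 million) is found in B, but B has
    # 500 million reads, so the pooled score stays tiny whatever A contains.
    print(similarity_score(1_000_000, 2_000_000, 1_000_000, 500_000_000))

    # Truncating B to its first million reads puts both samples on the
    # same footing before the score is computed (counts again made up).
    print(similarity_score(600_000, 550_000, 1_000_000, 1_000_000))
```

Taking the first X reads is only a fair subsample if the read order in the file is effectively random, which is part of what makes the reliability question above non-trivial.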

ADD REPLY • link 12.3 years ago by Nico ▴ 180