Hello,
Is anybody aware of a tool that can cluster genomic sequences (e.g. contigs) while also breaking (splitting) sequences when needed?
Traditional clustering tools (CD-HIT, Blastclust) look at the overall similarity between sequences in order to decide whether they should be clustered together or not. What I'm looking for is a tool that can cluster together parts of sequences and leave out regions that are not similar. Here's an illustration.
Any idea if such a tool exists, or how this can be achieved?
Thanks!
Sounds a bit like you're looking for local multiple alignment?
Not exactly (I think), because in my data set there are thousands of sequences, most of which shouldn't be clustered together at all. So the tool I'm looking for should be able to do both the clustering and the multiple local alignment steps. Maybe I need two different tools, but I'm not sure how that would work. Any ideas?
If your illustration is accurate, then you are basically looking to do an ungapped multiple sequence alignment and then break it at the boundaries where the vertical pileup goes from 1 -> 2 -> 3 (and so on) sequences? I can't think of a tool off the top of my head, but someone may have come across something.
Yeah I think Genomax's approach sounds elegant, but I certainly have never come across a tool that would do this.
I guess you could hack this out of the assembly graph? You would just need to isolate all the 'bubbles' between any 2 nodes and then you can probably iteratively filter down the 'chunks'.
If you have a lot of contigs, or they're large, it's going to be difficult to do this via alignment, so you might have to get creative with something like the graph.
I was going to reply with something similar. I also don't know of an existing tool, but maybe it helps to look into a method where you re-map the contigs with BWA or minimap and do something with the BAM files instead of looking for a clustering method.
Maybe you can go through every position of the BAM/SAM file, and wherever the depth changes you have a start/end for the "cluster". In your case (example):
In practice I would first find the positions where the depth changes, and after that run a second script that extracts the nucleotides.
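A rough sketch of that two-step idea in Python, in case it helps. It is not an existing tool, just an illustration: the function names (`depth_segments`, `extract_segments`) and the `min_depth` parameter are made up here, and the alignments are assumed to be already reduced to 0-based, half-open (start, end) intervals on one reference sequence (for example, pysam's `reference_start` and `reference_end` use that convention). Note that it splits only where the depth changes, so a spot where one alignment ends exactly where another begins would not be split.

```python
from collections import defaultdict


def depth_segments(intervals, ref_len):
    """Split a reference into runs of constant alignment depth.

    intervals: iterable of (start, end) alignment coordinates,
               0-based and half-open (assumed input format).
    ref_len:   length of the reference sequence.
    Returns a list of (start, end, depth) tuples that tile [0, ref_len).
    """
    # Sweep-line bookkeeping: +1 where an alignment starts, -1 where it ends.
    delta = defaultdict(int)
    for start, end in intervals:
        delta[start] += 1
        delta[end] -= 1
    # Make sure both reference ends are present as candidate boundaries.
    delta[0] += 0
    delta[ref_len] += 0

    segments = []
    depth = 0
    positions = sorted(delta)
    for pos, nxt in zip(positions, positions[1:]):
        depth += delta[pos]
        if segments and segments[-1][2] == depth:
            # Depth did not change at this boundary: extend the previous run.
            segments[-1] = (segments[-1][0], nxt, depth)
        else:
            segments.append((pos, nxt, depth))
    return segments


def extract_segments(ref_seq, segments, min_depth=1):
    """Second step: pull out the nucleotides of each run with enough depth."""
    return [
        (start, end, depth, ref_seq[start:end])
        for start, end, depth in segments
        if depth >= min_depth
    ]


if __name__ == "__main__":
    # Toy data: three alignments on a 12 bp reference (hypothetical values).
    ref = "ACGTACGTACGT"
    alignments = [(0, 10), (4, 10), (4, 7)]
    segs = depth_segments(alignments, len(ref))
    for row in extract_segments(ref, segs):
        print(row)
```

With real data the intervals would come from the BAM file rather than a hard-coded list, and you would run this per reference sequence. Raising `min_depth` to 2 keeps only the regions shared by at least two sequences.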