Question

Sequence clustering tool that can break sequences?

0

4.5 years ago

liorglic ★ 1.4k

Hello,
Is anybody aware of a tool that can cluster genomic sequences (e.g. contigs) while also breaking (splitting) sequences when needed?
Traditional clustering tools (CD-HIT, Blastclust) look at the overall similarity between sequences in order to decide whether they should be clustered together or not. What I'm looking for is a tool that can cluster together parts of sequences and leave out regions that are not similar. Here's an illustration.

Any idea whether such a tool exists, or how this can be achieved?

Thanks!

clustering • 1.5k views

ADD COMMENT • link 4.5 years ago by liorglic ★ 1.4k

0

Sounds a bit like you're looking for local multiple alignment?

ADD REPLY • link 4.5 years ago by Joe 21k

0

Not exactly (I think), because my data set contains thousands of sequences, most of which shouldn't be clustered together at all. So the tool I'm looking for should be able to do both the clustering and the multiple local alignment steps. Maybe I need two different tools, but I'm not sure how that would work. Any ideas?

ADD REPLY • link 4.5 years ago by liorglic ★ 1.4k

0

If your illustration is accurate, then you are basically looking to do an ungapped multiple sequence alignment and then break it at the boundaries where the vertical pileup goes from 1 -> 2 -> 3 (and so on) sequences? I can't think of a tool off the top of my head, but someone may have come across something.

ADD REPLY • link 4.5 years ago by GenoMax 147k

0

Yeah, I think GenoMax's approach sounds elegant, but I have certainly never come across a tool that would do this.

I guess you could hack this out of the assembly graph? You would just need to isolate all the 'bubbles' between any 2 nodes and then you can probably iteratively filter down the 'chunks'.

If you have a lot of contigs, or they're large, it's going to be difficult to do this via alignment, so you might have to get creative with something like the graph.

ADD REPLY • link 4.5 years ago by Joe 21k

0

I wanted to reply with something similar. I also don't know of an existing tool, but instead of looking for a clustering method, it may help to re-map the contigs with BWA or minimap and work from the BAM files.

Maybe you can go through every position of the BAM/SAM file, and wherever the depth changes, that position is a start/end for a "cluster". In your case (example):

Start the loop; the starting depth is 1.
If the depth changes (to 2), detect the change and create cluster a.
Continue; when the depth changes from 2 to 3, create cluster b, etc.

In practice I would first find the positions where the depth changes and after that run a second script that extracts the nucleotides.
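
To make the depth-change idea above concrete, here is a minimal Python sketch of the first step only (finding the positions where the depth changes). It is an illustration under assumptions, not an existing tool: the function name `depth_segments` is hypothetical, and it takes alignment coordinates as plain half-open `(start, end)` tuples on a single reference, so reading them out of a real BAM/SAM file is left out.

```python
from collections import defaultdict


def depth_segments(alignments):
    """Split one reference into runs of constant coverage depth.

    alignments: iterable of (start, end) half-open intervals (hypothetical
    input format; in practice these would come from the BAM/SAM records).
    Returns a list of (start, end, depth) tuples for regions with depth > 0.
    """
    # Record +1 where an alignment starts and -1 where it ends.
    delta = defaultdict(int)
    for start, end in alignments:
        if end <= start:
            continue  # ignore empty or inverted intervals
        delta[start] += 1
        delta[end] -= 1

    segments = []
    depth = 0
    prev = None
    for pos in sorted(delta):
        if delta[pos] == 0:
            continue  # one alignment ends exactly where another starts
        if depth > 0:
            # Depth changes here, so close the segment that began at prev.
            segments.append((prev, pos, depth))
        depth += delta[pos]
        prev = pos
    return segments


if __name__ == "__main__":
    # Three nested alignments give a pileup of 1 -> 2 -> 3 -> 2 -> 1.
    for segment in depth_segments([(0, 100), (20, 80), (40, 60)]):
        print(segment)
```

Each returned segment is a candidate "cluster" region. A second script could then slice the corresponding nucleotides out of the reference using those coordinates.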

ADD REPLY • link 4.5 years ago by gb ★ 2.2k

0

4.5 years ago

Mensur Dlakic ★ 28k

It used to be a more active area of research, but I don't have a recent solution to recommend. Some of these will be unavailable, but at least you can read about the algorithms they used to solve the problem.

ADD COMMENT • link 4.5 years ago by Mensur Dlakic ★ 28k

1

Aren't these tools all for protein sequences? I think the original poster is asking for a tool that will work with DNA sequences.

ADD REPLY • link 4.5 years ago by GenoMax 147k

0

Missed that part. Appreciate you looking up the references I posted.

A general strategy of cutting matches in chunks should be the same, though the software for DNA may be different or not even exist.

ADD REPLY • link 4.5 years ago by Mensur Dlakic ★ 28k

score 0 · Accepted Answer · 2020-06-10

0

4.5 years ago

liorglic ★ 1.4k

Thanks everybody for your helpful replies. I ended up implementing an iterative algorithm that does not include a clustering step. Basically, at each iteration I map sequences from one genome to the collection of genomic sequences (using minimap2) and extract un-mapped sequences. I then add these to the collection and proceed to the next genome.
This results in something similar to performing clustering and taking the longest representative from each cluster, and it's pretty fast too.
I can share the code if somebody needs it.
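
For readers who want to try something along these lines, here is a rough Python sketch of one iteration's "extract un-mapped sequences" step. It is not the code mentioned above, only an illustration under assumptions: it expects minimap2 output in PAF format (where columns 3 and 4 are the 0-based query start and end), and the function name `unmapped_regions`, the `seq_lengths` dictionary and the `min_len` cutoff are all hypothetical.

```python
def unmapped_regions(paf_lines, seq_lengths, min_len=1):
    """Find the parts of each query sequence not covered by any alignment.

    paf_lines: iterable of PAF-formatted lines (query = new genome's sequences).
    seq_lengths: dict of query name -> length; needed because sequences with
        no alignment at all do not appear in the PAF output.
    min_len: hypothetical cutoff; shorter unmapped stretches are discarded.
    Returns a dict of query name -> list of (start, end) half-open intervals.
    """
    covered = {name: [] for name in seq_lengths}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        name, q_start, q_end = fields[0], int(fields[2]), int(fields[3])
        if name in covered:
            covered[name].append((q_start, q_end))

    result = {}
    for name, length in seq_lengths.items():
        gaps = []
        cursor = 0  # end of the region covered so far
        for start, end in sorted(covered[name]):
            if start - cursor >= min_len:
                gaps.append((cursor, start))
            cursor = max(cursor, end)
        if length - cursor >= min_len:
            gaps.append((cursor, length))
        if gaps:
            result[name] = gaps
    return result
```

In a full loop, the intervals returned for one genome would be cut out of its FASTA, appended to the collection, and the next genome would then be mapped against the enlarged collection. As discussed in the replies, the outcome of such a loop depends on the order in which genomes are processed.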

ADD COMMENT • link 4.5 years ago by liorglic ★ 1.4k

0

"extract un-mapped sequences."

Just the part that remains unmapped?

If I understand your explanation you are not doing this exhaustively then. Just whichever genome you pick first. I assume if you start with different genomes then the answer would be different?

ADD REPLY • link 4.5 years ago by GenoMax 147k

1

Yes, you're right. Different orders of genomes can result in slightly different results. I'm OK with this heuristic for my current needs, but I guess this won't work in other cases.

ADD REPLY • link 4.5 years ago by liorglic ★ 1.4k