Question

Subtracting FASTAs to get unique sequences?

0

3.8 years ago

yosoyellogan • 0

Hello everyone,

I'm designing complementary capture probes for two strains (type1 and type2) of the same virus. Most of their sequence is shared, but each strain has unique regions. I've tiled both genomes into probes, and I'd like to subtract one probe set from the other, leaving only the sequences that are unique to the type2 virus. The idea is that by designing probes for all of type1 plus the unique regions of type2, we can save money by not making "redundant" probes for sequences that are conserved in both strains.

Is anyone aware of software that can do this type of subtraction? The probes are 120 bp long, and I want to identify which probes differ by >= 5% (>= 6 bp). Kind of like a reverse BLAST. Right now I'm BLASTing each probe and remaking them manually, but there are dozens that need to be remade, as well as entire insertions that the current process doesn't account for. Any help would be appreciated. Please feel free to ask clarifying questions as well.

Examples of the probes:

Sufficiently matching pair:

>strain1:600-720bp
TTGTGGCGGCATCATGTTTTTGGCATGTGTACTTGTCCTCATCGTCGACGCTGTTTTGCAGCTGAGTCCCCTCCTTGGAGCTGTAACTGTGGTTTCCATGACGCTGCTGCTACTGGCTTT

>strain2:600-720bp
TTGTGGCGGCATCATGTTTTTGGCATGTGTACTTGTCCTTATCGTCGACGCTGTTTTGCAGCTGAGTCCCCTCCTTGGAGCTGTAACTGTGGTTTCCATGACGCTGCTGCTACTGGCTTT

Mismatching pair that I would need to remake:

>strain1:5760-5880bp
CCCTCCTCAGAAAACTCTGCATGGAGAAGCTGGACGTGAACCTCCCCCCCAGACCTGTGTGCTGTATTTACAAACACTACAATAAACCCAATGTGCAAATGTGGTTTGTATGGCTACTTT

>strain2:5760-5880bp
CCCTCCTCAGAAAACTCTGCATGGAGAAGCTGGACGTGAACCTTCCCCCCCCCCCCGACCTGTGTGCTGTATTTACAAACACTACAATAAACCCAATGTGCAAATGTGGTTTGTATGGCT
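
The criterion described above (remake a type2 probe when it differs from its type1 counterpart by 6 or more bases) could be checked with a short script along these lines. This is only a minimal sketch, not something from the original post: the function names `edit_distance` and `needs_remake` are made up for illustration, and it assumes that "differs by" can be measured as Levenshtein edit distance, which counts insertions and deletions as well as substitutions, so an inserted run of bases is not missed the way it would be with a position-by-position comparison.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest substitutions, insertions and deletions
    needed to turn sequence a into sequence b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, base_a in enumerate(a, start=1):
        curr = [i]
        for j, base_b in enumerate(b, start=1):
            cost = 0 if base_a == base_b else 1
            curr.append(min(prev[j] + 1,          # delete base_a
                            curr[j - 1] + 1,      # insert base_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def needs_remake(probe1: str, probe2: str, min_diff: int = 6) -> bool:
    """True when the probes differ by at least min_diff edits
    (6 edits is 5% of a 120 bp probe, the cutoff from the post)."""
    return edit_distance(probe1, probe2) >= min_diff


# The two example pairs quoted in the post, keyed by tiling window.
pairs = {
    "600-720bp": (
        "TTGTGGCGGCATCATGTTTTTGGCATGTGTACTTGTCCTCATCGTCGACGCTGTTTTGCAGCTGAGTCCCCTCCTTGGAGCTGTAACTGTGGTTTCCATGACGCTGCTGCTACTGGCTTT",
        "TTGTGGCGGCATCATGTTTTTGGCATGTGTACTTGTCCTTATCGTCGACGCTGTTTTGCAGCTGAGTCCCCTCCTTGGAGCTGTAACTGTGGTTTCCATGACGCTGCTGCTACTGGCTTT",
    ),
    "5760-5880bp": (
        "CCCTCCTCAGAAAACTCTGCATGGAGAAGCTGGACGTGAACCTCCCCCCCAGACCTGTGTGCTGTATTTACAAACACTACAATAAACCCAATGTGCAAATGTGGTTTGTATGGCTACTTT",
        "CCCTCCTCAGAAAACTCTGCATGGAGAAGCTGGACGTGAACCTTCCCCCCCCCCCCGACCTGTGTGCTGTATTTACAAACACTACAATAAACCCAATGTGCAAATGTGGTTTGTATGGCT",
    ),
}

for window, (strain1, strain2) in pairs.items():
    dist = edit_distance(strain1, strain2)
    verdict = "remake for type2" if needs_remake(strain1, strain2) else "type1 probe suffices"
    print(f"{window}: {dist} edits -> {verdict}")
```

This only compares probes that share a tiling window, so it assumes the two genomes stay roughly in register; a large insertion upstream would shift the windows and call for an all-against-all comparison instead.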

alignment sequence FASTA subtraction • 908 views

ADD COMMENT • link updated 3.8 years ago by Mensur Dlakic ★ 28k • written 3.8 years ago by yosoyellogan • 0

score 1 · Answer 1 · 2021-02-18

1

3.8 years ago

Mensur Dlakic ★ 28k

CD-HIT removes redundant sequences by employing identity thresholds. In your case that threshold would be set to 0.95 (95%), and if any two sequences are more than 95% identical it will retain only the longer of the two. In your case it may remove one of them randomly, since they are of identical length.
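
To make the thresholding idea concrete, here is a rough pure-Python sketch of subtracting one probe set from another. It is not how CD-HIT works internally and is no substitute for it on large inputs: it simply compares every type2 probe against every type1 probe and keeps those with no type1 probe within 5 edits (the complement of the ">= 6 bp" cutoff in the question). The names `read_fasta`, `within_edits` and `subtract_probes` are hypothetical, and the sketch assumes all probes are tiled from the same strand, so reverse complements are not checked.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text (an open file or a list of lines) into {header: sequence}."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def within_edits(a: str, b: str, k: int) -> bool:
    """True if a and b are at most k edits apart. Only a diagonal band of
    width 2k+1 is filled, and the scan stops once the band exceeds k."""
    if abs(len(a) - len(b)) > k:
        return False
    too_far = k + 1  # any value above k is capped here
    prev = [j if j <= k else too_far for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [too_far] * (len(b) + 1)
        if i <= k:
            curr[0] = i
        for j in range(max(1, i - k), min(len(b), i + k) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost, too_far)
        if min(curr) >= too_far:
            return False  # every alignment path already needs more than k edits
        prev = curr
    return prev[len(b)] <= k


def subtract_probes(type1: Dict[str, str], type2: Dict[str, str],
                    max_edits: int = 5) -> Dict[str, str]:
    """Return the type2 probes that have no type1 probe within max_edits edits."""
    return {
        name: seq
        for name, seq in type2.items()
        if not any(within_edits(seq, ref, max_edits) for ref in type1.values())
    }


# Toy demonstration with short made-up sequences and a tolerance of 1 edit.
type1 = read_fasta([">t1_probe", "ACGTACGTAC"])
type2 = read_fasta([">t2_shared", "ACGTACGAAC", ">t2_unique", "TTTTTTTTTT"])
print(list(subtract_probes(type1, type2, max_edits=1)))
```

For real data the dictionaries would come from something like `read_fasta(open("type2_probes.fasta"))`, where the file name is a placeholder. The all-against-all loop scales with the product of the two set sizes, which should be tolerable for viral genomes but is the reason dedicated clustering tools exist.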

ADD COMMENT • link 3.8 years ago by Mensur Dlakic ★ 28k

1

CD-HIT is excellent. There's also a newer tool called VSEARCH, which aims to be an alternative to the non-free USEARCH.

ADD REPLY • link 3.8 years ago by 5heikki 11k

0

I was having a hard time installing CD-HIT, so I checked out VSEARCH and managed to install it successfully. Which VSEARCH function would best address my question? Most notably, is it possible to filter to allow some mismatches, since <5% mismatch is acceptable? I'm reading through the documentation now but some advice is always appreciated :)

ADD REPLY • link 3.8 years ago by yosoyellogan • 0

score 0 · Answer 2 · 2021-02-18

0

3.8 years ago

GenoMax 148k

You can try CD-HIT (LINK) to cluster the sequences and identify ones that are unique.

You could also use clumpify from the BBMap suite. It will work with FASTA files; see A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files

ADD COMMENT • link 3.8 years ago by GenoMax 148k