Subtracting FASTAs to get unique sequences?
3.8 years ago

Hello everyone,

I'm designing complementary capture probes for two strains (type1 and type2) of the same virus. Most of their sequence is shared, but each strain has unique regions. I've tiled both genomes into probes, and I'd like to subtract one probe set from the other, leaving only the sequences that are unique to the type2 virus. The idea is that by designing probes for all of type1 plus the unique regions of type2, we can save money by not making "redundant" probes for sequences that are conserved between both strains.

Is anyone aware of software that can do this type of subtraction? The probes are 120 bp long, and I want to identify which probes differ by >= 5% (>= 6 bp). It's kind of like a reverse BLAST. Right now I'm BLASTing each probe and remaking them manually, but there are dozens that need to be remade, as well as entire insertions that the current process doesn't account for. Any help would be appreciated. Please feel free to ask clarifying questions as well.

Examples of the probes:

Sufficiently matching pair:

>strain1:600-720bp
TTGTGGCGGCATCATGTTTTTGGCATGTGTACTTGTCCTCATCGTCGACGCTGTTTTGCAGCTGAGTCCCCTCCTTGGAGCTGTAACTGTGGTTTCCATGACGCTGCTGCTACTGGCTTT

>strain2:600-720bp
TTGTGGCGGCATCATGTTTTTGGCATGTGTACTTGTCCTTATCGTCGACGCTGTTTTGCAGCTGAGTCCCCTCCTTGGAGCTGTAACTGTGGTTTCCATGACGCTGCTGCTACTGGCTTT

Mismatching pair that I would need to remake:

>strain1:5760-5880bp
CCCTCCTCAGAAAACTCTGCATGGAGAAGCTGGACGTGAACCTCCCCCCCAGACCTGTGTGCTGTATTTACAAACACTACAATAAACCCAATGTGCAAATGTGGTTTGTATGGCTACTTT

>strain2:5760-5880bp
CCCTCCTCAGAAAACTCTGCATGGAGAAGCTGGACGTGAACCTTCCCCCCCCCCCCGACCTGTGTGCTGTATTTACAAACACTACAATAAACCCAATGTGCAAATGTGGTTTGTATGGCT
alignment sequence FASTA subtraction • 908 views
3.8 years ago
Mensur Dlakic ★ 28k

CD-HIT removes redundant sequences by applying an identity threshold. In your case that threshold would be set to 0.95 (95%), and if any two sequences are more than 95% identical it will retain only the longer of the two. Since your probes are all of identical length, it may remove one of each such pair randomly.
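
To make the 95% threshold concrete for same-coordinate probe pairs like the ones in the question, here is a minimal Python sketch. It is not how CD-HIT works internally: it only compares two probes position by position, and the function names `count_mismatches` and `is_unique_probe` are made up for illustration. Because there is no alignment step, an insertion shifts everything downstream and inflates the mismatch count. That still flags the probe for redesign, but the number should not be read as a true alignment identity.

```python
def count_mismatches(probe_a: str, probe_b: str) -> int:
    """Count position-by-position differences between two probes.

    Bases beyond the end of the shorter probe are counted as mismatches.
    """
    a, b = probe_a.upper(), probe_b.upper()
    differing = sum(1 for x, y in zip(a, b) if x != y)
    return differing + abs(len(a) - len(b))


def is_unique_probe(probe_a: str, probe_b: str, min_percent_diff: int = 5) -> bool:
    """Return True if the probes differ by at least min_percent_diff percent.

    Integer arithmetic avoids floating-point trouble exactly at the cutoff
    (6 mismatches out of 120 bp is exactly 5%).
    """
    length = max(len(probe_a), len(probe_b))
    if length == 0:
        raise ValueError("probes must not be empty")
    return count_mismatches(probe_a, probe_b) * 100 >= min_percent_diff * length


# The two example pairs from the question.
strain1_600 = "TTGTGGCGGCATCATGTTTTTGGCATGTGTACTTGTCCTCATCGTCGACGCTGTTTTGCAGCTGAGTCCCCTCCTTGGAGCTGTAACTGTGGTTTCCATGACGCTGCTGCTACTGGCTTT"
strain2_600 = "TTGTGGCGGCATCATGTTTTTGGCATGTGTACTTGTCCTTATCGTCGACGCTGTTTTGCAGCTGAGTCCCCTCCTTGGAGCTGTAACTGTGGTTTCCATGACGCTGCTGCTACTGGCTTT"
strain1_5760 = "CCCTCCTCAGAAAACTCTGCATGGAGAAGCTGGACGTGAACCTCCCCCCCAGACCTGTGTGCTGTATTTACAAACACTACAATAAACCCAATGTGCAAATGTGGTTTGTATGGCTACTTT"
strain2_5760 = "CCCTCCTCAGAAAACTCTGCATGGAGAAGCTGGACGTGAACCTTCCCCCCCCCCCCGACCTGTGTGCTGTATTTACAAACACTACAATAAACCCAATGTGCAAATGTGGTTTGTATGGCT"

for name, a, b in [
    ("600-720bp", strain1_600, strain2_600),
    ("5760-5880bp", strain1_5760, strain2_5760),
]:
    verdict = "remake for type2" if is_unique_probe(a, b) else "type1 probe is sufficient"
    print(f"{name}: {count_mismatches(a, b)} mismatches, {verdict}")
```

This only works when the two tilings share coordinates, as in the examples. For probes that do not line up one to one, an alignment-based tool such as CD-HIT is the better fit.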

CD-HIT is excellent. There's also a newer tool called VSEARCH, which aims to be an alternative to the non-free USEARCH.

I was having a hard time installing CD-HIT, so I checked out VSEARCH and managed to install it successfully. Which VSEARCH function would best address my question? Most notably, is it possible to filter to allow some mismatches, since <5% mismatch is acceptable? I'm reading through the documentation now but some advice is always appreciated :)
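
One possibility, offered as an untested sketch rather than a verified recipe: VSEARCH's `--usearch_global` command searches query sequences against a database at a minimum identity given by `--id`, and `--notmatched` writes the queries that found no hit to a FASTA file. Searching the type2 probes against the type1 probes at `--id 0.95` should therefore leave the type2 probes that lack a type1 counterpart at 95% identity or better. The Python below only assembles and prints the command line. The file names and the helper name `build_vsearch_command` are hypothetical, and the flags are worth checking against `vsearch --help` for the installed version.

```python
import shlex


def build_vsearch_command(type2_probes: str, type1_probes: str,
                          unique_out: str, identity: float = 0.95) -> list[str]:
    """Assemble a VSEARCH call that keeps type2 probes with no type1 match.

    File paths are placeholders; nothing is executed here.
    """
    return [
        "vsearch",
        "--usearch_global", type2_probes,  # queries: the type2 probe set
        "--db", type1_probes,              # database: the type1 probe set
        "--id", str(identity),             # minimum identity to count as a match
        "--strand", "both",                # also consider reverse complements
        "--notmatched", unique_out,        # queries without a match go here
    ]


command = build_vsearch_command(
    "type2_probes.fasta",   # hypothetical input file
    "type1_probes.fasta",   # hypothetical input file
    "type2_unique.fasta",   # hypothetical output file
)
print(shlex.join(command))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(command, check=True)`. VSEARCH computes identity from an alignment, so insertions are handled differently from a simple position-by-position mismatch count, and borderline probes near 95% are worth inspecting by hand.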

3.8 years ago
GenoMax 148k

You can try CD-HIT (LINK) to cluster the sequences and identify ones that are unique.

You could also use clumpify from the BBMap suite. It will work with FASTA files; see the post "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files".
