Here's a demo Python script you can modify for your use, which suggests the rough principle:
#!/usr/bin/env python
import sys
import re
bed = """chr1\t0\t10\tABCDEFGHIJ
chr1\t5\t15\tFGHIJABCDO
chr1\t10\t20\tABCDOPABCD"""
string_to_match = sys.argv[1]
pattern = re.compile(string_to_match)
for line in bed.split("\n"):
(chr, start, stop, id) = line.split("\t")
for match in pattern.finditer(id):
sys.stdout.write("\t".join([chr, str(int(start) + match.start()), str(int(start) + match.end()), string_to_match]) + "\n")
Some sample runs:
$ ./test.py HIJABC
chr1 7 13 HIJABC
$ ./test.py HIJAB
chr1 7 12 HIJAB
$ ./test.py BCDEF
chr1 1 6 BCDEF
$ ./test.py ABCD
chr1 0 4 ABCD
chr1 10 14 ABCD
chr1 16 20 ABCD
One would prepare a BED file from the genome's FASTA (with spanning windows), and then modify the Python script to read that BED file on a line-by-line basis.
If a lot of these searches are done, this is also easily parallelizable, splitting work by chromosome or some other unit of work that matches the environment.
So as to avoid "double-hits", It might be worth piping the result to uniq
, or testing the length of the input string_to_match
to ensure that it is at least half as long as the interleaving spanning elements.
There are perhaps libraries that do a better job of dealing with these and similar edge cases. Still, hopefully this is a useful starting point.
any standard NGS mapper could do that
Not exactly. Most mappers will either report a single hit or top few hits.
bowtie -a seems to report all matches. http://bowtie-bio.sourceforge.net/manual.shtml#example-1--a
This is actually great. A true one-line solution (assuming you already have a Bowtie index).
BBMap will also report all matches when
ambig=all
option is used. You can addperfectmode
flag if you only need perfect matches (and additional options like minid=0.99). It should work with a fasta file but you should confirm.grep + genome.fasta ?
Convert the FASTA to BED in 50k or 100k chunks, with interweaving lines that straddle the window between neighboring chunks, of the same width as the search string (doesn't need to be any longer than 10 bases, say).
For instance:
When you do a
grep
call, you'll get the matching interval and the string containing the match. Pipe that to Perl or Python to do a regex to get the match and position index within the ID string. Add the start position of the interval to the position offset of the match within the ID string, and write out any matches, say, in BED, or whatever useful format.Nice solution, but would a regex have an added value when looking for exact matches only?
That gives you the matching lines, not the position.
what format is the genome file you intended to use?
Standard genome sequence format: FASTA.
Will work to find matches but the output isn't very informative I would say.
I considered to do something similar like you (but haven't yet came to it) and planned to use Hisat2 aligner for it (because of the speed), specify a very high mismatch penalty and disable splicing. Would be interesting to hear your experience with that.
Although BLAT does not work for short sequences, BLAST does. Check BLAST Command Line Applications User Manual and/or the Blast Settings For Short Sequences post.
As pointed out here, PatMatch has been offered a while on the web sites of several genomes. But there is also a stand-alone, command line-based version of the software described in the article that works with any FASTA file to find matches to sequence patterns. Find a list of the genomes offering PatMatch that I have seen, and a launchable, active Jupyter environment with the stand-alone, command line version of PatMatch installed and working at https://github.com/fomightez/patmatch-binder.