I would like to get all coordinates for a given series of short sequence motifs, e.g. GATTACA (and its reverse complementary TGTAATC) from a reference genome, but want to do it in a way that allows me to get the coordinates of the sampled motif (because I need to intersect these coordinates with other annotation).
My current approach is based on a strict blastn-short search of that motif. However, this is extremely inefficient, because the minimum length for the algorithm to work requires sequences of a minimum length of 15 bp, so I need to append all possible combinations of nucleotides to the flanks of the motif to reach that minimum length. As consequence, the blast search is not just on a couple or a few motifs, but on thousands of them. For shorter targets, more than inefficient is is plain impractical.
There must be a way to do this more straightforwardly, but have doubts for longer ones, and I am wondering if anybody here has a better idea on how to address this problem.
Thank you,
EDITED: what I want really is not sampling, but getting all genomic coordinates for these motifs.
Yeah, this is probably the best solution as it is an efficient command line tool and the solutions in the thread I linked are rather custom code. I always forget about seqkit but it is a super handy and efficient tool, swiss army knife for fasta/q files.
From the seqkit link: