Question

Getting coordinates for a given sequence motif

0

Entering edit mode

3.0 years ago

elcortegano ▴ 200

I would like to get all coordinates for a given series of short sequence motifs, e.g. GATTACA (and its reverse complementary TGTAATC) from a reference genome, but want to do it in a way that allows me to get the coordinates of the sampled motif (because I need to intersect these coordinates with other annotation).

My current approach is based on a strict blastn-short search of that motif. However, this is extremely inefficient, because the minimum length for the algorithm to work requires sequences of a minimum length of 15 bp, so I need to append all possible combinations of nucleotides to the flanks of the motif to reach that minimum length. As consequence, the blast search is not just on a couple or a few motifs, but on thousands of them. For shorter targets, more than inefficient is is plain impractical.

There must be a way to do this more straightforwardly, but have doubts for longer ones, and I am wondering if anybody here has a better idea on how to address this problem.

Thank you,

EDITED: what I want really is not sampling, but getting all genomic coordinates for these motifs.

genome assembly • 1.7k views

ADD COMMENT • link updated 3.0 years ago by ATpoint 85k • written 3.0 years ago by elcortegano ▴ 200

1

Entering edit mode

3.0 years ago

ATpoint 85k

find positions of a short sequence in a genome

ADD COMMENT • link 3.0 years ago by ATpoint 85k

0

Entering edit mode

This looks promising, thanks!

ADD REPLY • link 3.0 years ago by elcortegano ▴ 200

ATpoint · Accepted Answer · 2021-12-07

4

Entering edit mode

3.0 years ago

rpolicastro 13k

In addition to the options presented in ATpoint's link, seqkit locate will work also. It's become my go-to software for many common tasks for fasta and fastq files.

ADD COMMENT • link updated 3.0 years ago by ATpoint 85k • written 3.0 years ago by rpolicastro 13k

2

Entering edit mode

Yeah, this is probably the best solution as it is an efficient command line tool and the solutions in the thread I linked are rather custom code. I always forget about seqkit but it is a super handy and efficient tool, swiss army knife for fasta/q files.

From the seqkit link:

# max mismatch: 1
$ cat t.fa \
  | seqkit locate -p agc -m 1 \
  | csvtk pretty -t
seqID   patternName     pattern strand  start   end    matched
seq     agc           agc       +        1       3     agc
seq     agc           agc       +        7       9     agc
seq     agc           agc       +        11      13    acc
seq     agc           agc       -        8       10    agc
seq     agc           agc       -        2       4     agc

ADD REPLY • link 3.0 years ago by ATpoint 85k