Question

Locating A Sequence In A Fasta File.

0

Entering edit mode

12.4 years ago

KCC ★ 4.1k

I have a nucleotide sequence and a genome in fasta file format. What's the easiest way to query the location in my fasta file of my nucleotide sequence via a command line script or software? I am only interested in exact matches.

fasta • 10k views

ADD COMMENT • link updated 12.4 years ago by SES 8.6k • written 12.4 years ago by KCC ★ 4.1k

3

Entering edit mode

use blat .........

ADD REPLY • link 12.4 years ago by Rm 8.3k

Ram · Answer 1 · 2013-03-18

If you have a lot of RAM or your genome fasta file is fragmented in small pieces, you can just use a python script (requires biopython). Something like this will work:

import sys
from Bio import SeqIO

query = sys.argv[2].lower()

for record in SeqIO.parse(open(sys.argv[1],'r'),'fasta'):
    indeces = []
    i = str(record.seq).lower().find(query)
    while i >= 0:
        indeces.append(str(i + 1))
        i =  str(record.seq).lower().find(query, i + 1)

    if len(indeces) > 0:
        print record.id + "\t" + ','.join(indeces)

Use by:

python script.py genome.fasta 'AGATAGATATATAGATAGA'

Took a minute or two on my computer with a ~800 megabyte genome in around 40,000 pieces.

Ram · Answer 2 · 2013-03-18

2

Entering edit mode

12.4 years ago

Martin A Hansen 3.0k

You can use Biopieces in particular patscan_seq:

read_fasta -i genome.fna | patscan_seq -p AGATAGATATATAGATAGA | write_bed -x

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 12.4 years ago by Martin A Hansen 3.0k

Ram · Answer 3 · 2013-03-18

0

Entering edit mode

12.4 years ago

Anima Mundi ★ 2.9k

Another solution in Python:

import sys

query = raw_input('Enter the query:\n')

subject = raw_input('Enter the subject:\n')

if query not in subject:
    sys.exit('Query not in subject')

n = len(query)
m = len(subject)
j = 0

while j <= m:
    if query == subject[j:j+n]:
        print 'Found match at positions', j+1, '-', j+n
    j += 1

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 12.4 years ago by Anima Mundi ★ 2.9k

0

Entering edit mode

Hi Anima,

What does "Enter the subject:\n" means?

Thanks

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 6.9 years ago by jleandroj • 0

0

Entering edit mode

Hi jleandroj,

When asked to enter the subject, you should copy-paste the genome sequence from its FASTA (or whatever sequence you want to parse for occurrences of the first one). Please just note that, as it is, the script is case-sensitive and does not allow newlines. Hope this helps.

ADD REPLY • link 6.9 years ago by Anima Mundi ★ 2.9k