Locating A Sequence In A Fasta File.
3
0
Entering edit mode
11.7 years ago
KCC ★ 4.1k

I have a nucleotide sequence and a genome in fasta file format. What's the easiest way to query the location in my fasta file of my nucleotide sequence via a command line script or software? I am only interested in exact matches.

fasta • 9.3k views
ADD COMMENT
3
Entering edit mode

use blat .........

ADD REPLY
3
Entering edit mode
11.7 years ago

If you have a lot of RAM or your genome fasta file is fragmented in small pieces, you can just use a python script (requires biopython). Something like this will work:

import sys
from Bio import SeqIO

query = sys.argv[2].lower()

for record in SeqIO.parse(open(sys.argv[1],'r'),'fasta'):
    indeces = []
    i = str(record.seq).lower().find(query)
    while i >= 0:
        indeces.append(str(i + 1))
        i =  str(record.seq).lower().find(query, i + 1)

    if len(indeces) > 0:
        print record.id + "\t" + ','.join(indeces)

Use by:

python script.py genome.fasta 'AGATAGATATATAGATAGA'

Took a minute or two on my computer with a ~800 megabyte genome in around 40,000 pieces.

ADD COMMENT
2
Entering edit mode
11.7 years ago

You can use Biopieces in particular patscan_seq:

read_fasta -i genome.fna | patscan_seq -p AGATAGATATATAGATAGA | write_bed -x
ADD COMMENT
0
Entering edit mode
11.7 years ago
Anima Mundi ★ 2.9k

Another solution in Python:

import sys

query = raw_input('Enter the query:\n')

subject = raw_input('Enter the subject:\n')

if query not in subject:
    sys.exit('Query not in subject')

n = len(query)
m = len(subject)
j = 0

while j <= m:
    if query == subject[j:j+n]:
        print 'Found match at positions', j+1, '-', j+n
    j += 1
ADD COMMENT
0
Entering edit mode

Hi Anima,

What does "Enter the subject:\n" means?

Thanks

ADD REPLY
0
Entering edit mode

Hi jleandroj,

When asked to enter the subject, you should copy-paste the genome sequence from its FASTA (or whatever sequence you want to parse for occurrences of the first one). Please just note that, as it is, the script is case-sensitive and does not allow newlines. Hope this helps.

ADD REPLY

Login before adding your answer.

Traffic: 1845 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6