I have a nucleotide sequence and a genome in fasta file format. What's the easiest way to query the location in my fasta file of my nucleotide sequence via a command line script or software? I am only interested in exact matches.
I have a nucleotide sequence and a genome in fasta file format. What's the easiest way to query the location in my fasta file of my nucleotide sequence via a command line script or software? I am only interested in exact matches.
If you have a lot of RAM or your genome fasta file is fragmented in small pieces, you can just use a python script (requires biopython). Something like this will work:
import sys
from Bio import SeqIO
query = sys.argv[2].lower()
for record in SeqIO.parse(open(sys.argv[1],'r'),'fasta'):
indeces = []
i = str(record.seq).lower().find(query)
while i >= 0:
indeces.append(str(i + 1))
i = str(record.seq).lower().find(query, i + 1)
if len(indeces) > 0:
print record.id + "\t" + ','.join(indeces)
Use by:
python script.py genome.fasta 'AGATAGATATATAGATAGA'
Took a minute or two on my computer with a ~800 megabyte genome in around 40,000 pieces.
You can use Biopieces in particular patscan_seq:
read_fasta -i genome.fna | patscan_seq -p AGATAGATATATAGATAGA | write_bed -x
Another solution in Python:
import sys
query = raw_input('Enter the query:\n')
subject = raw_input('Enter the subject:\n')
if query not in subject:
sys.exit('Query not in subject')
n = len(query)
m = len(subject)
j = 0
while j <= m:
if query == subject[j:j+n]:
print 'Found match at positions', j+1, '-', j+n
j += 1
Hi jleandroj,
When asked to enter the subject, you should copy-paste the genome sequence from its FASTA (or whatever sequence you want to parse for occurrences of the first one). Please just note that, as it is, the script is case-sensitive and does not allow newlines. Hope this helps.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
use blat .........