I'm trying to extract nucleotide sequences from a draft genome based on alignment coordinates from a SAM file. Right now, I'm opening the SAM file as a csv, then matching sequence name and finally grabbing the sequence based on the coordinates.
import csv
import re
from Bio import SeqIO
bam_file=open('orfH9_offtarget.sam', 'rb')
bam_reader=csv.reader(bam_file,delimiter='\t')
offtarget_results = open('offtarget_orfH9.txt', 'w')
unique_seq_names=[]
parsed_data=[]
num=[]
for row in bam_reader:
if row[2]:
genome_seq=SeqIO.parse(open('genome.fasta', 'rU'), 'fasta')
for contig in genome_seq:
if row[2] in contig.id:
if int(row[1]) == 256:
I = int(row[3])
num.append(contig.seq[i:i+20])
elif int(row[1]) == 272:
I = int(row[3])
num.append(contig.seq.reverse_complement()[i:i+20])
elif int(row[1]) == 0:
I = int(row[3])
num.append(contig.seq[i:i+20])
elif int(row[1]) == 16:
I = int(row[3])
num.append(contig.seq.reverse_complement()[i:i+20])
print("length%s" % len(num))
row[1] has flags that dictate whether I pull the reverse complement. 0 and 16 mean exact match while the other two flags mean non exact match. I'm only appending to a list right now to make sure that I successfully get each sequence. It will write to a file when it works.
This code works exactly as intended, but it's SLOW. I suspect the problem is re-opening the draft genome sequence file for every row, but I can't open it outside of the for loop.
Any suggestions? (Keeping in mind that I am a wet-lab scientist that knows a little python)
ok, I know close to nothing of Python, but I understand you are opening the same genome over and over, right?
If this is the case, of course you can move reading the genome to outside the loop. Then you will read it only once and keep it in memory.
I think reading the whole reference sequence and keeping it in memory is the way to go if you want to query many random positions. Even if fasta files are indexed, querying millions of positions will take much much longer then accessing the genome stored memory.
This simple function will read the reference fasta and make a dictionary of it (not as tested and robust as biopython but you drop a dependency):