Hi all,
I'm relatively new to Python and I need help extracting FASTA sequences using separate lists of start and end coordinates. Here is my code:
import sys
import os
from Bio import SeqIO
infile1 = sys.argv[1] # your original FASTA file to extract sequences from
fasta_file = open(infile1)
### To extract your set of start and end coordinates
start_coordinates = os.system("cat indices_trial | awk '{print $3}'")
end_coordinates = os.system("cat indices_trial | awk '{print $5}'")
results_file = raw_input('Enter a new filename to save your results:')
outfile = open(results_file, 'w')
### Now to tell Python to extract those sequences whose coordinates were not specified in 'start_coordinates' & 'end_coordinates' list
for line in SeqIO.parse(fasta_file, "fasta"):
    if line.seq != line.seq["%i-1:%i " % start_coordinates, end_coordinates]:
        print 'Unpredicted gene:' , line.seq , ',' , 'Sequence length =' , len(line.seq)
    else:
        print 'Not found.'
Can a Python programmer out there help me?
Rosary
If it is any help, here is the output of the run and the error message that follows:
1 5
4 10
Enter a new filename to save your results:trial_results
Traceback (most recent call last):
  File "extract_UNpredicted_genes.py", line 17, in <module>
    if line.seq != line.seq["%i:%i + 1" % start_coordinates, end_coordinates]:
TypeError: not enough arguments for format string
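Hi Rosary, a note on the traceback: as far as I can tell there are two separate problems. os.system() returns an integer status code, not the text that awk prints, which is why the coordinates appear on the screen instead of ending up in your variables. Also, % binds more tightly than the comma, so "..." % start_coordinates, end_coordinates hands only one value to a format string that expects two, hence the TypeError. Even with that fixed, a sequence has to be sliced with integers rather than with a string. Below is a minimal sketch in Python 3 (your script is Python 2). read_coordinates is a hypothetical helper, and I am assuming indices_trial is whitespace-separated with the start in column 3 and the end in column 5, as your awk commands suggest.

```python
def read_coordinates(path, start_col=3, end_col=5):
    """Parse 1-based columns (awk-style $3 and $5) into two lists of ints.

    Assumption: every non-blank line holds integers in those columns;
    a header line would need to be skipped separately.
    """
    starts, ends = [], []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < max(start_col, end_col):
                continue  # skip blank or short lines
            starts.append(int(fields[start_col - 1]))
            ends.append(int(fields[end_col - 1]))
    return starts, ends


# `"%i:%i" % 1, 4` is parsed as `("%i:%i" % 1), 4`, so the format
# string receives a single value and raises the TypeError seen above.
try:
    "%i:%i" % 1, 4
except TypeError as err:
    message = str(err)

# Slicing takes integers directly. For 1-based inclusive coordinates
# (start, end) the Python slice is [start - 1:end].
sequence = "ACGTACGTAC"
start, end = 1, 4
piece = sequence[start - 1:end]  # "ACGT"
```

With the file parsed this way, starts and ends are ordinary Python lists that can be looped over together with zip(starts, ends).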
Hi, please show us a few sample lines of infile1.
Hi,
An example of infile1 would be:
followed by scaffolds 3, 4, 5 and so on. They are basically FASTA nucleotide sequences.
Thanks!
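If the goal is to report the stretches of a scaffold that are not covered by any of the listed start/end pairs (the "unpredicted" regions), one possible approach is sketched below in Python 3. This is only a sketch under assumptions: uncovered_regions is a hypothetical helper, the coordinates are taken to be 1-based and inclusive, and all of them are assumed to refer to the same sequence. With several scaffolds, the coordinates would first have to be grouped per scaffold, which depends on the layout of the coordinate file.

```python
def uncovered_regions(sequence, starts, ends):
    """Yield (start, end, subsequence) for stretches outside all intervals.

    Coordinates are 1-based and inclusive, on input and on output.
    """
    position = 1  # next 1-based position not yet covered by an interval
    for start, end in sorted(zip(starts, ends)):
        if start > position:
            # Gap between the previous interval and this one.
            yield position, start - 1, sequence[position - 1:start - 1]
        position = max(position, end + 1)
    if position <= len(sequence):
        # Tail of the sequence after the last interval.
        yield position, len(sequence), sequence[position - 1:]


# Example with a made-up 14-base sequence and two made-up intervals.
example = "ACGTACGTACGGTT"
gaps = list(uncovered_regions(example, [1, 7], [4, 10]))
for gap_start, gap_end, gap_seq in gaps:
    print("Unpredicted region:", gap_start, gap_end, gap_seq, "length =", len(gap_seq))
```

The function only needs len() and slicing, so it should also accept the record.seq of each record returned by SeqIO.parse(fasta_file, "fasta") in place of the plain string used here.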