Hi,
I'm new to bioinformatics so please bear with me! I have a python script which is running through a series of contigs (in one fasta file) giving back all the ORFs found in each one of them. I used mostly the code from biopython cookbook (slightly modyfing it to give me an id of a sequence and run through several sequences). My problem is that the first ORF it is finding is starting with a negative value. I'm not sure what is causing this and how to fix it. Maybe someone can take a look at the code and output and help me find a problem here (if there is any...). I use Python 3 and Biopython 1.66.
from Bio import SeqIO
record = SeqIO.parse("test.fasta", 'fasta')
table = 1
min_pro_len = 100
ORFs = open("ORF.fasta", "w")
#finding orfs and translating into proteins
def find_orfs_with_trans(seq, trans_table, min_protein_length):
answer = []
seq_len = len(seq)
for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
for frame in range(3):
trans = str(nuc[frame:].translate(trans_table))
trans_len = len(trans)
aa_start = 0
aa_end = 0
while aa_start < trans_len:
aa_end = trans.find("*", aa_start)
if aa_end == -1:
aa_end = trans_len
if aa_end-aa_start >= min_protein_length:
if strand == 1:
start = frame+aa_start*3
end = min(seq_len,frame+aa_end*3+3)
else:
start = seq_len-frame-aa_end*3-3
end = seq_len-frame-aa_start*3
answer.append(( start, end, strand, trans[aa_start:aa_end]))
aa_start = aa_end +1
answer.sort()
return answer
#print results in a console and into a file
for each_record in record:
print each_record.id
orf_list = find_orfs_with_trans(each_record.seq, table, min_pro_len)
for start, end, strand, pro in orf_list:
print "%s...%s - length %i, strand %i, %i:%i" \
% (pro[:30], pro[-3:], len(pro), strand, start, end)
print >> ORFs,each_record.id, pro[:30], pro[-3:], len(pro), strand, start, end)
And the part of the output I'm getting (with a marked negative start and end of a sequence). Only the first contig get negative value. :
('NW_005871048.1', 'NNLQMNLSVLLPMQELHPWPHPRLMHFSPV', 'KRT', 223, -1, **-3, 669)**
('NW_005871048.1', 'GNSFSSSCGVTTGNGCPWLPCHSCSGVRTG', 'VGR', 103, 1, 135, 447)
('NW_005871048.1', 'ESGGTLSTLPDLSDCLTFQPWDGCMQVQQS', 'ESC', 124, 1, 836, 1211)
('NW_005871048.1', 'LSDLPALGWLHAGSAESPASSPHHPHNFLL', 'LRK', 113, 1, 877, 1219)
('NW_005871048.1', 'YRQIYMPRPRRTANTMNNHGNEDDQKENEK', 'HSL', 156, 1, 2204, 2675)
('NW_005871048.1', 'IFESKEQTEYLVTSLQRGPFQVPLVLKLSK', 'PQL', 136, 1, 3020, 3431)
('NW_005871048.1', 'QSGSKTKLNHMLAAKIHLNYKDKVKVNGWE', 'IVH', 116, 1, 3923, 4274)
('NW_005871048.1', 'LANGRCSEGMVNYHVCLLLDRIDDPLYICD', 'LVL', 111, -1, 5507, 5843)
('NW_005871048.1', 'VPSNGVKPAGPGQEQRVKAGEAVESKEGGW', 'ISF', 171, 1, 6367, 6883)
('NW_005871048.1', 'QWSEAGWPGTGAASEGRRGSGEQRGGMGGF', 'FWL', 149, 1, 6375, 6825)
('NW_005871048.1', 'NTVIGITKCDTDGSGHVKGEGQKGTHGLPF', 'AVC', 169, 1, 6931, 7441)
('NW_005871048.1', 'IFHPWNIPFPRESALIPQPGPGGAAQAVGT', 'KKF', 103, -1, 7100, 7412)
('NW_005871048.1', 'ASAPWWPVRIIAIGHSAVQSMCILAFYYYY', 'EPL', 110, -1, 8472, 8805)
('NW_005871048.1', 'ATAAGFPLLLPGAPAGFPLPLPGAVAGFPS', 'ISP', 362, -1, 9244, 10333)
I guess that you're getting -3 value due to the following line:
So, it seems that
seq_len-frame-aa_end*3
operation is giving a result of0
, and then with the-3
, you get a final start position of-3
. Maybe you can add aprint
before this line, and check the values ofseq_len, frame, aa_end
for this given sequence.