ORF search gives negative start value
1
1
Entering edit mode
8.7 years ago

Hi,

I'm new to bioinformatics so please bear with me! I have a python script which is running through a series of contigs (in one fasta file) giving back all the ORFs found in each one of them. I used mostly the code from biopython cookbook (slightly modyfing it to give me an id of a sequence and run through several sequences). My problem is that the first ORF it is finding is starting with a negative value. I'm not sure what is causing this and how to fix it. Maybe someone can take a look at the code and output and help me find a problem here (if there is any...). I use Python 3 and Biopython 1.66.

from Bio import SeqIO


record = SeqIO.parse("test.fasta", 'fasta')
table = 1
min_pro_len = 100
ORFs = open("ORF.fasta", "w")

#finding orfs and translating into proteins
def find_orfs_with_trans(seq, trans_table, min_protein_length):
    answer = []
    seq_len = len(seq)
    for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
        for frame in range(3):
            trans = str(nuc[frame:].translate(trans_table))
            trans_len = len(trans)
            aa_start = 0
            aa_end = 0
            while aa_start < trans_len:
                aa_end = trans.find("*", aa_start)
                if aa_end == -1:
                    aa_end = trans_len
                if aa_end-aa_start >= min_protein_length:
                    if strand == 1:
                        start = frame+aa_start*3
                        end = min(seq_len,frame+aa_end*3+3)
                    else:
                        start = seq_len-frame-aa_end*3-3
                        end = seq_len-frame-aa_start*3
                    answer.append(( start, end, strand, trans[aa_start:aa_end]))
                aa_start = aa_end +1
    answer.sort()
    return answer

#print results in a console and into a file
for each_record in record:
    print each_record.id
    orf_list = find_orfs_with_trans(each_record.seq, table, min_pro_len)
    for start, end, strand, pro in orf_list:
        print "%s...%s - length %i, strand %i, %i:%i" \
            % (pro[:30], pro[-3:], len(pro), strand, start, end)

        print >> ORFs,each_record.id, pro[:30], pro[-3:], len(pro), strand, start, end)

And the part of the output I'm getting (with a marked negative start and end of a sequence). Only the first contig get negative value. :

('NW_005871048.1', 'NNLQMNLSVLLPMQELHPWPHPRLMHFSPV', 'KRT', 223, -1, **-3, 669)**
('NW_005871048.1', 'GNSFSSSCGVTTGNGCPWLPCHSCSGVRTG', 'VGR', 103, 1, 135, 447)
('NW_005871048.1', 'ESGGTLSTLPDLSDCLTFQPWDGCMQVQQS', 'ESC', 124, 1, 836, 1211)
('NW_005871048.1', 'LSDLPALGWLHAGSAESPASSPHHPHNFLL', 'LRK', 113, 1, 877, 1219)
('NW_005871048.1', 'YRQIYMPRPRRTANTMNNHGNEDDQKENEK', 'HSL', 156, 1, 2204, 2675)
('NW_005871048.1', 'IFESKEQTEYLVTSLQRGPFQVPLVLKLSK', 'PQL', 136, 1, 3020, 3431)
('NW_005871048.1', 'QSGSKTKLNHMLAAKIHLNYKDKVKVNGWE', 'IVH', 116, 1, 3923, 4274)
('NW_005871048.1', 'LANGRCSEGMVNYHVCLLLDRIDDPLYICD', 'LVL', 111, -1, 5507, 5843)
('NW_005871048.1', 'VPSNGVKPAGPGQEQRVKAGEAVESKEGGW', 'ISF', 171, 1, 6367, 6883)
('NW_005871048.1', 'QWSEAGWPGTGAASEGRRGSGEQRGGMGGF', 'FWL', 149, 1, 6375, 6825)
('NW_005871048.1', 'NTVIGITKCDTDGSGHVKGEGQKGTHGLPF', 'AVC', 169, 1, 6931, 7441)
('NW_005871048.1', 'IFHPWNIPFPRESALIPQPGPGGAAQAVGT', 'KKF', 103, -1, 7100, 7412)
('NW_005871048.1', 'ASAPWWPVRIIAIGHSAVQSMCILAFYYYY', 'EPL', 110, -1, 8472, 8805)
('NW_005871048.1', 'ATAAGFPLLLPGAPAGFPLPLPGAVAGFPS', 'ISP', 362, -1, 9244, 10333)
python ORFs • 2.3k views
ADD COMMENT
0
Entering edit mode

I guess that you're getting -3 value due to the following line:

start = seq_len-frame-aa_end*3-3

So, it seems that seq_len-frame-aa_end*3 operation is giving a result of 0, and then with the -3, you get a final start position of -3. Maybe you can add a print before this line, and check the values of seq_len, frame, aa_end for this given sequence.

ADD REPLY
0
Entering edit mode
8.7 years ago

Thanks for answering!

You mean print before start line...? If I get rid of this -3 I'm getting output like that:

   ('NW_005871048.1', 'NNLQMNLSVLLPMQELHPWPHPRLMHFSPV', 'KRT', 223, -1, 0, 669)
('NW_005871048.1', 'GNSFSSSCGVTTGNGCPWLPCHSCSGVRTG', 'VGR', 103, 1, 135, 447)
('NW_005871048.1', 'ESGGTLSTLPDLSDCLTFQPWDGCMQVQQS', 'ESC', 124, 1, 836, 1211)
('NW_005871048.1', 'LSDLPALGWLHAGSAESPASSPHHPHNFLL', 'LRK', 113, 1, 877, 1219)
('NW_005871048.1', 'YRQIYMPRPRRTANTMNNHGNEDDQKENEK', 'HSL', 156, 1, 2204, 2675)
('NW_005871048.1', 'IFESKEQTEYLVTSLQRGPFQVPLVLKLSK', 'PQL', 136, 1, 3020, 3431)
('NW_005871048.1', 'QSGSKTKLNHMLAAKIHLNYKDKVKVNGWE', 'IVH', 116, 1, 3923, 4274)

so I think that should solve the problem (apart from me being a bit confused why there was a -3 there in a first place)

ADD COMMENT
0
Entering edit mode

That's because you want to find the start; typically, the length of aa is 3 (e.g. Cysteine: TGT,TGC), thus the -3.

ADD REPLY
0
Entering edit mode

Removing -3 from the code is not the solution, because it will affect also to the other sequences (which are in principle OK). So, the first step should be to understand the meaning of this -3, and then think what to do. In this specific case, I guess that 0 is the correct start position, because the end position of the transcript is 669, and the total length of the protein is 223. So, the total length of the transcript will be 223*3 = 669 = end position.

ADD REPLY
0
Entering edit mode

Well, that's why I'm confused because putting -3 there made sense but output make more sense without it and sequences (apart from the first one) looks exactly the same.

Can it be caused by length of the contig which is only a partial codon and not a multiple of three?

ADD REPLY
0
Entering edit mode

If you look at the other sequences in the -1 strand, you'll see that if you remove -3, the start position will also change (this -3 is only affecting to the -1 strand sequences ).

ADD REPLY
0
Entering edit mode

You're right of course. As I said I'm still a bit lost about all this scripting business.

How to fix it without taking out -3 then? with "if" exceptions for -1 strands?

ADD REPLY

Login before adding your answer.

Traffic: 2114 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6