I really don't know what I am doing. My program is above me. I have zero computer science background and the program is from the POV that everyone has one. I have a project where I have to take sequences from a FASTA file and look for ORFs. Once I find them, I need to score then according to a matrix I was given and only select:
- x >7.25 as quality score cutoff
- 60 Nucleotides coding for amino acids as min ORF length
My code works, but it gives me 1 result from the 19 entries from the FASTA file we were provided for the project. The questions we are asked after running our code suggest I should have more than 1 result. I don't know what I am doing wrong. I put in debugging codes to check the scores and length, and they support having 1 result (from what I can tell). Would you mind looking at this and letting me know what I am doing wrong? TIA.
from Bio import SeqIO
# Build a dictionary that allocates DNA bases to specific indexes
base_idx = { 'A' : 0, 'T' : 1, 'C' : 2, 'G' : 3 } # letter matching the index for motif scoring from motifCode.txt
# Initialize the motif scoring matrix from motifCode.txt
motif = [[.5,.5,.5,.5,0,0,0,0,0,2,-99,-99,.5], #A
[0,0,0,0,0,0,0,0,0,-99,2,-99,0], #T
[0,0,0,0,0,0,0,0,0,-99,-99,-99,0], #C
[.5,.5,.5,.5,0,0,0,0,0,.5,-99,2,0]] #G
# Define a function that takes in a 13 bp sequence and returns a motif score
def scoreMotif(sequence):
"""
Scores sequences that are 13 basepairs long using the motif matrix and returns a motif score.
"""
if len(sequence) !=13:
raise ValueError('Sequence length does not meet the required parameters.') # raise an error for invalid lengths
score = 0 # initialize the score for each sequence
# Iterate through the sequence to calculate the motif score, adding scores for each base at the matched indexes
for i in range(13):
base = sequence[i] # get current base at the defined index
if base in base_idx: # checks if the bases in the sequence is in base_idx
score += motif[base_idx[base]][i] # get the cumulative score
return score
# Define a function that takes in a sequence and returns an array of sequences, corresponding ORF start positions, and ORF lengths
def scanSeq(sequence):
"""
Searches a single sequence for ORFs and returns the start positions, lengths, and sequences.
"""
# Initialize the start and stop codons, and define the requirements outlined in the activity guidelines
start_codon = {'ATG', 'GTG'}
stop_codon = {'TAA', 'TAG', 'TGA'}
min_ORF = 60 # the required minimum ORF length
quality_score = 7.25
ORFs = []
for i in range(len(sequence)-2): # iterate through the sequence from codon to codon, -2 maintains the codon triplet
if sequence[i:i+3] in start_codon: # Find a start codon and search for a stop codon
for j in range(i+3, len(sequence)-2, 3): # +3 maintains the codon triplet
if sequence [j:j+3] in stop_codon:
ORF_sequence = sequence[i:j+3] # the length are the sequences between the start and stop codons
# Apply the required quality score using the scoreMotif function to the first 13 sequences to find the ORF length
if len(ORF_sequence) >=min_ORF:
score = scoreMotif(ORF_sequence[:13])
if score > quality_score:
ORFs.append((i, len(ORF_sequence), ORF_sequence)) # append the position, ORF sequence, and corresponding length
return ORFs
# Define a function that takes in an input file and outputs a FASTA file that identifies what contig the sequences came from
def identifyORFs(input_file, output_file):
"""
Performs the program's goal: finds the most likely coding region.
"""
with open(output_file, 'w') as out_f:
for record in SeqIO.parse(input_file, 'fasta'): # from the Biopython library, handling DNA sequences, reads and parses FASTA files
# Get the contig identifier for the FASTA header
if "|" in record.id:
contig_ID = record.id.split('_')[1].split('|')[0] # parse out the number if header doesn't end at a number identifier
else:
contig_ID = record.id.split('_')[1] # parse out the number after the underscore if header ends at a number identifer
# Get the contig sequence and ORF sequence
sequence = str(record.seq)
ORFs = scanSeq(sequence) # executes scanSeq to get the ORFs in the sequence
for pos, length, ORF_sequence in ORFs:
out_f.write(f"> contig {contig_ID}|Length {length}|at position {pos}\n") # formats FASTA header
# out_f.write(f"{ORF_sequence}\n") # ORF sequence
# Format the output file so that it confirms to FASTA standards of 60 characters
for i in range (0, len(ORF_sequence), 60):
out_f.write(ORF_sequence[i:i+60] + "\n") # ORF sequence
if __name__ == '__main__':
input_file = "/Users/admin/Desktop/spaceSeq.fa" # defines path to input file (FASTA)
output_file = "/Users/admin/Desktop/name.fa" # defines path to output FASTA file
identifyORFs(input_file, output_file)
print(f"ORFs have been successfully identified and written as a FASTA file") # success message to confirm execution
Do you have to write (or use) your own script to get the ORFs? If not, there are far better options around that will do this for you.
Perhaps one remark on this code already: I have the impression you're not scanning all possible frames in your input sequences (eg. the comment on this line :
for i in range(len(sequence)-2):
makes no sense, or is even wrong) . You are for instance not processing the reverse strand == you're missing 3 "frames" completely)Good point with the reverse strand, it isn't clear from the questions if scanning the reverse strand was part of the task. The comment on that line looks like something ChatGPT would write.
It's my comment. Codons are groups of 3 bases. My aim was to scan the sequence in triplets. I don't know how this would be done using an indexed system. I wanted to enter a loop to search in triplets, one codon at a time.
We have to do everything on our own. I spoke with my professor and it was an issue of not scoring appropriately. I was supposed to score the sequence upstream of the codons to find the best score, which indicates it's a motif. Then I needed to find a start and stop codon to find the ORF. I was scoring the start codons instead. So I fixed that aspect, redesigning the program again from scratch, revisiting my logic. But I still only get 1 stupid result. The problem (as I mentioned below) is that I literally don't know what I am doing. I am a wet lab cancer biologist. So something like what you mentioned, I would not pick up on. But thank you for identifying one of the problem spots. Unfortunately, I do not know how to fix it. DNA is read in 6 frames, 3 for the reverse and 3 for the complementary. That was what I was trying to accomplish in the code. But I spoke with my professor and he said normally you would do that, but he made sure to have a matrix and sequence that worked without having to get them. So I omitted that part of the code in my edit.
Kudos for the honesty :) However, that is true for most people and even more for those who won't admit it.
I think the first task here would be to generate a few example sequences yourself and then see if they are detected by your code.
Edit nope: that was wrong But seriously, even now with the changes in the task, you can simplify your scoring by simply searching for
[AG]{4,4}.{5,5}ATG
in the input sequence, because neither your scoring matrix nor the threshold changed. Search your input sequence with this regex and see how many hits there are, then look for ORFs.Thank you everyone. As I said, it doesn't make sense to me either. I'm a wet lab cancer researcher. I'm not a coder. I formulate the work flow logically and need help translating it into code. I reached out to my teachers and they told me one place where the logic didn't make sense. So, I restructured everything:
I still have issues though. I get 1 single result. When I remove this: if sequence[pos:pos+3] not in start_codon: # if a start codon is not found, keep searching (had to do this to start at a valid start) from my scanSeq function, I get more results. But they do not begin at a start codon.
Two suggestions. Please use
ADD REPLY/ADD COMMENT
when responding to existing answers/comments.SUBMIT ANSWER
is for new answers to original question.You can format the
code
portion of your post by highlighting it and then clicking101010
button on edit window. I have done this for post above.