Finding ORFs in a FASTA file
0
0
Entering edit mode
6 weeks ago
Melissa • 0

I really don't know what I am doing. My program is above me. I have zero computer science background and the program is from the POV that everyone has one. I have a project where I have to take sequences from a FASTA file and look for ORFs. Once I find them, I need to score then according to a matrix I was given and only select:

  1. x >7.25 as quality score cutoff
  2. 60 Nucleotides coding for amino acids as min ORF length

My code works, but it gives me 1 result from the 19 entries from the FASTA file we were provided for the project. The questions we are asked after running our code suggest I should have more than 1 result. I don't know what I am doing wrong. I put in debugging codes to check the scores and length, and they support having 1 result (from what I can tell). Would you mind looking at this and letting me know what I am doing wrong? TIA.

from Bio import SeqIO

# Build a dictionary that allocates DNA bases to specific indexes
base_idx = { 'A' : 0, 'T' : 1, 'C' : 2, 'G' : 3 } # letter matching the index for motif scoring from motifCode.txt

# Initialize the motif scoring matrix from motifCode.txt
motif = [[.5,.5,.5,.5,0,0,0,0,0,2,-99,-99,.5],   #A
            [0,0,0,0,0,0,0,0,0,-99,2,-99,0],        #T
            [0,0,0,0,0,0,0,0,0,-99,-99,-99,0],      #C
            [.5,.5,.5,.5,0,0,0,0,0,.5,-99,2,0]]     #G

# Define a function that takes in a 13 bp sequence and returns a motif score
def scoreMotif(sequence):
    """
    Scores sequences that are 13 basepairs long using the motif matrix and returns a motif score.
    """
    if len(sequence) !=13:
        raise ValueError('Sequence length does not meet the required parameters.') # raise an error for invalid lengths

    score = 0 # initialize the score for each sequence
    # Iterate through the sequence to calculate the motif score, adding scores for each base at the matched indexes
    for i in range(13): 
        base = sequence[i] # get current base at the defined index 
        if base in base_idx: # checks if the bases in the sequence is in base_idx
            score += motif[base_idx[base]][i] # get the cumulative score

    return score

# Define a function that takes in a sequence and returns an array of sequences, corresponding ORF start positions, and ORF lengths
def scanSeq(sequence):
    """
    Searches a single sequence for ORFs and returns the start positions, lengths, and sequences.
    """
    # Initialize the start and stop codons, and define the requirements outlined in the activity guidelines
    start_codon = {'ATG', 'GTG'}
    stop_codon = {'TAA', 'TAG', 'TGA'}
    min_ORF = 60 # the required minimum ORF length 
    quality_score = 7.25 

    ORFs = []

    for i in range(len(sequence)-2): # iterate through the sequence from codon to codon, -2 maintains the codon triplet
        if sequence[i:i+3] in start_codon: # Find a start codon and search for a stop codon
            for j in range(i+3, len(sequence)-2, 3): # +3 maintains the codon triplet
                if sequence [j:j+3] in stop_codon:
                    ORF_sequence = sequence[i:j+3] # the length are the sequences between the start and stop codons

                    # Apply the required quality score using the scoreMotif function to the first 13 sequences to find the ORF length
                    if len(ORF_sequence) >=min_ORF:
                        score = scoreMotif(ORF_sequence[:13])
                        if score > quality_score:
                            ORFs.append((i, len(ORF_sequence), ORF_sequence)) # append the position, ORF sequence, and corresponding length

    return ORFs

# Define a function that takes in an input file and outputs a FASTA file that identifies what contig the sequences came from
def identifyORFs(input_file, output_file):
    """
    Performs the program's goal: finds the most likely coding region.
    """
    with open(output_file, 'w') as out_f:
        for record in SeqIO.parse(input_file, 'fasta'): # from the Biopython library, handling DNA sequences, reads and parses FASTA files
            # Get the contig identifier for the FASTA header 
            if "|" in record.id:
                contig_ID = record.id.split('_')[1].split('|')[0] # parse out the number if header doesn't end at a number identifier
            else:
                contig_ID = record.id.split('_')[1] # parse out the number after the underscore if header ends at a number identifer

            # Get the contig sequence and ORF sequence    
            sequence = str(record.seq)
            ORFs = scanSeq(sequence) # executes scanSeq to get the ORFs in the sequence

            for pos, length, ORF_sequence in ORFs:
                out_f.write(f"> contig {contig_ID}|Length {length}|at position {pos}\n") # formats FASTA header
                # out_f.write(f"{ORF_sequence}\n") # ORF sequence

                # Format the output file so that it confirms to FASTA standards of 60 characters
                for i in range (0, len(ORF_sequence), 60):
                    out_f.write(ORF_sequence[i:i+60] + "\n") # ORF sequence 

if __name__ == '__main__':
    input_file = "/Users/admin/Desktop/spaceSeq.fa"  # defines path to input file (FASTA)
    output_file = "/Users/admin/Desktop/name.fa"  # defines path to output FASTA file
    identifyORFs(input_file, output_file)
    print(f"ORFs have been successfully identified and written as a FASTA file") # success message to confirm execution
MatrixScore ORFs Motifs MotifScoring • 763 views
ADD COMMENT
1
Entering edit mode

Do you have to write (or use) your own script to get the ORFs? If not, there are far better options around that will do this for you.

Perhaps one remark on this code already: I have the impression you're not scanning all possible frames in your input sequences (eg. the comment on this line : for i in range(len(sequence)-2): makes no sense, or is even wrong) . You are for instance not processing the reverse strand == you're missing 3 "frames" completely)

ADD REPLY
0
Entering edit mode

Good point with the reverse strand, it isn't clear from the questions if scanning the reverse strand was part of the task. The comment on that line looks like something ChatGPT would write.

ADD REPLY
0
Entering edit mode

It's my comment. Codons are groups of 3 bases. My aim was to scan the sequence in triplets. I don't know how this would be done using an indexed system. I wanted to enter a loop to search in triplets, one codon at a time.

ADD REPLY
0
Entering edit mode

We have to do everything on our own. I spoke with my professor and it was an issue of not scoring appropriately. I was supposed to score the sequence upstream of the codons to find the best score, which indicates it's a motif. Then I needed to find a start and stop codon to find the ORF. I was scoring the start codons instead. So I fixed that aspect, redesigning the program again from scratch, revisiting my logic. But I still only get 1 stupid result. The problem (as I mentioned below) is that I literally don't know what I am doing. I am a wet lab cancer biologist. So something like what you mentioned, I would not pick up on. But thank you for identifying one of the problem spots. Unfortunately, I do not know how to fix it. DNA is read in 6 frames, 3 for the reverse and 3 for the complementary. That was what I was trying to accomplish in the code. But I spoke with my professor and he said normally you would do that, but he made sure to have a matrix and sequence that worked without having to get them. So I omitted that part of the code in my edit.

ADD REPLY
0
Entering edit mode

The problem (as I mentioned below) is that I literally don't know what I am doing.

Kudos for the honesty :) However, that is true for most people and even more for those who won't admit it.

I think the first task here would be to generate a few example sequences yourself and then see if they are detected by your code.

Edit nope: that was wrong But seriously, even now with the changes in the task, you can simplify your scoring by simply searching for [AG]{4,4}.{5,5}ATG in the input sequence, because neither your scoring matrix nor the threshold changed. Search your input sequence with this regex and see how many hits there are, then look for ORFs.

ADD REPLY
0
Entering edit mode

Thank you everyone. As I said, it doesn't make sense to me either. I'm a wet lab cancer researcher. I'm not a coder. I formulate the work flow logically and need help translating it into code. I reached out to my teachers and they told me one place where the logic didn't make sense. So, I restructured everything:

from Bio import SeqIO

# Build a dictionary that allocates DNA bases to specific indexes
base_idx = { 'A' : 0, 'T' : 1, 'C' : 2, 'G' : 3 } # letter matching the index for motif scoring from motifCode.txt

# Initialize the motif scoring matrix from motifCode.txt
motif = [[.5,.5,.5,.5,0,0,0,0,0,2,-99,-99,.5],   #A is scored a 2 and i = 9
         [0,0,0,0,0,0,0,0,0,-99,2,-99,0],        #T is scored a 2 at i = 11
         [0,0,0,0,0,0,0,0,0,-99,-99,-99,0],      #C is scored a 0 at all but when i = 10-12
         [.5,.5,.5,.5,0,0,0,0,0,.5,-99,2,0]]     #G is scored a 2 at i = 12       

# Define a function that takes in a 13 bp sequence and returns a motif score
def scoreMotif(sequence):
    """
    Scores sequences that are 13 basepairs long using the motif matrix and returns a motif score.

    Parameters:
    sequence (str): DNA sequences 13 basepairs long.

    Returns:
    list: motif scores for a 13-base window along each DNA sequence.
    """
    scores = []  # stores scores for each 13-base window    

    # Scan through both contig to score each 13 base window (slidding across the strand 1 base at a time)
    for i in range(len(sequence) - 12):  # iterates over a 13 base window
                subsequence_window = sequence[i:i + 13] # define the window so it slides across every 13 bases
                score = 0 # initialize the score each time
                # Iterate through the sequence to calculate the motif score, adding scores within each window
                for j in range(13): 
                    base = subsequence_window[j] # get current base at the defined index and iterate over each position
                    if base in base_idx: # checks if the base is valid by matching it to the base_idx
                        score += motif[base_idx[base]][j] # get the cumulative score
                scores.append(score)

    return scores

# Define a function that takes in a sequence and returns an array of sequences, corresponding ORF start positions, and ORF lengths
def scanSeq(sequence):
    """
    Searches a single sequence for ORFs and returns the start positions, lengths, and sequences.

    Parameters:
    sequence (str): DNA sequences.

    Returns:
    list (tuples): an ORF sequence, the length, and the position.
    """
    # Initialize the start and stop codons, and define the requirements outlined in the activity guidelines
    start_codon = {'ATG', 'GTG'}
    stop_codon = {'TAA', 'TAG', 'TGA'}
    min_ORF = 60 # the required minimum ORF length 
    quality_score = 7.25 

    ORFs = [] # empty set to store ORFs

    # Scan for motifs in every 13-base window to take sequences that score above 7.25
    high_score = [] # empty set to store high scoring motifs
    for i in range (len(sequence) - 12): # the motif must be 13 bases long
            subsequence_window = sequence[i:i+13] # take the 13 base window containing the possible motif  
            total_score = sum(scoreMotif(subsequence_window)) # find your total motif score from indices

            if total_score > quality_score: # if the motif score is higher then 7.25
                high_score.append(i) # store high scoring motifs 

    # Scan for ORFs that follow high-scoring start positions
    for pos in high_score:  # iterate through entire sequence to find a start codon
        if sequence[pos:pos+3] not in start_codon: # if a start codon is not found, keep searching (had to do this to start at a valid start)
            continue
        j = pos + 3  # move along the sequence in triplets to find stop codons

        # Search for all possible stop codons in triplets after the start codon
        while j < len(sequence) - 2:
            if sequence[j:j + 3] in stop_codon:  # if a stop codon is found within this window
                ORF_sequence = sequence[pos:j + 3]  # take this ORF

                if len(ORF_sequence) >= min_ORF:  # if the ORF lengths are at least 60 nucleotides long
                    ORFs.append((pos, len(ORF_sequence), ORF_sequence))  # store this ORF, its position and length

            j += 3  # continue searching for the next stop codon to find all possible stop codons before re-entering the loop

    return ORFs

# Define a function that takes in an input file and outputs a FASTA file that identifies what contig the sequences came from
def identifyORFs(input_file, output_file):
    """
    Performs the program's goal: finds the most likely coding region.

    Parameters:
    input_file (str): FASTA file containing DNA sequences.
    output_file (str): FASTA file identifying ORFs, their length, and position.

    Returns:
    None
    """
    with open(output_file, 'w') as out_f:
        for record in SeqIO.parse(input_file, 'fasta'): # from the Biopython library, handling DNA sequences, reads and parses FASTA files
            # Get the contig identifier for the FASTA header 
            if "|" in record.id:
                contig_ID = record.id.split('_')[1].split('|')[0] # parse out the number if header doesn't end at a number identifier
            else:
                contig_ID = record.id.split('_')[1] # parse out the number after the underscore if header ends at a number identifer

            # Get the contig sequence and ORF sequence    
            sequence = str(record.seq)
            ORFs = scanSeq(sequence) # executes scanSeq to get the ORFs in the sequence

            for start_pos, length, ORF_sequence in ORFs:
                out_f.write(f"> contig {contig_ID}|Length {length}|at position {start_pos}\n") # formats FASTA header

                # Format the output file so that it confirms to FASTA standards of 60 characters
                for i in range (0, len(ORF_sequence), 60):
                    out_f.write(ORF_sequence[i:i+60] + "\n") # ORF sequence 

if __name__ == '__main__':
    input_file = ""  # defines path to input file (FASTA)
    output_file = ""  # defines path to output FASTA file
    identifyORFs(input_file, output_file)
    print(f"ORFs have been successfully identified and written as a FASTA file") # success message to confirm execution

I still have issues though. I get 1 single result. When I remove this: if sequence[pos:pos+3] not in start_codon: # if a start codon is not found, keep searching (had to do this to start at a valid start) from my scanSeq function, I get more results. But they do not begin at a start codon.

ADD REPLY
0
Entering edit mode

Two suggestions. Please use ADD REPLY/ADD COMMENT when responding to existing answers/comments. SUBMIT ANSWER is for new answers to original question.

You can format the code portion of your post by highlighting it and then clicking 101010 button on edit window. I have done this for post above.

ADD REPLY

Login before adding your answer.

Traffic: 2641 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6