Help with Biopython for a Beginner
7 months ago
cput • 0

Hello! I am an absolute beginner to Python, attempting to learn it for genomics purposes, and I’ve been self-teaching through an online course. From the course, and many, many other examples on the internet, I have put together the mess of code below. It works perfectly until the has_start_codon part. I have been working through Visual Studio Code’s Python extension. I am attempting to answer these questions, as they are the last ones I haven’t been able to solve with my program: “Identify all ORFs present in each sequence of the FASTA. What is the length of the longest ORF? What is the identifier of the longest ORF? For a given identifier, what is the longest ORF contained in that sequence? What is the starting position of the longest ORF in that identified sequence? Identify all repeats in a sequence for all sequences in the FASTA, along with how many times each repeat occurs and which is the most frequent repeat.”

The primary problem, I think, is that I don’t know how to reference the sequences inside a FASTA file beyond what I have already, so my has_start_codon/has_stop_codon section of code isn’t working like I think it should, and I understand the last section (findLongestRepeat) even less. In my first section I use “dna” to refer to the sequence typed into the program, so I assumed “sequence” would refer to the lines of sequence within the FASTA file, but clearly I’m wrong about that. I've also been trying to get it to read the different open reading frames (ORFs) through a method similar to findLongestRepeat, but that's not working out either, and I have scrapped several versions to try to start over fresh. I’ve been stuck on this section for a week and have tried many, MANY different iterations and methods to answer the questions, but none have worked for me, and the course I have been learning from is vague about working from a FASTA file on these points. Any tips on how to get the program to find open reading frames and repeats in a FASTA file would be much appreciated.

from Bio.Seq import Seq
from Bio import SeqIO
import re

dna=Seq(input('Enter DNA sequence:'))
print(dna)
print(dna.complement())
print(dna.reverse_complement())
print(dna.translate())
def gc(dna):
    nbases=dna.count('n')+dna.count('N')
    gcpercent=float(dna.count('c')+dna.count('C')+dna.count('g')+dna.count('G'))*100/(len(dna)-nbases)
    return gcpercent
print(gc(dna))
pos=dna.find('GT',0)
while pos>-1:
    print("Donor splice site candidate at position %d"%pos)
    pos=dna.find('GT',pos+1)

for sequence in SeqIO.parse(input('Enter FASTA File here:'), "fasta"):
    from Bio.Seq import Seq
    print(sequence.id)
    print(repr(sequence.seq))
    print(len(sequence))
def has_start_codon(sequence,frame=0):
    start_codon_found=False
    start_codon=['ATG','atg']
    for i in range(frame,len(sequence),3):
        codon=sequence[i:i+3].lower()
        if codon in start_codon:
            start_codon_found=True
            print(start_codon_found)
            break
    return start_codon_found
def has_stop_codon(sequence,frame=0):
    stop_codon_found=False
    stop_codons=['tga','tag','taa','TGA','TAG','TAA']
    for i in range(frame,len(sequence),3):
        codon=sequence[i:i+3].lower()
        if codon in stop_codons:
            stop_codon_found=True
            print(stop_codon_found)
            break
        return stop_codon_found

import string
import sys

def findLongestRepeat(text):
    max = 1
    maxPos = -1
    maxDup = -1
    for pos in range(len(text)):
        dup = text.find(text[pos:pos+max], pos+1, len(text))
    while (dup > 0):
        maxPos = pos
        maxDup = dup
        max = max + 1
        dup = text.find(text[pos:pos+max], dup, len(text))
    return [maxPos, maxDup, max-1]
if (len(sys.argv) != 2):
    print("Usage: python", sys.argv[0], "<filename>")
else:
    text = sequence.readFastaFile(sys.argv[1])
    [pos, dup, ln] = findLongestRepeat(text)
    print("Found duplicate of length", ln)
    print(pos, text[pos:pos+ln])
    print(dup, text[dup:dup+ln])
Python ORF FASTA Biopython • 992 views

Please do not paste screenshots of plain-text content; it is counterproductive. You can copy and paste the content directly here (using the editor's code formatting option), or use a GitHub Gist if the content exceeds the length allowed here.


That turns it into a mess of paragraphs, but if it truly is more productive, thanks for the tip!


That turns it into a mess of paragraphs

See above. Select (highlight with the mouse) the part you want to format as code, then click the 101010 button in the editor and save. Voilà, you are done.


One question before offering more solutions: are you trying to learn Python with this as an example problem, or are you trying to find the best way to approach this?

I ask because with the former, you can write verbose functions to try to predict ORFs 'manually', but if you want to learn Biopython properly, you can lean on its existing functions and tools.
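As a rough illustration of leaning on existing Biopython methods, the sketch below translates a sequence in all six reading frames with `Seq.translate` and `Seq.reverse_complement`, then splits each translation at stop symbols. The function name `six_frame_proteins` and the `min_pro_len` cutoff are made up for this example. Note that it reports stretches between stop codons, so a stretch does not necessarily begin with a start codon; whether that matches the course's definition of an ORF is an assumption to check.

```python
from Bio.Seq import Seq


def six_frame_proteins(seq, table=1, min_pro_len=10):
    """Return (strand, frame, protein) tuples for stop-free stretches.

    Hypothetical helper: strand is +1 or -1, frame is 1, 2 or 3.
    Stretches are bounded by stop codons and need not start with M.
    """
    results = []
    for strand, nuc in ((+1, seq), (-1, seq.reverse_complement())):
        for frame in range(3):
            # Trim so the translated slice is a whole number of codons.
            length = 3 * ((len(nuc) - frame) // 3)
            translated = nuc[frame:frame + length].translate(table=table)
            # "*" marks a stop codon in the translated protein.
            for protein in str(translated).split("*"):
                if len(protein) >= min_pro_len:
                    results.append((strand, frame + 1, protein))
    return results


if __name__ == "__main__":
    for hit in six_frame_proteins(Seq("ATGAAATAGATGCCC"), min_pro_len=2):
        print(hit)
```

For a FASTA file, the same function could be called on `record.seq` for each record returned by `SeqIO.parse`.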

6 months ago
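```
from Bio import SeqIO

file_path = "path_to_your_fasta_file.fasta"

for record in SeqIO.parse(file_path, "fasta"):
    seq_id = record.id
    # record is a SeqRecord; record.seq holds the actual sequence
    sequence = str(record.seq)
    # find_orfs is not part of Biopython: it stands for an
    # ORF-finding function that you define yourself
    orfs = find_orfs(sequence)
```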
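The `find_orfs` function called in the loop above is left undefined, so here is one possible sketch of it in plain Python. It makes several assumptions that may differ from the course's definition: only the three forward-strand frames are scanned, an ORF runs from the first `ATG` to the next in-frame stop codon (stop included in the length), `ATG` codons nested inside an open ORF are ignored, and positions are reported 1-based. The helper `longest_orf` is likewise hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence):
    """Return a list of (start, length, frame) tuples for forward-strand ORFs.

    start is the 1-based position of the A in ATG, length includes the
    stop codon, and frame is 1, 2 or 3.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the ATG of the ORF currently open, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start + 1, i + 3 - start, frame + 1))
                start = None
    return orfs


def longest_orf(sequence):
    """Return the longest (start, length, frame) tuple, or None if no ORF."""
    orfs = find_orfs(sequence)
    return max(orfs, key=lambda orf: orf[1]) if orfs else None


if __name__ == "__main__":
    print(find_orfs("CCATGCCCTGAATGAAAAAATAA"))
    print(longest_orf("CCATGCCCTGAATGAAAAAATAA"))
```

Inside the loop, calling `longest_orf(str(record.seq))` for each record and keeping the result alongside `record.id` would give the longest ORF per identifier; taking the maximum of those gives the longest ORF in the file.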

  • You are passing in a single sequence as input. Instead, pass in the whole file and loop over its records, running whatever analysis you need on each sequence in the file.
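The same per-sequence approach could cover the repeats question. The sketch below assumes a "repeat" means a substring of a fixed length `n` that occurs more than once, with overlapping occurrences counted; the function names are invented for this example, and the input is a list of plain strings such as `str(record.seq)` for each record.

```python
from collections import Counter


def count_substrings(sequence, n):
    """Count every substring of length n, overlapping occurrences included."""
    seq = sequence.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def find_repeats(sequence, n):
    """Return {substring: count} for substrings of length n seen more than once."""
    return {sub: c for sub, c in count_substrings(sequence, n).items() if c > 1}


def most_frequent_repeat(sequences, n):
    """Return (substring, total_count) for the most common repeat across
    all sequences, or None if nothing occurs more than once."""
    totals = Counter()
    for seq in sequences:
        totals.update(count_substrings(seq, n))
    repeated = Counter({sub: c for sub, c in totals.items() if c > 1})
    return repeated.most_common(1)[0] if repeated else None


if __name__ == "__main__":
    print(find_repeats("ACACA", 3))
    print(most_frequent_repeat(["ACACA", "TTACAGG"], 3))
```

If several repeats tie for the highest count, only one of them is returned.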

You can also write a function that computes the complement, the reverse complement, and the protein sequence, and returns those values, so that you can call it on every sequence.
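A minimal sketch of such a function follows. The name `describe_sequence` and the dictionary layout are just one possible choice. It trims the sequence to a whole number of codons before translating, which is an assumption about how trailing bases should be handled.

```python
from Bio.Seq import Seq


def describe_sequence(dna):
    """Return the complement, reverse complement and protein as strings.

    dna may be a plain string or a Biopython Seq.
    """
    seq = Seq(str(dna))
    # Drop trailing bases that do not form a complete codon.
    usable = len(seq) - len(seq) % 3
    return {
        "complement": str(seq.complement()),
        "reverse_complement": str(seq.reverse_complement()),
        "protein": str(seq[:usable].translate()),
    }


if __name__ == "__main__":
    print(describe_sequence("ATGGCCTAA"))
```

In the FASTA loop, this could be called as `describe_sequence(record.seq)` for each record.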
