Hi everyone,
From a FASTA file, I have created a dictionary in the form: seqs = {name1 : seq1, name2 : seq2, name3 : seq3... nameN : seqN}, where name corresponds to the name of a read and seq is the read itself. In my case N is 25, but I would like to leave it open rather than fixed to any number, for cases in which the FASTA file has a different number of reads.
What I need is to:
1. Detect the open reading frames (ORFs) present in each seq. There could be sequences that do not have any ORFs.
2. For the cases in which an ORF exists, create a new dictionary and append the nameX : ORFX pairs at the end.
3. For the cases in which no ORF exists, create the entry nameY : "has no ORF" in the dictionary.
4. Sort the dictionary from longest to shortest len(ORF), or from shortest to longest. It doesn't really matter which way, as long as the shortest and the longest ORF can be identified. The "has no ORF" values could go either at the beginning or the end.
5. Consider the start codon ATG and the stop codons TAG, TAA and TGA.
6. Do all of this for both strands (5' --> 3' and vice versa).
7. Consider reading frames 0, 1 and 2.
Optional (not necessary, but it would be nice to know this):
- Create another dictionary with the form: Prots = {name1 : amino acid sequence1, name2 : amino acid sequence2, name3 : amino acid sequence3,... nameN : amino acid sequenceN}
- Besides printing the results on the screen, store them in a .csv file too (or a .txt if a .csv is not possible)
I am running Python 3 on Ubuntu 16.04 and this is what I have done:
f = open("dna2.fasta", "r")
seqs = {}
for line in f:
    line = line.rstrip()
    if line.startswith(">"):        # header line: start a new record (startswith also handles empty lines safely)
        words = line.split()
        name = words[0][1:]         # the name is the first word of the header, without the ">"
        seqs[name] = ""
    else:                           # sequence line: append to the current record
        seqs[name] = seqs[name] + line

for name, seq in seqs.items():
    print(name, seq)
the above lines create the dictionary
f.seek(0)
for v in sorted(seqs, key=lambda v: len(seqs[v]), reverse=True):
    print(v, len(seqs[v]))
the above lines print the sequence names sorted by the length of their sequences
from here on the problems begin:
dict_ORFs = {}
frame = input("Select the starting frame for reading the sequence: ")
if int(frame) in (0, 1, 2):
    def has_start_codon(seq, frame):
        "This searches for the start codon ATG in a sequence, depending on the chosen reading frame"
        start_codon_found = False
        start_codon = ["ATG"]
        for i in range(int(frame), len(seq) - 2, 3):
            codon = seq[i:i+3].upper()
            if codon in start_codon:
                start_codon_found = True
                print("The start codon was found in position %s, from reading frame %s" % (i, int(frame)))
                break
        if not start_codon_found:
            print("There is no start codon. Therefore, there is no ORF at reading frame %s." % int(frame))
        return start_codon_found

    def has_stop_codon(seq, frame):
        "This searches for the stop codons TGA, TAG and TAA in a sequence, depending on the chosen reading frame"
        stop_codon_found = False
        stop_codon = ["TGA", "TAG", "TAA"]
        for i in range(int(frame), len(seq) - 2, 3):
            codon = seq[i:i+3].upper()
            if codon in stop_codon:
                stop_codon_found = True
                break
        return stop_codon_found
else:
    print("%s is not a valid reading frame. Please select 0, 1 or 2." % int(frame))
f.close()
Thank you guys in advance!
Is it necessary that you have all the sequences available in the dictionary all the time? With a large enough Fasta file, you could get memory problems... (source: I am working with wheat :P)
Hi cschu1981. Well, I am quite new to Python and the way I learned it was by using dictionaries. I am open to any new ideas...
Any particular reason it has to be dictionary based (or python?)
Hi jrj.healey. I am quite new with Python and I thought dictionaries were a good option to work with sequences from a FASTA file, as you can have the name as a key and the sequence as a value in a quite organised way. It should be Python because that is the programming language that I have started to work with. What would you propose instead?
Can you recheck your indentation? Looks really odd in the bottom part, where you have an if statement containing a function declaration.

Hi WouterDeCoster. Well, if you see that it is quite odd, then there is a high chance that you are right and that I did something wrong. I am learning Python, so my level is still quite low, and therefore I also guess that my scripts can not only be corrected but improved or made more efficient. All the last part (starting with "frame = "... until "f.close()") has been a headache, so please feel free to edit it in the way you think it should work. I have already been working on it for a couple of weeks and don't achieve anything better :(
Learning a programming language is hard. Definitely, if you do it on your own. Python is quite friendly and the learning curve isn't too steep, but it's still tough. And you didn't choose an easy problem to start with ;-)
An interesting resource to get better at programming is Rosalind. You'll get biological problems of increasing difficulty.
For working with fasta files it would be good to use Biopython, for which there is an excellent tutorial. Using functions from Biopython SeqIO will avoid awkward constructions such as building the sequence dictionary line by line yourself.
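As an illustration, a minimal sketch of what that could look like (assuming Biopython is installed and your file is dna2.fasta, as in the script above):

from Bio import SeqIO

seqs = {}
for record in SeqIO.parse("dna2.fasta", "fasta"):
    seqs[record.id] = str(record.seq)   # record.id is the header up to the first whitespace

This does the same as your parsing loop, but all the header and line handling is left to SeqIO.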
I'll have a look at your code later, tonight or tomorrow (kinda busy at the moment).
Hi WouterDeCoster. Thanks a lot for your corrections, suggestions and recommendations. I will take a look at Rosalind for sure (I didn't know about it before). I am currently taking a course from Coursera (Python for Genomic Data Science) and part of what I have posted belongs to a task. However, I want to make my program a bit more powerful and decided to add more things to it, for the cases in which the number of sequences is different, as well as the lengths of each read and so on. I appreciate your support very much and I look forward to your ideas. For now, I will apply the corrected indentations. Thanks again ;-)
I agree that dictionaries are awesome, but I'm not sure they are the most appropriate data container here. A dictionary is great if you are processing some values and want to store them for accessing later through a keyword. Almost everything of your objective can be achieved by checking the sequences one by one, such as identification of ORFs, printing to screen and saving to a csv file. A nice side effect of this is that you need far less memory. Just the sorting of the sequences requires that you keep all information in memory to compare the lengths, but I don't know how important that feature is for you.
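To illustrate the one-by-one idea, a rough sketch (here find_orfs is a placeholder for whatever ORF finder you end up writing, and SeqIO plus the dna2.fasta filename are carried over from the suggestion above):

import csv
from Bio import SeqIO

with open("orfs.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["name", "longest_orf_length"])
    for record in SeqIO.parse("dna2.fasta", "fasta"):
        orfs = find_orfs(str(record.seq))                 # hypothetical ORF finder, one sequence at a time
        longest = max((len(o) for o in orfs), default=0)  # 0 if the sequence has no ORF
        print(record.id, longest)
        writer.writerow([record.id, longest])             # written out immediately, nothing kept in memory

Each record is processed and written as soon as it is read, so memory use stays flat no matter how many sequences the file contains.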
There are plenty of amazing tutorials and resources available. Don't be too hard on yourself and try to make a difficult problem work out, such as this script you are trying to write. You can aim lower and gradually, more difficult tasks will become obvious. Happy coding!
Thanks for your advice again. I am trying to work on my script, starting from the simplest tasks and then adding more complicated ones. Unfortunately, I still can't achieve it :-(
What could be a better way to work instead of a dictionary? Do you have any other suggestions that I could implement in my script?
It sounds like you basically have 2 or 3 actual tasks to achieve:
1 - You need a function that takes a fasta as an input and gives you back any ORFs it finds.
Here is a link where someone has some ORF finding code, you might be able to start here: Python- Find ORF in sequence, compound return statement
2 - You need to handle fasta manipulation, so you'd probably be best off looking at Biopython's SeqIO for this. You can very easily iterate over all the fasta sequences in a file and access object info such as the record name and sequence as 'separate entities' and then rejoin them in your output file however you wish.
Once you open your file, iterate each fasta record in turn, apply your ORF finder function to it, then output the record name, the number of ORFs, or the ORFs themselves.
I don't know whether you HAVE to have all the fastas in memory at the same time, or whether you are happy to write out to a file sequentially. If you want them in memory, you could append them each to a list and print out the list at the end of the code, or, every time a fasta is processed, write that entry to an output file in append mode.
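To make that concrete, here is a rough, untested sketch of both pieces: an ORF finder covering frames 0, 1 and 2 on both strands (ATG start, TAA/TAG/TGA stops), and a loop that applies it to every record and writes one line per sequence. Biopython's SeqIO and reverse_complement are used for convenience, and the output file name is just an example.

from Bio import SeqIO
from Bio.Seq import Seq

def find_orfs(seq):
    "Return all ATG..stop stretches found in frames 0-2 on both strands."
    stop_codons = {"TAA", "TAG", "TGA"}
    orfs = []
    for strand in (seq.upper(), str(Seq(seq).reverse_complement()).upper()):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i+3]
                if codon == "ATG" and start is None:
                    start = i                          # remember the first ATG of this ORF
                elif codon in stop_codons and start is not None:
                    orfs.append(strand[start:i+3])     # ORF runs from ATG up to and including the stop
                    start = None
    return orfs

with open("orf_report.txt", "w") as out:
    for record in SeqIO.parse("dna2.fasta", "fasta"):
        orfs = sorted(find_orfs(str(record.seq)), key=len, reverse=True)
        if orfs:
            out.write("%s\t%s\n" % (record.id, orfs[0]))   # report the longest ORF
        else:
            out.write("%s\thas no ORF\n" % record.id)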
In general, having the sequences in a dictionary is not wrong, it can just cause problems if you have so much sequence data that it fills up your memory. For the ORF prediction there is no real reason to have all the sequences available at once (i.e. in a dictionary), since you will be working on one sequence at a time and what you predict in one sequence does not have any influence on what happens with the other sequences. Thus, it does not really make sense to read them all into memory (assuming you're not doing parallel processing). What I usually do is to use the function readFasta() from https://github.com/cschu/ktoolu/blob/master/ktoolu_io.py (shameless self-advertisement). This function will read a Fasta file and return a generator over its contents, which means you can iterate through the file, obtaining one sequence record (id + sequence) at a time, and do the processing one by one. Maybe that is the same thing the I/O functions from Biopython do, I don't know. It works for me. An example would be:
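(A minimal sketch; I'm assuming each record comes back as an (id, sequence) pair, as described above; check ktoolu_io.py for the exact record format.)

from ktoolu_io import readFasta

for _id, _seq in readFasta("dna2.fasta"):   # one record at a time, nothing else is kept in memory
    print(_id, len(_seq))                   # e.g. print the length, or run your ORF finder here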