Hello,
I want to obtain DNA sequences of all the CDS from multiple genbank files in one fasta file. I tried several solutions with Biopython but nothing is working for me.
I tried for example the following code:
from Bio import SeqIO
genome = next(SeqIO.parse("NC_018019.2.gb", "genbank"))
cds = []
for feature in genome.features:
if feature.type == 'CDS':
cds.append(feature)
seqs = []
for feature in cds:
seq = genome[int(feature.location.start):int(feature.location.end)]
if feature.location.strand == -1:
seq.seq = seq.seq.reverse_complement()
seq.description = feature.qualifiers['product'][0]
seqs.append(seq)
SeqIO.write(seqs, 'CDS.fasta', 'fasta')
But on line 4 I got the error IndentationError: expected an indented block after 'if' statement on line 2.
I am new with python, so I don't know how to make it works.
Also this solution was quite old and I am working on Python 3.12.0 and Biopython 1.81 (can get also 1.76).
Maybe this is an issue of versions.
Here is an example of genbank file:
Ideally, I would like to obtain all the DNA sequences from all the genbank files for all the CDS and another for all the "integrase".
Thank you very much for your help.
Hello, Thank you very much for your quick reply. That was indeed the issue!
Another question: is there a simple way of automating this script so that it applies to all gbk files?
Look into passing command line parameters to a python script. You can pass the input file name as a parameter to the script and execute the script from the command line. You'd also need to set the output file name dynamically (using the input file name to derive it, for example).
An alternative is to wrap the code in a function and call the function using a simple loop or list comprehension.
Thanks again for your reply. I'm not used to python so I don't really know how to do that, but I will search how to put the input file name as a paremeter.
Here's a hint and some pseudocode. There are more robust ways of doing it but this is a quick and dirty solution.
Thank you, I see it just now
I tried adding this part to my code with also
But for the next step of my script I've got the following error "NameError: name 'genome' is not defined
I suggest you don't use
parse
. You are making life complicated for yourself with the use ofnext
etc.Have a look at the biopython manual to understand the difference.
Okay. I have tried it also, but this time I changed the way to call the script. In the terminal I write:
My script is the following:
but I get the following error:
Sorry, I'm not really used to python. This is the first time I've used it because I haven't found any other solution for recovering nucleic acid CDS sequences.
You are passing it a directory, when it is expecting a list of filenames. Hence the error is telling you you've given it a directory.
Run the code as
python scriptname.py /path/to/files/*.gbk
insteadThanks a lot. I thought my script already took the gbk files and I just had to specify a directory. Anyway, it works now. Only one last problem: my last line to get a file with all the sequences is probably wrong. I only got the CDS sequences from the last .gbk file in my directory.
I have found scripts to create different files depending on each .gbk file, but in my case I want a single final fasta file with all the sequences from all the .gbk files. Is this possible?
You are defining the
seqs
list during the loop, so it is overwritten each time the loop runs (once per file). You need to append the sequences to the list inside the loop as you currently do, but you need to declare the list just once outside the loop.Then the list will be appended with the CDSs from each file.
Thanks Sorry, but again I'm not used to python, so I'm not sure I understand your suggestion. Do you mean that I should put the command line "
SeqIO.write(seqs, 'test_all_gbk.fasta', 'fasta')
" inside the loop? I've tried this option but to no success. I must have forgotten some of the steps. And how do I "declare the list" outside the loop?An other issue I found today: my script is not working for all cds sequences. Actually it takes only the "complement" ones and not the others. I guess there is something to do with the line "
seq.seq = seq.seq.reverse_complement()
", however, I don't know how to allow both (complement and 'normal' sequences).Sorry, It was actually quite easy, I added an "else" line.
I've tried to adapt a solution to my script, which is as follows:
but I can't get any sequences
We can't help you debug the script if you don't tell us what the actual problem is. You need to share the exact errors.
After the loop (for file_...) I get the following error:
That doesn't add up - are you sure the code you've pasted here matches the code you're running?
You don't use the variable
files
anywhere in that script, so naturally, it is not defined.I think you need to proof-read your script very closely and understand what each element is trying to do. You've also posted 2 different versions for which you're now getting 2 different errors, so be careful you aren't getting mixed up.