I have a combined GenBank file of polymerase-coding genes from all virus families. How can I extract the gene IDs and the corresponding start...end positions?
Perhaps one of the many other all.*.tar.gz files at ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/ would be easier to parse, or could even give you exactly what you want without any parsing.
Did you try anything? This is basic string extraction, and Stack Overflow is overloaded with these questions already. It would help to know what you have tried and what output format you expect.
You can use Biopython for that. But don't merge the GenBank files; keep them separate, read them one by one with Biopython, and extract the information. Check out the example code below for NC_005816:
from Bio import SeqIO

# Read the single record in the GenBank file
record = SeqIO.read('NC_005816.gb', 'genbank')
for feature in record.features:
    if feature.type == 'gene':
        print('Location:', feature.location)
        print('Locus tag:', feature.qualifiers['locus_tag'])
Output:
Location: [86:1109](+)
Locus tag: ['YP_RS22210']
Location: [1108:1888](+)
Locus tag: ['YP_RS22215']
Location: [2924:3119](+)
Locus tag: ['YP_RS22220']
Location: [4354:4780](+)
Locus tag: ['YP_RS22225']
Location: [4814:5888](-)
Locus tag: ['YP_RS22230']
Location: [6115:6421](+)
Locus tag: ['YP_RS22235']
Location: [6663:7602](+)
Locus tag: ['YP_RS22240']
Location: [7788:8088](-)
Locus tag: ['YP_RS22245']
Location: [8087:8429](-)
Locus tag: ['YP_RS22250']
Location: [9131:9347](-)
Locus tag: ['YP_RS22255']
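A note on reading these locations: Biopython reports positions Python-style, counting from 0 with the end excluded, so [86:1109] is the span that the GenBank file itself writes as 87..1109. Here is a minimal sketch of the conversion, assuming simple (non-joined) locations; the function name is made up for illustration.

```python
def to_genbank_coords(start, end):
    """Convert a 0-based, end-exclusive span to 1-based, inclusive coordinates."""
    return start + 1, end


# With Biopython, the two integers would come from
# int(feature.location.start) and int(feature.location.end).
first_gene = to_genbank_coords(86, 1109)
print("%d..%d" % first_gene)  # prints 87..1109
```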
Thanks, this is helpful. I'm not sure but SeqIO might also read multi-genbank format. But she might also download them again (very small files although I don't know how many she has).
The community is willing to offer guidance, but our time is limited, too. Many of your questions are variations on "I don't know how to do this - help!" - which is okay for the first few questions, but becomes increasingly tiresome with repetition. At some point, you need to invest the time and develop the skills to actually perform some of the analysis on your own.
He's basically given you the code! Based on the operating system you use, Google "biopython installation os_name" (where os_name is your OS name), and you'll get instructions. It takes 5 minutes to get biopython installed.
It is great to see that you have been making tireless efforts to learn new things and solve problems for the past few months, but hopefully your future mentor knows that you are "learning" and are not an expert. If that is not the case, it will get you in trouble down the road. It is always good to be upfront about the things you know and the things you don't.
I hope your assignments so far have been a test of "are you able to solve certain problems on your own" as opposed to the depth of your knowledge about bioinformatics/informatics.
I should accept the reality that I can't become a bioinformatician in an evening
That is certainly correct. No way to change that.
This is an odd way of challenging a prospective candidate before accepting them as a student. If she knows you can't program (I assume so, otherwise you would not be asking this question in the first place), why did she ask you to do this? Was the idea to propose just pseudo-code (and not actual working code)?
Hang in there, since the deadline has not passed yet. I hope this is an (unfortunately rather cruel) way of making you realize what you need to learn, should you decide to start the PhD program.
She asked me to find the exact positions of RNA-dependent RNA polymerase genes in the genomes of single-stranded RNA viruses. I have an ultimatum, possibly until Friday evening, to complete my task.
Why did you not use the simple solution that @Harold.smith.tarheel posted in the other thread? It would require manual work to go through the files, but at least you would know exactly what you did. While the code posted in this thread will work, you would not be able to answer any questions about it.
I was going to extract the start...end positions, but at noon she told me that these are from all virus families and that I should do it only for single-stranded RNA viruses.
Do a separate search for the accession numbers of ssRNA viruses and then select just those entries from your larger list (or you could redo @harold's search and restrict it to ssRNA using the advanced search options at NCBI). You will need to grab the accession numbers, but if you have the files it should be a quick re-run of the command you already used.
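Selecting the ssRNA entries from the larger list can be sketched roughly as below. This assumes you have both lists as plain accession strings, possibly with version suffixes; the function names and the accession IDs in the example are made up for illustration.

```python
def strip_version(accession):
    """Drop whitespace and the version suffix, e.g. 'ACC_0001.2' -> 'ACC_0001'."""
    return accession.strip().split(".")[0]


def select_accessions(all_accessions, wanted_accessions):
    """Keep the entries of all_accessions whose unversioned ID is wanted, in order."""
    wanted = {strip_version(a) for a in wanted_accessions}
    return [a for a in all_accessions if strip_version(a) in wanted]


# Made-up IDs, only to show the call
everything = ["ACC_0001.1", "ACC_0002.1", "ACC_0003.2"]
ssrna_only = ["ACC_0003", "ACC_0001"]
print(select_accessions(everything, ssrna_only))
```

Comparing without the version suffix is a design choice here, so that a list exported with versions still matches one exported without them.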
This is her new question from yesterday. I am new to this challenge... She has already asked for miRNA-seq, exome-seq, and many others. How can I be professional in all of them?
I'm pretty sure people ask for proof of experience - existing code perhaps. Is that the case? Either way, if you have to check with us for basic text processing, your mentor should definitely know this difficulty you have. You've been here for a while now and have progressed a lot - I don't see how you have difficulties with this simple task.
Sorry if this is rude, but you have to let go of self pity and face reality - your mentor and you have a HUGE communication gap. And then, seek actual useful help, not band-aid solutions that don't address the core problems.
Agreed, the mentor is probably trying to evaluate the OP's skill level with this task. I understand that the people on the forum are trying to be helpful but, by writing and troubleshooting the code, you're actually undermining the objective of the exercise. Is it really in the OP's best interest to pretend to possess a set of skills that are lacking?
And it's not a question of becoming "a bioinformatician in an evening". The OP has been a member of this community for nearly two years, has more than 1000 posts, and still has trouble performing the most rudimentary tasks. This type of question (where do I find XX database, how do I download YY data, and how do I parse it for ZZ information?) has been asked and answered multiple times (e.g., see this post from the OP 18 months ago). Yet there seems to be little ability to generalize the question and apply any prior knowledge, or to interpret/modify/troubleshoot the code that's offered.
I realize that this assessment sounds harsh, but it's not intended as a personal attack. Bioinformatics IS hard! I sympathize that the OP may not have received adequate training in the discipline, and that the mentor's request may seem unreasonable. But those problems will not be solved by the members of this forum, however well-meaning their intentions.
I definitely agree and realize that it might not be in her best interest. However, it's not our or my responsibility to judge that. If someone needs help, asks for help, and people here are willing to help, then that's all that's required. I'm not in the position (and don't desire) to train or provide pedagogical support.
People have to ask themselves whether their problem is something they are going to fix by themselves, struggling for hours, days or a week, or whether they need external help. Every problem you fix on your own makes you a better (bioinformatic) scientist. The next problem will be easier.
Given that the OP has a really short time to get accepted (for a PhD, if I interpret correctly) and that her situation is therefore far from optimal, I think she can use a lot of help. A few hours ago she had never run a piece of Python code.
But lying to your future supervisor is not a great start, even if the requests are unreasonable. OP will have to realize that she'll have to work extremely hard to make this work.
@WouterDeCoster, I respect and admire your non-judgmental approach to the problem (and no, I'm not being sarcastic). Yet even you acknowledge the dubious nature of "lying to your future supervisor", and assistance in this case abets that effort. Also, it's difficult to envision a scenario where deception by the OP ends well. But you're right - we're all adults here, and entitled to make our own choices.
This is not a direct response to @Ram's post, but I will put it here so it does not get buried in this long thread.
One thing we are forgetting is that the OP is from a country where many basic things most of us access easily are blocked (e.g. the other day she could not access the Docker website). I am not sure what exact PhD program she is trying to enroll in, but programming is something a student can learn in the first semester or two (nowadays students come from such diverse backgrounds for interdisciplinary programs that it is almost impossible to have a common yardstick for qualification).
I have continually reminded her to be open with her prospective mentor about her expertise and the difficulties she is facing in solving these problems. The questions in the last month or so are perhaps out of desperation, as she has apparently been asked to do "one more thing" out of left field.
If she starts this placement on the wrong foot, it will not bode well for the future. Hopefully she understands that.
The piece of code written by Gungor Budak expects the genbank file(s) in the current directory.
But I would recommend saving the following (slightly modified) code as extractPositions.py:
import sys
from Bio import SeqIO

# The GenBank file is given as the first command-line argument
record = SeqIO.read(sys.argv[1], 'genbank')
for feature in record.features:
    if feature.type == 'gene':
        print('Location:', feature.location)
        print('Locus tag:', feature.qualifiers['locus_tag'])
And execute that as python extractPositions.py yourfile.gb (just in your shell, not in python). Since it will print to stdout you can also redirect the output to a file.
But I have 1534 NCBI accessions for viruses. How can I do this for all of them? Manually it would take days to download the GenBank files one by one and then run this code on each of them.
I'm quite sure there should be a better way to download all the files in one go, I don't know how the ftp server is organized, perhaps wildcards can help. I didn't follow very well which files exactly you need :-/
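One possibility, instead of the FTP server, is to fetch the records by accession through NCBI's Entrez service using Biopython's Bio.Entrez module. The sketch below is untested against your list and assumes the accessions are already in a Python list; the function names, the batch size of 100, the output path and the e-mail address are all placeholders I made up.

```python
def batched(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def download_genbank(accessions, out_path, email, batch_size=100):
    """Fetch GenBank records in batches and write them into one multi-record file."""
    # Imported here so that batched() can be used without Biopython installed
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    with open(out_path, "w") as out:
        for chunk in batched(accessions, batch_size):
            handle = Entrez.efetch(db="nucleotide", id=",".join(chunk),
                                   rettype="gb", retmode="text")
            out.write(handle.read())
            handle.close()
```

A call such as download_genbank(my_accessions, "viruses.gb", "you@example.org") would then produce one file containing all records, which SeqIO.parse can read record by record.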
Then we can modify the Python code to run on each .gb file in the directory and print the desired output, perhaps with a title or whatever information you want.
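Running over every .gb file in a directory could look roughly like this. It is a sketch that assumes the files end in .gb and that a tab-separated line per gene is the output you want; the function names are made up.

```python
import glob
import os


def genbank_files(directory):
    """Return the sorted paths of all .gb files in `directory`."""
    return sorted(glob.glob(os.path.join(directory, "*.gb")))


def report_genes(directory):
    """Print file name, record ID, locus tag and location for every gene feature."""
    from Bio import SeqIO  # requires Biopython

    for path in genbank_files(directory):
        # parse() also copes with files that hold more than one record
        for record in SeqIO.parse(path, "genbank"):
            for feature in record.features:
                if feature.type == "gene":
                    tag = feature.qualifiers.get("locus_tag", ["NA"])[0]
                    print(os.path.basename(path), record.id, tag,
                          feature.location, sep="\t")
```

Printing the file name and record ID on every line means the output still makes sense after it has been redirected into a single file.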
This Python code gives me all of the locus tags in a GenBank file. How can I extract only the locus tags of genes that encode RNA-dependent RNA polymerases, and not all of them?
I changed the code like so
import sys
from Bio import SeqIO
record = SeqIO.read(sys.argv[1], 'genbank')
for feature in record.features:
if feature.type == 'gene':
if 'RNA dependent RNA polymerase' not in product['gene'][0].lower():
Indentation is important in python, it makes code blocks. The code below should work or give another error ;)
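import sys
from Bio import SeqIO

record = SeqIO.read(sys.argv[1], 'genbank')
for feature in record.features:
    # The product name is annotated on CDS features, not on gene features
    if feature.type == 'CDS':
        product = feature.qualifiers.get('product', [''])[0]
        # Compare in lower case and treat hyphens as spaces
        if 'rna dependent rna polymerase' not in product.lower().replace('-', ' '):
            continue
        print('Location:', feature.location)
        print('Locus tag:', feature.qualifiers.get('locus_tag'))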
As you can see, we will probably be able to fix your issue before the deadline. But you need to make sure that you understand what you are doing, and are not just running code you found online.
You know, I can download a GenBank file that contains all virus genomes, but I only need ssRNA viruses. So I should first find a way to download a GenBank file of ssRNA viruses only, and then modify this Python code, because this code needs a separate GenBank file for each record and it should only extract genes that encode RNA-dependent RNA polymerases.
That code is designed to work on one GenBank record. You have 2088.
Can you see if you are able to grep "RNA polymerase"? Then you should be able to find the lines around it that you need (look at the -A and -B options for grep).
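If grep is not at hand, the same idea can be sketched in plain Python. This is only a rough equivalent of grep -i with -B/-A (it matches case-insensitively and merges overlapping context); the function name and the sample lines are made up.

```python
def grep_context(lines, pattern, before=0, after=0):
    """Return (line_number, text) pairs for matching lines plus their context."""
    keep = set()
    for i, line in enumerate(lines):
        if pattern.lower() in line.lower():
            first = max(0, i - before)
            last = min(len(lines), i + after + 1)
            keep.update(range(first, last))
    return [(i + 1, lines[i]) for i in sorted(keep)]


# Made-up lines in the style of a GenBank feature table
sample = [
    "     CDS             87..1109",
    '                     /locus_tag="X_0001"',
    '                     /product="RNA-dependent RNA polymerase"',
    "     CDS             1200..1500",
]
for number, text in grep_context(sample, "RNA polymerase", before=2):
    print(number, text)
```

With before=2 the match on the /product line also pulls in the two lines above it, which is where the coordinates and the locus tag sit in this sample.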
import sys
from Bio import SeqIO

# SeqIO.parse iterates over every record in a multi-record file
records = SeqIO.parse(sys.argv[1], 'genbank')
for record in records:
    for feature in record.features:
And follow this by the rest of your code.
So this code loops over each record in all records from the file (I usually use the singular as name when I loop over the plural, that's intuitive). Then for each record it loops over the features.
import sys
from Bio import SeqIO

records = SeqIO.parse(sys.argv[1], 'genbank')
for record in records:
    for feature in record.features:
        # Not every feature has a locus tag (e.g. 'source'), so use .get()
        print('Location:', feature.location)
        print('Locus tag:', feature.qualifiers.get('locus_tag'))
import sys
from Bio import SeqIO

print("protein_id\tlocation")
for record in SeqIO.parse(sys.argv[1], 'genbank'):
    for feature in record.features:
        if feature.type != 'CDS':
            continue
        # Qualifier values are lists; take the first entry, or '' if absent
        product = feature.qualifiers.get('product', [''])[0]
        if "RNA-dependent RNA polymerase" in product:
            protein_id = feature.qualifiers.get('protein_id', ['NA'])[0]
            print(protein_id + '\t' + str(feature.location))
She told me I have 1530 NCBI GenBank files from ssRNA viruses, and these viruses depend on RNA-dependent RNA polymerase, but I have only found about 100. So some of them may have another name, and I should run a more flexible search on this GenBank file...
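A more flexible search could match several spellings of the product name at once. The list of names below is my assumption, not a complete vocabulary, so it is worth printing the distinct product names in your own file first and extending the pattern from there. Also note that in some records the polymerase may be annotated only as part of a polyprotein, which no name match on the CDS product would catch.

```python
import re

# Assumed alternative spellings; extend after inspecting your own file
RDRP_PATTERN = re.compile(
    r"rna[- ]dependent rna polymerase"
    r"|rna[- ]directed rna polymerase"
    r"|\brdrp\b"
    r"|replicase",
    re.IGNORECASE,
)


def looks_like_rdrp(product):
    """Return True if a /product string matches one of the assumed RdRp names."""
    return bool(RDRP_PATTERN.search(product))


for name in ["RNA dependent RNA polymerase", "putative RdRp", "capsid protein"]:
    print(name, looks_like_rdrp(name))
```

In the Biopython loop, this function would take the place of the exact string comparison on the product qualifier.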
Anyway, thank you and @genomax2 for not leaving me alone in such a stressful situation.
I am 99 percent sure she will reject me, but I tried what I could during these months.
She told me this is a mathematics and computer department where people are expected to script... She said that I know some basics of data analysis, not bioinformatics,
and suggested that I apply to labs that need someone to analyze their data sets, and that before that I should learn a programming language.
Now it is a cloudy morning in Jena and I am thinking about how to start from scratch.
Maybe I will start with new questions on Biostars.
I never lied to her about my skills; I sent her my CV.
Thank you.
Please consider this screenshot:
http://q07i.imgup.net/Screenshot9673.png
I need to extract the start...end positions and the corresponding gene IDs from this file.