Hello!
I am very new to the world of python (or any coding), but have installed biopython which I believe will help me do what I am trying to accomplish. I am working with RNAseq data for my project, and wanting to run my genelists through DAVID to look at enrichment. The output of the RNAseq pipeline thru bowtie/tophat/cuffdiff, gives me log fold increases for each gene, with the chromosome identifier of SMaxxxxx. That would be fine, but DAVID does not seem to be able to handle the fact that my organism has 3 replicons (1 chromosome and 2 megaplasmids), that are SMa, SMb and SMc. When I try to use these identifiers as the gene list, DAVID is only able to find 1/3 (ie. one replicon) and identify them as genes. I was playing around and it seems like it can identify genes from all replicons if I use the gene identifier such as lacI, manB, etc. In the genbank file the gene notation I want is labelled: gene="xxxX"
I want to do two things:
- Go through each replicons genbank file and pull out the gene identifier for each gene in the 4 letter format
- Feed in a list of 150 genes in the SMaxxxx format, and have it output the 4 letter gene identifier for each gene
So far I have been able to read in a genbank file, but not pull out the information I want.
#to read in list of genes I want to pull out
with open('synthgenelistfordavid.txt') as f:
content = f.readlines()
wanted = [line.rstrip('\n') for line in open('synthgenelistfordavid.txt')]
#then I've gotten this far, import file and set up a for loop
from Bio import SeqIO
genbank_file = "sma.gbk"
for record in SeqIO.parse("sma.gbk", "genbank"):
Hi mdelow,
For the first question : You can use BioPython to parse a genBank file. You have to know that, with BioPython if you want to have an access to a specific part of the file, you have to read the CookBook to find exactly what are you looking for and its name in the documentation.
An alternative can be, if you have a "FLAG" to localize your information in the file, it is to develop you own Python Script. In you case, what are you interested in is the flag
\gene="XXX"
. So, with theif my_pattern in current_line
of python, you can try it.Warning : If you have various genes in the file, you need to take it in account! ;)
For instance, in the CookBook, part 4.2.3, you have an example of the records objects attributes and what they are containing. I think not that exists an attribute "gene_name" with the "XXX".
For the second one, can you please clarify the point?
Doing some more reading, what I want is the gene="XXX", the identifier I have is the
locus_tag
. For my organism, thelocus_tag
is in the format SMaXXXX - but the name of that identifier is locus_tag - sorry I hadn't figured that out earlier.So what I am trying to do in both cases is use a list of locus tags and loop thru every coding sequence in the genbank file.
The list for #2 is roughly 150 locus tags, the list for #1 is more like 6,000 locus tags (every gene in the genome).
So for #2, if the locus_tag in each CDS matches a
locus_tag
in my list, I want it to print thatlocus_tag
and the corresponding gene identifier.Same thing for #1, but the reality is in that case, that it is just all the genes. I was thinking I could use a list of all the genes to keep the same "list" approach to the problem, but another approach would be to ask biopython to print the
locus_tag
and corresponding gene ID for every CDS in the genbank file.My genbank file only has one LOCUS, but many coding sequences (CDS).
Here is an example of some CDS side by side in the file. All CDS have a
locus_tag
, not all CDS have a gene name, because many genes in the genome are not identified genes, they just code for "hypothetical proteins". For the downstram process of using the DAVID tool, I can't use hypothetical proteins anyways, so only thelocus_tag
s with gene names are useful.Is this making more sense?
I solved it by doing this: