Entering edit mode
4.8 years ago
radha.jg
•
0
I have the following code:
import sys
from BCBio import GFF
file = sys.argv[1]
for record in GFF.parse(file):
print([record.seqid,record.source, record.start, record.end, record.strand])
My question is how can extract the ids, the chromosome, the coordinates and the strand from a GFF3 files using python/biopython (I'm bounded to python because this is part of a bigger program writen in said language). I thought since record.seqid, record.source, record.start, record.end, record.strand is used to write new GFF, this could work, but clearly it doesn't.
I need this solved ASAP but also I need to learn, so please, anyone out there, please help.
Thanks in advance
Jenifer
Try this package in python - https://pythonhosted.org/gffutils/
This package creates a "local" database from your gff file that you can interact with in relatively easier manner. For instance, I was able to extract start stop position of gene along with exons coding for that gene and their start and stop positions. Hope this helps!
Ok, so now what I need is an iterator to get all CDS with all start, end and and strand for a chromosome. Does this library provide one, or how do I get this.
Context: For each chromosome I calculated the GC_Skew and now I need to show in that plot, the zones that have CDS, distinguishing the zones that have CDS and the strand. The function:
GC_Skew parameter is a list with the GC-skew calculation, window and overlap is the size of the window and the overlap, salida is the name of the chromosome (seqid) and anotation can be None or a created database like this:
anotation = gffutils.create_db(fn, dbfn = anotation_file)
So the thing is I don't have a clear idea how gffutils works and I don't know if I can get all CDS to add that data to the plot If you can give me a hand in this, I will apreciate itThe tutorial (https://pythonhosted.org/gffutils/) is very straightforward, for example let us presume that the local database that you have created is by the name db and the gene whose CDS you would like to extract is assigned name gene, then you can accomplish what you want to by the following:
See if this helps! To be honest I do not have an entire idea of what you are looking to do, if this does not fix your problem then maybe elaborate your next problem and we can see if we can help
Yes, I saw that, so now what I need to do is to, get a list of genes for one seqid (salida parameter), from the gff3 database. Then construct a list of tuples for each gene and then, add that data to the plot... Mmm I really need to meditate how to do this
EDIT: ok, thing is, I just have the chromosome (seqid) so in order for your solution to work, I need to get a list of the genes, but i don't know how to get it. The only thing I know is, for my script so work, is something like this:
The thing is, how do I get ge genes object/list?
Is it possible for you to share a short snapshot of your gff file?