Hi All,
I'm trying to write a script that takes a given GenBank file and converts it to both a FASTA file and a BED file, so I can then use them with the mapper and viewers downstream. The file conversion itself is pretty straightforward. I'm not sure which of the Biopython developers wrote SeqIO.convert, but they have all my gratitude. In fact, I have found all of Biopython really empowering and enjoyable; however, as I'm still inexperienced, I'm having some trouble extracting the information I want.
What I want is just the basics: the name of the gene, start, stop, and strand. When I read in the file and look at the features, I see this:
genome.features
[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1192), strand=1), type='source'),SeqFeature(FeatureLocation(ExactPosition(23), ExactPosition(1127), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(23), ExactPosition(1127), strand=1), type='CDS')]
I know this is a list of SeqFeature objects, but I'm having trouble getting at that information, so I explored by printing each item and its type:
for i in genome.features:
    print(i, type(i))
and got:
type: source
location: [0:1192](+)
qualifiers:
Key: country, Value: ['USA']
Key: db_xref, Value: ['taxon:38170']
Key: mol_type, Value: ['genomic RNA']
Key: organism, Value: ['Avian orthoreovirus']
Key: segment, Value: ['S4']
Key: strain, Value: ['AVS-B']
<class 'Bio.SeqFeature.SeqFeature'>
type: gene
location: [23:1127](+)
qualifiers:
Key: gene, Value: ['sigma-NS']
<class 'Bio.SeqFeature.SeqFeature'>
type: CDS
location: [23:1127](+)
qualifiers:
Key: codon_start, Value: ['1']
Key: db_xref, Value: ['GI:315466580']
Key: gene, Value: ['sigma-NS']
Key: product, Value: ['sigma-NS protein']
Key: protein_id, Value: ['CBX25032.1']
Key: translation, Value: ['MDNTVRVGVSRNTSGAAGQTVFRNYYLLRCNISADGRNATKAVQSHFPFLSRAVRCLSPLAAHCADRTLRRDNVKQILTRELPFPSDLINYAHHVNSSSLTTSQGVEAARLVAQVYGEQLSFDHIYPTGSATYCPGAIANAISRIMAGFVPHEGDNFTPDGAIDYLAADLVAYKFVLPYMLDIVDGRPQIVLPSHTVEEMLSNTSLLNSIDASFGIESKSDQRMTRDAAEMSSRSLNELEDHEQRGRMPWKIMTAMFAAQLKVELDALADERVESQANAHVTSFGSRLFNQMSAFVPIDRELMELALLIKEQGFAMNPGQVASKWSLIRRSGPTRPLSGARLEIRNGNWTIREGDQTLLSVSPARMA']
<class 'Bio.SeqFeature.SeqFeature'>
I know that the Key/Value pairs mean it's a dictionary, and when I try to print genome.features.type I get an error about the list. So I think each feature is a list with two lists and a dictionary inside it, but I cannot figure out how to extract the information.
Can anybody point me in the right direction? Either an explanation or a pointer to the documentation I need to read would be much appreciated.
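Note for readers: the printed output above suggests that each item in genome.features is a single Bio.SeqFeature.SeqFeature object with type, location, and qualifiers attributes, rather than nested lists. That would also explain why genome.features.type fails, since the attribute lives on each feature and not on the list. Below is a minimal sketch of the extraction under that assumption. The function names gene_rows and genbank_to_bed, the "unknown" fallback name, and the score of 0 are illustrative choices, not part of Biopython. Biopython locations are 0-based and half-open, which matches BED coordinates, so start and end are written unchanged.

```python
def gene_rows(record):
    """Yield BED6-style tuples for every 'gene' feature of a record.

    Assumes `record` behaves like a Bio.SeqRecord.SeqRecord: it has an `id`
    and a `features` list whose items expose `type`, `location`, `qualifiers`.
    """
    strand_symbols = {1: "+", -1: "-"}
    for feature in record.features:
        if feature.type != "gene":
            continue  # skip 'source', 'CDS', and anything else
        # Qualifier values are lists of strings; take the first gene name.
        # "unknown" is a hypothetical placeholder for features without one.
        name = feature.qualifiers.get("gene", ["unknown"])[0]
        start = int(feature.location.start)
        end = int(feature.location.end)
        strand = strand_symbols.get(feature.location.strand, ".")
        yield (record.id, start, end, name, 0, strand)


def genbank_to_bed(genbank_path, bed_path):
    """Write one tab-separated BED line per gene feature in a GenBank file."""
    # Imported here so gene_rows can be tried out without Biopython installed.
    from Bio import SeqIO

    with open(bed_path, "w") as handle:
        for record in SeqIO.parse(genbank_path, "genbank"):
            for row in gene_rows(record):
                handle.write("\t".join(str(field) for field in row) + "\n")
```

Calling genbank_to_bed("genome.gb", "genome.bed") (both file names are placeholders) would write the gene coordinates, and SeqIO.convert can still be used separately for the FASTA file.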
Can you change the title to something better and more meaningful? (Keep in mind the people visiting the site: how could they benefit from this question?)
sure, no problem