I have a augustus output file, which I further edited and now it looks like this.
gdna.0.1
gnl|Hsal_3.3|scaffold48 AUGUSTUS gene 3912 10875 0.01 - . ID=g1
gnl|Hsal_3.3|scaffold48 AUGUSTUS transcript 3912 10875 0.01 - . ID=g1.t1;Parent=g1
gnl|Hsal_3.3|scaffold48 AUGUSTUS transcription_end_site 3912 3912 . - . Parent=g1.t1
gdna.0.2
gnl|Hsal_3.3|scaffold48 AUGUSTUS exon 3912 5899 . - . Parent=g1.t1
gnl|Hsal_3.3|scaffold48 AUGUSTUS stop_codon 4604 4606 . - 0 Parent=g1.t1
gnl|Hsal_3.3|scaffold48 AUGUSTUS CDS 4604 5656 0.89 - 0 ID=g1.t1.cds;Parent=g1.t1
gnl|Hsal_3.3|scaffold48 AUGUSTUS start_codon 5654 5656 . - 0 Parent=g1.t1
gdna.0.3
gnl|Hsal_3.3|scaffold48 AUGUSTUS exon 9919 10875 . - . Parent=g1.t1
gnl|Hsal_3.3|scaffold48 AUGUSTUS transcription_start_site 10875 10875 . - . Parent=g1.t1
gdna.0.4
gnl|Hsal_3.3|scaffold112 AUGUSTUS gene 9196 31470 0.01 + . ID=g2
gnl|Hsal_3.3|scaffold112 AUGUSTUS transcript 9196 31470 0.01 + . ID=g2.t1;Parent=g2
gnl|Hsal_3.3|scaffold112 AUGUSTUS transcription_start_site 9196 9196 . + . Parent=g2.t1
gdna.0.5
gnl|Hsal_3.3|scaffold112 AUGUSTUS exon 9196 9505 . + . Parent=g2.t1
gnl|Hsal_3.3|scaffold112 AUGUSTUS exon 9623 10706 . + . Parent=g2.t1
gnl|Hsal_3.3|scaffold112 AUGUSTUS exon 10757 11547 . + . Parent=g2.t1
gnl|Hsal_3.3|scaffold112 AUGUSTUS exon 12892 13859 . + . Parent=g2.t1
I have a list of gene ids like this:
gdna.0.2
gdna 0.4
I want to get the annotations of these gene ids alone into a new file from the augustus output file I specified above, I initially wanted to use a awk to do this, but since the input file is huge and also the number of gene ids for which I needed the annotation entries were also huge, I understood its best to write a script.
The output should look like this
gdna.0.2
gnl|Hsal_3.3|scaffold48 AUGUSTUS exon 3912 5899 . - . Parent=g1.t1
gnl|Hsal_3.3|scaffold48 AUGUSTUS stop_codon 4604 4606 . - 0 Parent=g1.t1
gnl|Hsal_3.3|scaffold48 AUGUSTUS CDS 4604 5656 0.89 - 0 ID=g1.t1.cds;Parent=g1.t1
gnl|Hsal_3.3|scaffold48 AUGUSTUS start_codon 5654 5656 . - 0 Parent=g1.t1
gdna.0.4
gnl|Hsal_3.3|scaffold112 AUGUSTUS gene 9196 31470 0.01 + . ID=g2
gnl|Hsal_3.3|scaffold112 AUGUSTUS transcript 9196 31470 0.01 + . ID=g2.t1;Parent=g2
gnl|Hsal_3.3|scaffold112 AUGUSTUS transcription_start_site 9196 9196 . + . Parent=g2.t1
I tried using dictionaries for this purpose, but have no luck in moving further after this.
from collections import defaultdict #this will make your life simpler
f = open('test.txt','r')
list=defaultdict(str)
name = ''
for line in f:
if line.startswith('gdna.'):
name = line[1:-1]
continue
list[name]+=line.strip()
print list
Can anyone suggest if there is a more effective way to do this?
Thanks
Thanks a lot. It is very helpful and exactly what i was looking for.