Hi, I'm using a bioinformatics tool to parse my sequences, and I'd like to extract some information I need. There are thousands of query names corresponding to different sequences, like this:
>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(207..914)] [gbkey=CDS]
What I need is "[location=(207..914)]". How can I achieve this? The name differs between sequences. I tried to "split" by space and take the fifth element, but in some cases the "location" is not the fifth one, and sometimes there is no "location" at all, meaning there is no CDS in that sequence, so it can just be skipped. I'm thinking of using "grep" or "re.search", but it didn't work:
    for line in open(file,"r").readlines():
        if "location=" in line:
            cds = grep “[location = *]” line
            print(cds)
Does anyone have an idea?
Many thanks!
If you want to stick with re, then good one :). Further shortening the code:
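A minimal sketch of the re approach, not necessarily the code originally posted here. It assumes the location value never contains a "]" character; the names LOCATION_RE and extract_location are made up for illustration. Because the pattern searches the whole header, it does not matter which position the "[location=...]" field is in, and headers without one simply give None.

```python
import re

# Matches the whole "[location=...]" token.
# Assumption: the value itself never contains a "]" character.
LOCATION_RE = re.compile(r"\[location=[^\]]*\]")

def extract_location(header):
    """Return the "[location=...]" token from a FASTA header, or None if absent."""
    match = LOCATION_RE.search(header)
    return match.group(0) if match else None

header = (">lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] "
          "[protein=hypothetical protein] [protein_id=WP_092665365.1] "
          "[location=(207..914)] [gbkey=CDS]")
print(extract_location(header))  # prints [location=(207..914)]
```

In the original loop, this would replace the grep line: call extract_location(line) and print the result only when it is not None.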
output:
input:
otherwise:
grep -e 'location' myfile.fasta | cut -f 6 -d ' ' > locations.txt
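Note that cut picks a fixed field number, which is the same limitation the question mentions for split. If the field position varies, a position-independent alternative is to read every "[key=value]" pair into a dictionary and look up "location" by name. This is only a sketch under the assumption that keys contain no "=" and values contain no "]"; ATTR_RE, header_attributes and locations are hypothetical names.

```python
import re

# One "[key=value]" pair. Assumption: keys have no "=", values have no "]".
ATTR_RE = re.compile(r"\[([^=\]]+)=([^\]]*)\]")

def header_attributes(header):
    """Parse all "[key=value]" pairs of a FASTA header into a dict."""
    return dict(ATTR_RE.findall(header))

def locations(lines):
    """Yield (sequence_id, location) for each header line that has a location."""
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        attrs = header_attributes(line)
        if "location" in attrs:
            # The sequence ID is the first whitespace-separated token, minus ">".
            yield line.split()[0][1:], attrs["location"]

example = [
    ">lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] "
    "[protein=hypothetical protein] [protein_id=WP_092665365.1] "
    "[location=(207..914)] [gbkey=CDS]",
    "ATGAAACGT",
    ">no_cds_here [gbkey=misc]",
]
for seq_id, loc in locations(example):
    print(seq_id, loc)
```

To run it on a real file, pass an open file handle instead of the example list, for instance the handle for myfile.fasta, and write each pair to locations.txt.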