Python: grep the strings within "[ ]"
7.0 years ago
horsedog ▴ 60

Hi, I'm using a bioinformatics tool to parse my sequences, and I'd like to extract some information from the output. There are thousands of query names corresponding to different sequences, like this:

  >lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(207..914)] [gbkey=CDS]

What I need is "[location=(207..914)]". How can I achieve this? The name differs between sequences. I tried splitting on spaces and taking the fifth element, but in some cases the location is not the fifth one, and sometimes there is no location at all, meaning there is no CDS in that sequence, so it should just be skipped. I was thinking of using "grep" or "re.search", but it didn't work:

for line in open(file,"r").readlines():   
  if "location=" in line:  
    cds = grep “[location = *]” line  
  print(cds)

Does anyone have an idea?
Many thanks!

python • 1.7k views

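If you want to stick with re, then:

import re

with open("test", "r") as handle:
    for line in handle:
        if "location" in line:
            # split the header on spaces and print the piece holding the location
            loc = re.split(r" ", line.strip())
            for m in loc:
                if "location" in m:
                    print(m)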

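Good one :). Shortening the code further (this only works when the location is always the sixth whitespace-separated piece of the header):

with open("test.txt", "r") as handle:
    for line in handle:
        if "location" in line:
            # index 5 is the sixth piece, because "hypothetical protein" contains a space
            print(line.split()[5])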

output:

[location=(207..914)]
[location=(2070..9140)]
[location=(20700..91400)]

input:

$ cat test.txt
>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(207..914)] [gbkey=CDS]
>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(2070..9140)] [gbkey=CDS]
>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(20700..91400)] [gbkey=CDS]

I tried to use "split" by space to take the fifth element, but in some cases the "location" is not the fifth one

Otherwise, if the location were always in the same position: grep -e 'location' myfile.fasta | cut -f 6 -d ' ' > locations.txt
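
Since the position of the field can vary, a position-independent approach may be safer than counting columns. Below is a minimal sketch in Python that searches each header for the bracketed field. The names extract_location and locations_in_file are made up for illustration, and the pattern assumes the field always looks like "[location=...]" with no nested square brackets.

```python
import re

# Matches "[location=...]" anywhere in the header.
# Assumption: the value itself never contains a "]".
LOCATION_RE = re.compile(r"\[location=[^\]]*\]")


def extract_location(header):
    """Return the '[location=...]' field of a FASTA header, or None if absent."""
    match = LOCATION_RE.search(header)
    return match.group(0) if match else None


def locations_in_file(path):
    """Yield the location field of every header line that has one."""
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                location = extract_location(line)
                if location is not None:  # headers without a location are skipped
                    yield location
```

Headers without a location simply return None, which covers the "no CDS, skip it" case from the question.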

7.0 years ago
st.ph.n ★ 2.7k

Grep is not a Python command. If you're sticking with Python rather than bash commands, here's a quick script to get you started:

#!/usr/bin/env python
import sys
with open(sys.argv[1], 'r') as f:
    for line in f:
        # find FASTA headers
        if line.startswith(">"):
            # check if 'location' is in the header
            if 'location' in line:
                # split header by spaces into a list
                x = line.strip().split(' ')
                # for each item in the header, check if 'location' is in that item
                for i in x:
                    if 'location' in i:
                        print(i)

Prints:

[location=(207..914)]

Save as find_loc.py and run as: python find_loc.py myfile.fasta > locations.txt
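
Splitting on spaces also cuts fields such as "[protein=hypothetical protein]" in two. If other fields are needed later, one possible alternative is to read every bracketed key=value pair into a dict. This is only a sketch: the name parse_header_fields is hypothetical, and it assumes every field has the form "[key=value]" with no "]" inside the value.

```python
import re

# Each field looks like "[key=value]"; the value may contain spaces.
# Assumption: neither key nor value contains a "]".
FIELD_RE = re.compile(r"\[([^=\]]+)=([^\]]*)\]")


def parse_header_fields(header):
    """Return a dict of the bracketed key=value fields in a FASTA header."""
    return dict(FIELD_RE.findall(header))


header = (">lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] "
          "[protein=hypothetical protein] [protein_id=WP_092665365.1] "
          "[location=(207..914)] [gbkey=CDS]")
fields = parse_header_fields(header)

# .get() returns None when the header has no location field
print(fields.get("location"))
```

With a dict, the order of the fields in the header no longer matters, and a missing location shows up as None instead of an IndexError.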

7.0 years ago
>>> import re
>>> with open("test.txt", "r") as t:
...     f = t.read()
...
>>> # raw string; the two dots are escaped so they match literal "." characters
>>> pattern = re.compile(r'\[location=\([0-9]+\.\.[0-9]+\)\]')
>>> re.findall(pattern, f)

output:

===========================

['[location=(207..914)]',
 '[location=(2070..9140)]',
 '[location=(20700..91400)]',
 '[location=(207000..914000)]']

==============================

>>> print (f)

=====================================

output:

>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(207..914)] [gbkey=CDS]

>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(2070..9140)] [gbkey=CDS]

>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(20700..91400)] [gbkey=CDS]

>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(207000..914000)] [gbkey=CDS]

======================================
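
If the coordinates themselves are needed as numbers, capture groups can pull them out in the same pass. The sketch below covers only the simple "(start..end)" form shown in this thread. The name location_coords is made up for illustration, and the length calculation assumes inclusive coordinates.

```python
import re

# Covers only the simple "(start..end)" form seen in this thread.
COORDS_RE = re.compile(r"\[location=\((\d+)\.\.(\d+)\)\]")


def location_coords(text):
    """Return a list of (start, end) integer pairs found in text."""
    return [(int(start), int(end)) for start, end in COORDS_RE.findall(text)]


sample = ("[location=(207..914)] [gbkey=CDS]\n"
          "[location=(2070..9140)] [gbkey=CDS]\n")

for start, end in location_coords(sample):
    # length assumes both ends are inclusive
    print(start, end, end - start + 1)
```

Headers whose location does not fit this simple form are silently skipped, so the pattern would need widening for anything more complex.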
