calling a function using glob.glob
7.4 years ago
bio90029 ▴ 10

Hi, hopefully someone can help me with this. I have prepared a script to extract data from a file; this part works very well and does what I need. The problem comes when I am using glob.glob, and subprocess to call the function. I keep getting the error message below, and I do not know how to handle it. Error message:

  File "parsing_blast.py", line 45, in <module>
    my_file=subprocess.Popen(cmd)
  File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1238, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

Thanks for your help.

from Bio.Blast import NCBIXML
from Bio import SeqIO, SearchIO
import sys, glob, subprocess, os

folders = glob.glob('/home/me/my_folder/H*')
print folders
for folder in folders:
    my_files=glob.glob(folder + '/*.xml')
    print my_files
    def parsing_blast():

        results_handle=open(my_files[0])
        blast_results=NCBIXML.parse(results_handle)
        #blast_results=NCBIXML.parse(results_handle)
        output_handle=open(folder + ' my_data_parse.xml','w')   
        #to extract some information from the blast file
        for blast_result in blast_results:
            sequence_length=blast_result.query_letters #this is the length of the sequence
            gene=blast_result.query #gene name
            #print 'The length is:', sequence_length #check point
            #print gene         #check point
            for description in blast_result.descriptions:
                title=description.title  #query seq name
                #print description.title #check point
                for alignment in blast_result.alignments:               
                    for hsp in alignment.hsps:
                        identity=hsp.identities #matching bases
                        num_gaps=hsp.gaps       #number of gaps
                        #print identity         #check point
                        #print num_gaps         #check point

                        per_identities=float(identity)/float(sequence_length)*float(100) 
                        #print per_identities   #check point
                        #sys.exit()

                        extracted_data= (gene + ',' + title + ','+ 'number_gaps: ' + str(num_gaps) +','+ 'per_identity: '+ str(per_identities) +'\n')

                        output_handle.write(extracted_data)
        output_handle.close()               

                    #sys.exit()   
    parsing_blast()



    print 'The file has been created'
biopython python glob.glob • 3.9k views

The problem comes when I am using glob.glob, and subprocess to call the function

Why do you use subprocess to call the function?!


I have a hundred files, all starting by H, and in all of them I have an xml file I would like to parse. I do not want to do it one by one. So I want the script to get the information I want from the file within each H* folder and store that information in another file. Once the file is created in one folder, it should move on to the next folder, and so on. I used glob.glob and subprocess before but within a function. I just wanted to use it from outside the function so I could add another function.


I used glob.glob and subprocess before but within a function.

As far as I understood this is an entirely different use case. Instead, you need the multiprocessing module for parallelizing a function across many files.


Hi, I have re-edited the script, and now it works perfectly. But now I need to find out how to tell the program to store the created file within the H folders. Any help with that would be appreciated.


I edited my answer to set the output_handle file to a file within the XML file source directory. Is that what you meant?

7.4 years ago
Dan D 7.4k

The error message you're seeing is unrelated to your use of the glob function.

This line:

cmd=['parsing_blast']

is equivalent to typing

parsing_blast

on the command line. There's apparently no executable by that name available.
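A minimal sketch of the difference, written for Python 3 rather than the Python 2.6 in the traceback. The function body and the program name `parsing_blast_no_such_program` are hypothetical stand-ins; the point is only that `Popen` looks for a program on disk, while a Python function is simply called.

```python
import subprocess

def parsing_blast():
    # Hypothetical stand-in for the real parser defined in the script.
    return "parsed"

# Popen treats the list as a command line and searches for an executable
# with that name. No such program exists, so it raises OSError
# (FileNotFoundError in Python 3, which is a subclass of OSError).
try:
    subprocess.Popen(["parsing_blast_no_such_program"])
    popen_error = None
except OSError as exc:
    popen_error = exc

# A function defined in the same script is called directly instead.
result = parsing_blast()
```

If the goal is only to run the function once per folder, the direct call inside the loop is enough and subprocess is not needed.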

Are you trying to asynchronously call the parsing_blast function you've defined?

Some quick feedback while I'm looking at your code:

You can simplify your glob query:

my_files=glob.glob('/home/me/my_folder/H*/*.xml')

And it's more efficient to define your function outside of the loop. Else you're recreating it with every loop iteration, which seems unnecessary unless I'm missing something.
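Both suggestions can be combined in a short sketch, assuming Python 3. The names `parse_one`, `parse_all` and `OUTPUT_NAME` are hypothetical, and the parsing itself is replaced by a placeholder write. The output path is built from the directory of each input file, so the result lands inside the same H folder. Because the output name from the question also ends in .xml, the sketch skips it so that a second run does not try to parse its own output.

```python
import glob
import os
import tempfile

OUTPUT_NAME = "my_data_parse.xml"  # output file name taken from the question

def parse_one(xml_path):
    """Hypothetical stand-in for the BLAST parsing; returns the output path."""
    out_path = os.path.join(os.path.dirname(xml_path), OUTPUT_NAME)
    with open(out_path, "w") as out:
        out.write("parsed " + os.path.basename(xml_path) + "\n")
    return out_path

def parse_all(parent_dir):
    """Run parse_one on every .xml file inside folders starting with H."""
    pattern = os.path.join(parent_dir, "H*", "*.xml")
    written = []
    for xml_path in sorted(glob.glob(pattern)):
        if os.path.basename(xml_path) == OUTPUT_NAME:
            continue  # skip output left behind by an earlier run
        written.append(parse_one(xml_path))
    return written

# Demo on a throwaway directory tree instead of /home/me/my_folder.
demo_root = tempfile.mkdtemp()
for name in ("H1", "H2", "other"):
    os.makedirs(os.path.join(demo_root, name))
    with open(os.path.join(demo_root, name, "blast.xml"), "w") as fh:
        fh.write("<xml/>")

written = parse_all(demo_root)
```

Here the function is defined once at module level and the loop only calls it, which is the restructuring suggested above.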


Thanks, I had tried putting the whole path in my_files, but I could not make it work. I would like the script to parse the XML files in my H* folders. I have a hundred folders, and each contains an XML file. If I run just the function in a single folder it works perfectly; I am trying to write the script to extract the data from one folder after another. Thanks for your help.


I have re-edited the script, and now it is working. Now I need to find out how to store the created file within my H folders.

7.4 years ago
steve ★ 3.5k

In addition to the others' comments, if I were to try to accomplish the task you've described:

I have a hundred files, all starting by H, and in all of them I have an xml file I would like to parse.

I would use a script like this:

#!/usr/bin/env python

import os

def find_H_dirs(parent_dir):
    '''
    Find all the dirs in the parent_dir that start with H
    '''
    matches = []
    for item in os.listdir(parent_dir):
        item_path = os.path.join(parent_dir, item)
        if os.path.isdir(item_path) and item.startswith("H"):
            matches.append(item_path)
    return matches

def find_XML_files(dir):
    '''
    Find all the .xml files in a dir
    '''
    matches = []
    for item in os.listdir(dir):
        item_path = os.path.join(dir, item)
        if os.path.isfile(item_path) and item.endswith(".xml"):
            matches.append(item_path)
    return matches

def process_XML_file(XML_file, output_handle):
    '''
    Do a thing to the XML file
    '''
    print("Put your code for processing the {0} file here.".format(XML_file))


parent_dir = "/path/to/parent_dir"
# output_handle = "/path/to/my_data_parse.xml" # if you want it to always go to the same file

H_dirs = find_H_dirs(parent_dir = parent_dir)

for H_dir in H_dirs:
    output_handle = os.path.join(H_dir, "my_data_parse.xml")
    for XML_file in find_XML_files(dir = H_dir):
        process_XML_file(XML_file = XML_file, output_handle = output_handle)

It may be technically less efficient, but it is much simpler to write and understand, and will be easier to expand and re-use in the future.

edit: updated output_handle as per request in the comments
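For the body of process_XML_file, the per-HSP arithmetic from the question can be pulled into small functions that are easy to check without a BLAST file. This is only a sketch: `percent_identity` and `format_hit` are hypothetical names, and the values passed in at the end are made up for illustration. In the real script they would come from the Biopython record fields used in the question (query, query_letters, title, identities, gaps).

```python
def percent_identity(identities, query_length):
    """Matching bases as a percentage of the query sequence length."""
    return float(identities) / float(query_length) * 100.0

def format_hit(gene, title, num_gaps, identities, query_length):
    """Build one comma-separated output line in the layout used in the question."""
    per_identity = percent_identity(identities, query_length)
    return "{0},{1},number_gaps: {2},per_identity: {3}\n".format(
        gene, title, num_gaps, per_identity
    )

# Made-up values: 50 matching bases out of a 100-base query, 2 gaps.
line = format_hit("geneA", "hit1", 2, 50, 100)
```

One caveat with this layout: if a hit title itself contains commas, the fields can no longer be split reliably, so a different delimiter or the csv module may be a safer choice.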


Using almost entirely the same code with some multiprocessing code in the loop will allow you to run it in parallel.
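As a rough sketch of that idea, assuming Python 3: the worker below is a hypothetical, simplified version of process_XML_file that takes only the input path and derives the output path itself, since each task passed to a pool should be self-contained. The two file paths are placeholders standing in for the list collected from the H folders.

```python
import multiprocessing
import os

def process_xml_file(xml_file):
    """Hypothetical worker: work out where the output for one XML file goes."""
    output_path = os.path.join(os.path.dirname(xml_file), "my_data_parse.xml")
    # The real parsing and writing would go here.
    return output_path

if __name__ == "__main__":
    # Placeholder paths; in practice this list comes from the H folders.
    xml_files = [
        os.path.join("H1", "blast.xml"),
        os.path.join("H2", "blast.xml"),
    ]
    # Each worker process handles a share of the files.
    with multiprocessing.Pool(processes=2) as pool:
        outputs = pool.map(process_xml_file, xml_files)
    print(outputs)
```

The `if __name__ == "__main__":` guard matters because some platforms start worker processes by re-importing the script. For a hundred small files the plain loop may well be fast enough.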

7.4 years ago
Rodrigo ▴ 190

Seems like the problem is that cmd = ['parsing_blast'] names a Python function, not an executable program. The subprocess module is for spawning processes and doing things with their input/output, not for running functions, as is explained here.
