Question

Use Python To List Filename With Specific Extensions.

1

13.5 years ago

Bioscientist ★ 1.7k

How can I use Python to list the filenames of all FASTQ files? Should I use os.listdir()? If so, how do I restrict it to FASTQ files?

Also, after this I want to do some further analysis on these files, e.g. zcat filenames.recal.fastq.gz | wc

How can I do such things using Python?

Thanks!

Edit: I'm writing the Python myself. My script looks like this:

#!/usr/bin/python

import os, sys,re,gzip

path = "/home/xxxx/Downloads"

for file in os.listdir(path):
  if re.match('.*\.recal.fastq.gz', file):
    text = gzip.open(file,'r').read()
    word_list = text.split()
    number = word_list.count('J') + 1
    if number != 0:
      print file

Finding the fastq.gz files works, but then I get this error:

Traceback (most recent call last):
  File "try.py", line 9, in <module>
    text = gzip.open(file,'r').read()
  File "/usr/lib/python2.7/gzip.py", line 34, in open
    return GzipFile(filename, mode, compresslevel)
  File "/usr/lib/python2.7/gzip.py", line 89, in __init__
    fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
IOError: [Errno 2] No such file or directory: 'ERR001274_1.recal.fastq.gz'

I think there's something wrong with the gzip call. Also, why can't I open ERR001274? It DOES exist. Any ideas? Thanks!

python • 55k views

updated 13.4 years ago by Giovanni M Dall'Olio 28k • written 13.5 years ago by Bioscientist ★ 1.7k

1

Try glob: import glob; print glob.glob("*.fastq"). You might also clarify your question: why not just do it on the command line?

13.5 years ago by brentp 24k

1

Command-line alternative: http://www.infoanda.com/resources/find.htm

13.5 years ago by Michael Schubert ★ 7.1k

0

Your Python code is not showing up correctly formatted; please indent it with four spaces. See: http://meta.stackoverflow.com/questions/22186/how-do-i-format-my-code-blocks

updated 5.2 years ago by Ram 44k • written 13.5 years ago by Jeroen Van Goey 2.3k

0

You need to provide the full path to the file on line 3 of your loop: the file variable in your loop is just a string holding the file's name within the Downloads folder, not its path.

13.4 years ago by David W 4.9k
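
The point in the comment above can be sketched as follows. This is a small Python 3 illustration (the scripts in this thread are Python 2), and both the helper name full_paths and the temporary directory standing in for the Downloads folder are assumptions made for the demo. It shows that os.listdir() returns bare names, which only resolve once they are joined onto the directory.

```python
import os
import tempfile


def full_paths(directory):
    """Return each entry of `directory` joined onto the directory path."""
    return [os.path.join(directory, name) for name in os.listdir(directory)]


# Hypothetical stand-in for the Downloads folder, created only for this demo.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "sample.recal.fastq.gz"), "wb").close()
    print(os.listdir(demo_dir))   # bare file name, relative to demo_dir
    print(full_paths(demo_dir))   # name joined with the directory
```

Opening a bare name only works when the script's current working directory happens to be the listed directory, which is why the original script fails with "No such file or directory".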

score 8 · Answer 1 · 2011-06-25

To find files with a specific extension, use glob.

import glob
import gzip

filenames = glob.glob('*.fastq.gz')

for filename in filenames:
    with gzip.open(filename) as f:
        data = f.read()
        number_of_characters = len(data)
        # like wc, count newline characters
        number_of_lines = data.count('\n')
        number_of_words = len(data.split())
        print "%d %d %d %s" % (number_of_lines, number_of_words, 
                               number_of_characters, filename)

Note that this naive Python implementation is many times slower than the command-line version. To speed up the Python version, read the data in 1 MB blocks.

import glob
import gzip

BLOCK_SIZE = 2**20
filenames = glob.glob('*.fastq.gz')

for filename in filenames:
    # reset the counters for each file so the printed totals are per file
    number_of_characters = number_of_lines = number_of_words = 0
    with gzip.open(filename) as f:
        for block in iter(lambda: f.read(BLOCK_SIZE), ""):
            number_of_characters += len(block)
            number_of_lines += block.count('\n')
            number_of_words += len(block.split())
        print "%d %d %d %s" % (number_of_lines, number_of_words, 
                               number_of_characters, filename)
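
A sketch of the same block-wise counting for Python 3, where gzip.open() in binary mode returns bytes, so the sentinel passed to iter() has to be b"" rather than "". The function name wc_gzip and its in_word flag are illustrative and not part of the answer above. The flag covers one case that a per-block split() misses: a word cut in two by a block boundary would otherwise be counted twice. Whitespace here means ASCII whitespace, so the word count may differ from wc in some locales.

```python
import gzip

BLOCK_SIZE = 2**20  # 1 MiB per read, as in the answer above


def wc_gzip(filename, block_size=BLOCK_SIZE):
    """Return (lines, words, bytes) for a gzip file, reading it block by block."""
    lines = words = size = 0
    in_word = False  # True if the previous block ended in the middle of a word
    with gzip.open(filename, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            size += len(block)
            lines += block.count(b"\n")
            words += len(block.split())
            # A word straddling two blocks was counted once in each; undo one.
            if in_word and not block[:1].isspace():
                words -= 1
            in_word = not block[-1:].isspace()
    return lines, words, size
```

The three numbers come back in the same order that wc prints them, and the size is the uncompressed byte count.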

score 1 · Answer 2 · 2011-06-25

Hey,

This is really a 'pure' python question, and probably better asked at Stack Overflow or similar. But this answer is easy enough. Using a list comprehension:

import os
fastq = [f for f in os.listdir('.') if f.endswith('.fastq')]

EDIT: forgot the other bit of your question. You should be able to work this one out - look at the gzip module to read your files, then loop through the lines (I presume you want wc -l) either using count += 1 for each line or enumerate() to get a counter running.
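
A minimal sketch of the enumerate() approach described above, assuming Python 3; count_lines is a hypothetical helper name.

```python
import gzip


def count_lines(filename):
    """Count the lines of a gzip-compressed text file, one line at a time."""
    count = 0
    # "rt" opens the file in text mode, so iteration yields decoded lines.
    with gzip.open(filename, "rt") as handle:
        for count, _line in enumerate(handle, start=1):
            pass  # enumerate keeps the running line number in `count`
    return count
```

Unlike wc -l, which counts newline characters, this also counts a final line that has no trailing newline.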

score 0 · Answer 3 · 2011-06-27

First: do not use 'file' as a variable name. In Python 2, 'file' is the name of a built-in type; if you overwrite it, you can get weird behavior.

Second: you need to provide the correct path to the gzip file, by prepending the value of the path variable to the file name.

Third: it is better to use the glob module as suggested in another answer.

import os, sys,re,gzip

path = "/home/xxxx/Downloads"

for filename in os.listdir(path):       # Do not use 'file' as a variable name
  if re.match(r'.*\.recal\.fastq\.gz$', filename):
    text = gzip.open(path + '/' + filename,'r').read() # You need to attach 'path' to the file name
    word_list = text.split()
    number = word_list.count('J') + 1
    if number != 0:
      print filename
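
For the third point, here is a sketch (Python 3, with the hypothetical helper name find_recal_fastq) of how glob can replace the os.listdir() plus re.match() pair. When the pattern includes the directory, glob.glob() returns paths that already include it, so they can be passed straight to gzip.open().

```python
import glob
import os


def find_recal_fastq(directory):
    """Return paths of the *.recal.fastq.gz files directly inside `directory`."""
    pattern = os.path.join(directory, "*.recal.fastq.gz")
    # glob.glob() gives no ordering guarantee, so sort for a stable result.
    return sorted(glob.glob(pattern))
```

Called as find_recal_fastq(path) with the path variable from the script above, this would yield the names to loop over without any separate path concatenation.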