Use Python To List Filenames With Specific Extensions
13.5 years ago
Bioscientist ★ 1.7k

How can I use Python to list the filenames of all FASTQ files? Should I use os.listdir()? If so, how do I restrict it to FASTQ files?

Also, after this I want to do some further analysis on these files, e.g. zcat filenames.recal.fastq.gz | wc

How can I do such things using Python?

Thanks!

Edit: I'm writing the Python myself. My script looks like this:

#!/usr/bin/python

import os, sys,re,gzip

path = "/home/xxxx/Downloads"

for file in os.listdir(path):
  if re.match('.*\.recal.fastq.gz', file):
    text = gzip.open(file,'r').read()
    word_list = text.split()
    number = word_list.count('J') + 1
    if number != 0:
      print file

Finding the fastq.gz files works, but I get this error:

Traceback (most recent call last):
  File "try.py", line 9, in <module>
    text = gzip.open(file,'r').read()
  File "/usr/lib/python2.7/gzip.py", line 34, in open
    return GzipFile(filename, mode, compresslevel)
  File "/usr/lib/python2.7/gzip.py", line 89, in __init__
    fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
IOError: [Errno 2] No such file or directory: 'ERR001274_1.recal.fastq.gz'

I think there's something wrong with the gzip call. Also, why can't I open ERR001274? It DOES exist. Any ideas? Thanks!


Try glob: import glob; print glob.glob("*.fastq"). You might also clarify your question: why not just do it on the command line?


Command-line alternative: http://www.infoanda.com/resources/find.htm


Your Python code is not showing up correctly formatted; please indent it with four spaces. See: http://meta.stackoverflow.com/questions/22186/how-do-i-format-my-code-blocks


You need to provide the full path to the file on line 3 of your loop. The file variable in your loop is just a string holding the file name within the Downloads folder, not its full path.
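To illustrate the point about full paths, here is a minimal sketch, assuming Python 3 (the script in the question is Python 2). The function name list_matching_paths and its suffix default are made up for this example. os.listdir() returns bare names, so each one is joined with the directory before use:

```python
import os


def list_matching_paths(directory, suffix=".recal.fastq.gz"):
    """Return full paths of regular files in `directory` ending with `suffix`.

    Hypothetical helper for illustration; not part of any library.
    """
    paths = []
    for name in sorted(os.listdir(directory)):
        # os.listdir() yields names only, so build the full path explicitly.
        full_path = os.path.join(directory, name)
        if name.endswith(suffix) and os.path.isfile(full_path):
            paths.append(full_path)
    return paths
```

The returned paths can be handed straight to gzip.open(), regardless of the directory the script is started from.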

13.5 years ago

To find files with a specific extension, use glob.

import glob
import gzip

filenames = glob.glob('*.fastq.gz')

for filename in filenames:
    with gzip.open(filename) as f:
        data = f.read()
        number_of_characters = len(data)
        # the last line usually has no '\n' so add 1 to count
        number_of_lines = data.count('\n') + 1
        number_of_words = len(data.split())
        print "%d %d %d %s" % (number_of_lines, number_of_words, 
                               number_of_characters, filename)

Note that this naive Python implementation is many times slower than the command-line version. To speed up the Python version, read the data in blocks of 1 MB.

import glob
import gzip

BLOCK_SIZE = 2**20
filenames = glob.glob('*.fastq.gz')

for filename in filenames:
    # reset the counters for each file so the totals do not accumulate
    number_of_characters = number_of_lines = number_of_words = 0
    with gzip.open(filename) as f:
        # read 1MB at a time until read() returns an empty string
        for block in iter(lambda: f.read(BLOCK_SIZE), ""):
            number_of_characters += len(block)
            number_of_lines += block.count('\n')
            # note: a word split across two blocks is counted twice
            number_of_words += len(block.split())
        print "%d %d %d %s" % (number_of_lines, number_of_words,
                               number_of_characters, filename)
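For anyone on Python 3, here is a hedged sketch of the same block-wise idea. It assumes binary mode, so read() returns bytes and the "characters" are really bytes, as with plain wc. The name wc_gzip and the in_word flag are my own additions; the flag keeps a word that straddles two blocks from being counted twice:

```python
import gzip

BLOCK_SIZE = 2**20  # 1 MB


def wc_gzip(path, block_size=BLOCK_SIZE):
    """Return (lines, words, bytes) for a gzip file, read block by block.

    Hypothetical helper for illustration; not part of the gzip module.
    """
    lines = words = size = 0
    in_word = False  # True if the previous block ended in the middle of a word
    with gzip.open(path, "rb") as handle:
        # In binary mode read() returns b"" at end of file.
        for block in iter(lambda: handle.read(block_size), b""):
            size += len(block)
            lines += block.count(b"\n")
            pieces = block.split()
            words += len(pieces)
            # The first piece continues the previous block's last word.
            if in_word and pieces and not block[:1].isspace():
                words -= 1
            in_word = not block[-1:].isspace()
    return lines, words, size
```

Calling wc_gzip('ERR001274_1.recal.fastq.gz') should give the same three numbers as zcat piped into wc, assuming the file is plain ASCII text.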
13.5 years ago
David W 4.9k

Hey,

This is really a 'pure' Python question, and probably better asked at Stack Overflow or similar, but the answer is easy enough. Using a list comprehension:

import os

fastq = [f for f in os.listdir('.') if f.endswith('.fastq')]

EDIT: forgot the other bit of your question. You should be able to work this one out - look at the gzip module to read your files, then loop through the lines (I presume you want wc -l) either using count += 1 for each line or enumerate() to get a counter running.
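In case it helps, the enumerate() approach could look roughly like this. It is a sketch assuming Python 3, where gzip.open() accepts the text mode "rt"; count_lines is just a name picked for the example:

```python
import gzip


def count_lines(path):
    """Count the lines of a gzip-compressed text file, like `zcat path | wc -l`.

    Hypothetical helper for illustration.
    """
    count = 0
    with gzip.open(path, "rt") as handle:
        # enumerate() keeps a running counter; start=1 makes the last value
        # equal to the number of lines seen.
        for count, _line in enumerate(handle, start=1):
            pass
    return count
```

Iterating over the file object reads one line at a time, so the whole file never has to fit in memory.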

13.5 years ago

First: do not use 'file' as a variable name. 'file' is the name of a Python built-in (in Python 2), and overwriting it can lead to confusing behavior.

Second: you need to provide the correct path to the gzip file, concatenating the value of the path variable.

Third: it is better to use the glob module as suggested in another answer.

import os, sys,re,gzip

path = "/home/xxxx/Downloads"

for filename in os.listdir(path):       # Do not use 'file' as a variable name
  if re.match('.*\.recal.fastq.gz', filename):
    text = gzip.open(path + '/' + filename,'r').read() # You need to attach 'path' to the file name
    word_list = text.split()
    number = word_list.count('J') + 1
    if number != 0:
      print filename
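As a sketch of the glob variant from the third point, assuming Python 3: glob.glob() returns paths that already include the directory, so nothing has to be concatenated before opening. The function name files_with_word is hypothetical. I am also assuming the intent was to print files in which 'J' occurs at least once as a whitespace-separated word; in the script above, the "+ 1" makes the "number != 0" test always true.

```python
import glob
import gzip
import os


def files_with_word(directory, word, pattern="*.recal.fastq.gz"):
    """Return paths matching `pattern` in `directory` whose text contains `word`.

    Hypothetical helper; `word` must appear as a whitespace-separated token.
    """
    hits = []
    # glob.glob() returns paths that include `directory`.
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with gzip.open(path, "rt") as handle:
            # any() stops reading as soon as one matching line is found.
            if any(word in line.split() for line in handle):
                hits.append(path)
    return hits
```

With the path from the question, the call would be files_with_word("/home/xxxx/Downloads", "J").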