Initially my problem was to extract all entries from a FASTQ file with names not present in a FASTA file. Using biopython I wrote:
from Bio.SeqIO.QualityIO import FastqGeneralIterator
corrected_fn = "my_input_fasta.fas"
uncorrected_fn = "my_input_fastq.ftq"
output_fn = "differences_fastq.ftq"
corrected_names = []
for line in open(corrected_fn):
if line[0] == ">":
read_name = line.split()[0][1:]
corrected_names.append(read_name)
input_fastq_fn = uncorrected_fn
corrected_names.sort()
handle = open(output_fn, "w")
for title, seq, qual in FastqGeneralIterator(open(input_fastq_fn)) :
if title not in corrected_names:
handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
handle.close()
Problem is, it is very slow. On 2Ghz workstation starting from a local disc it can take two days per pair of files:
- 4870868 seqs in FASTQ
- 4299464 seqs in FASTA
Removing title from corrected_names
speeds up things a bit (this version I used for running).
Am I doing something obviously silly or simply FastqGeneralIterator
is not a best construct to use here? While I like Python best, I am open to answers in Perl/Ruby.
Slicing and dicing FASTQ files based on lists seems to be fairly common task.
Edit: Python 2.6.4, biopython 1.53, Linux Fedora 8.
Edit 2:
- corrected one line of code, see comment to giovanni
- code snippet taken from: http://news.open-bio.org/news/2009/09/biopython-fast-fastq/
input_fastq_fn is not defined (in the 4th-to-last line)
@giovanni: while cleaning up script for posting I cut one line to many (input_fastq_fn = uncorrected_fn) see also Edit 2 in the post
well, you don't need the input_fastq_fn variable anyway, you can use uncorrected_fn directly.