Hello! What's the meaning of reading fastQ file alone? And how to read fastQ file with Python?
Hello! What's the meaning of reading fastQ file alone? And how to read fastQ file with Python?
Assuming typical four-line FASTQ with no funny business:
#!/usr/bin/env python
import sys
import os
def process(lines=None):
ks = ['name', 'sequence', 'optional', 'quality']
return {k: v for k, v in zip(ks, lines)}
try:
fn = sys.argv[1]
except IndexError as ie:
raise SystemError("Error: Specify file name\n")
if not os.path.exists(fn):
raise SystemError("Error: File does not exist\n")
n = 4
with open(fn, 'r') as fh:
lines = []
for line in fh:
lines.append(line.rstrip())
if len(lines) == n:
record = process(lines)
sys.stderr.write("Record: %s\n" % (str(record)))
lines = []
Edited to use Wouter's faster dict comprehension, and edited to add some error checking and to remove an unnecessary last-pass at the process function.
Example input:
$ cat test.fq
@M01757:9:000000000-AN67B:1:1101:13276:1772 1:N:0:1
TGACAGGACCAGTCACGCTTTTTCTCGGAGAAGATCAAAATCTGTCGTCTTTATTGACCATATACATAGTTCAGTCGCTGTACAACACTTATCTGAAA
+
11AAAFA1AAAFGDDF11EFGGH0F300000001D1111111D22A/BAFFG2DF1211111D222D212D222D1//B//1D21B//BF1B1F2221
Example run:
$ ./parseFastqRecords.py test.fq
Record: {'name': '@M01757:9:000000000-AN67B:1:1101:13276:1772 1:N:0:1', 'sequence': 'TGACAGGACCAGTCACGCTTTTTCTCGGAGAAGATCAAAATCTGTCGTCTTTATTGACCATATACATAGTTCAGTCGCTGTACAACACTTATCTGAAA', 'optional': '+', 'quality': '11AAAFA1AAAFGDDF11EFGGH0F300000001D1111111D22A/BAFFG2DF1211111D222D212D222D1//B//1D21B//BF1B1F2221'}
I bet this beats the crap out of the Biopython parser :-)
Because the alternative was writing for my review, I looked for some variation to the process()
function. It turns out that a dict comprehension is slightly more performant, see this SO post and replicated on my system:
{ks[idx]:val for idx, val in enumerate(lines)}
: 475 ns ± 6.99 ns per loop
{k: v for k, v in zip(ks, lines)}
: 408 ns ± 6.66 ns per loop
Nice test! I don't know why I get so much pushback in other questions when I mention that BioPython/SeqIO is a slow way to parse files. It's a useful library, no question, but it might be worth exploring why it is slow, and for devs to start working in some modern Pythonic ways to making this fast for typical use cases.
Well, I believe part of the reason is that it doesn't make assumptions. As far as I know a fastq record can be wrapped (I believe this is also what cschu181 means in the comment https://www.biostars.org/p/317524/#317602). Your code would break, Biopython wouldn't. Nevertheless, fastq files like that belong in hell.
Interestingly, if you dig down inside SeqIO, there are more basal functions that could run faster. From the source code header:
Some of these parsers are wrappers around low-level parsers which build up
38 SeqRecord objects for the consistent SeqIO interface. In cases where the
39 run-time is critical, such as large FASTA or FASTQ files, calling these
40 underlying parsers will be much faster - in this case these generator
41 functions which return tuples of strings:
42
43 >>> from Bio.SeqIO.FastaIO import SimpleFastaParser
44 >>> from Bio.SeqIO.QualityIO import FastqGeneralIterator
I wouldn't be surprised if column modes on late-80s, early-90s terminal environments (vt100 etc.) determined how FASTA and other early genomic formats were written, so that files could be inspected and worked with by humans. Perhaps FASTQ is an offshoot of those formats, in that respect.
This implementation looks like it will run very slow when reading a very large FASTQ file (with 200 million reads). A dictionary and a user-defined Python function is called for every single read.
Here is a more efficient and fast implementation of a FASTQ reader in Python. https://github.com/lh3/readfq/blob/master/readfq.py
A fastq file contains 4 lines per sequence:
@M01757:9:000000000-AN67B:1:1101:13276:1772 1:N:0:1
TGACAGGACCAGTCACGCTTTTTCTCGGAGAAGATCAAAATCTGTCGTCTTTATTGACCATATACATAGTTCAGTCGCTGTACAACACTTATCTGAAA
+
11AAAFA1AAAFGDDF11EFGGH0F300000001D1111111D22A/BAFFG2DF1211111D222D212D222D1//B//1D21B//BF1B1F2221
The header (@....
on line 1) contains assorted information such as the machine identifier, run ID and so on (it will vary by origin).
On the next line comes the sequence itself. Nothing special about this.
Following that is a +
on line 3. Again, nothing special about that. It just demarcates the sequence from the following line.
On the 4th line is the quality per-base. It uses a special encoding called "Phred" scores, which maps a quality value (0 - 40 usually), to an ASCII character. You can find out more about the encoding here for example. The score is therefore encoded by a single byte, and the ASCII code is equal to the quality score + 33.
You can read FASTQ files with BioPython, or any number of alternatives. What the right answer here is very much depends on what you want to do. BioPython would be a biblically slow way to process large FASTQ files for example.
Also for completeness sake, the 4-line per record (sequence) in FASTQ is not mandatory, but it is a practical decision as it may be a bit complicated to parse correctly "multi-line" fastq files:
It is vital to note that the ‘@’ marker character (ASCII 64) may occur anywhere in the quality string—including at the start of any of the quality lines. This means that any parser must not treat a line starting with ‘@’ as indicating the start of the next record, without additionally checking the length of the quality string thus far matches the length of the sequence.
Because of this complication, most tools output FASTQ files without line wrapping of the sequence and quality string. This means each read consists of exactly four lines (sometimes very long lines), ideal for a very simple parser to deal with.
FASTQ format is human readable. There is no trick to simply reading a fastq file.
If you really want to read FASTQ files using Python, BioPython's SeqIO module should be able to read the files.
But as the other poster said, it's going to be really slow and inappropriate for the task if you're trying to analyze a large number of FASTQ files.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Is this an assignment question?
If it is not you need to clarify what you are trying to? If you have written some code then please post that in your original question if you want to get assistance with an error or question.