Code: https://github.com/JohnLonginotto/uq/
Install: If you already have numpy, there's no installation at all. Just run the script. (pypy compatible)
Job: uQ re-encodes FASTQ files into binary, in such a way that you can compress the output with your favourite compression program and not have to rely on some other sketchier ones that may or may not be around in 10 years from now. It's only dependency is numpy, however I recommend you run it in pypy for speed.
Motivation: I've noticed a lot of people who have made FASTQ compression tools make large assumptions about the data (only 4 bases + N, N must have specific quality scores, doesn't work with variable-length FASTQ entries, etc. uQ is a different sort of philosophy, as it focuses solely on the encoding, removing unnecessary entropy, and finding the best encoding for your compressor - whatever it is. As a result, it often results in the smallest compressed files out of all FASTQ compressors. In an ideal world, someone would write a compressor for uQ files (or more specifically, structured tables in general)
Please check out the README on GitHub for more infos :)
Update: Now uses CFFI and a C memory array instead of Numpy array to speed up encoding by around 25x. This speeds up the encoding of small/large FASTQ files by roughly 5-25x (depends on how much of the work is encoding).
So
uQ --sort
would be equivalent toclumpify.sh
defaults from BBMap?Hm, good question - the DNA is encoded in some fixed-length binary, so "A" might become "00", C "01", G "10" and T "11" like 2bit, and thus --sort would sort from AAAAAAA to TTTTTTTT. However, the actual binary used is flexible, and depends on the data, so A might not always be '00'. Also if there's variable-length data, smaller sequences will sort before longer ones. If there's an "N" or some other characters, the encoding itself might change, and A will be "000" or something else entirely (it's dependant on frequency of the base - the highest frequency has the lowest encoding.
So now that you mention it, yes, --sort DNA does not mean the data will come back in ASCII sorted order, but in the sorted order of the binary encodings used.
To my knowledge, clumpify uses kmers and sort of aligns DNA sequences to find an order that is clustered but not sorted. I think. I still have to test mapping speeds of clumpify'd FASTQ and --sort DNA'd FASTQ.
I suspect that Clumpify would be be additive to the compression of uQ, and that it when using them together, it would be best to run uQ on Clumpified data without using the --sort flag. Clumpify sorts by internal kmers rather than the prefix, so that overlapping reads end up nearby even if they don't have the exact same start location.
It's kind of hard to say, though. I don't know what effects a binary encoding will have on gzip, since I think gzip operates on a byte level and might not be able to take advantage of a 3-bit encoding (meaning, realize that two different 3-bit encoded strings of nucleotides are partially redundant if they start at a different position).
I think the answer is yes that one has to start with uncompressed fastq files for uQ (piped input acceptable?).
Yup - unfortunately. Figuring out what the minimum safest binary codes that can be used for the input data requires reading all the data once, then knowing, ok, there's only 4 characters of DNA, i can use 2 bits, or OK, there are 10 quality scores so i can use 3 bits, etc. Ideally this information would be known prior when making the uQ file (say, on an illumina machine) and thus the checks can be skipped, but as a safe/reliable way to compress data for long-term storage, those checks are unfortunately necessary and thus a plain fastq must be provided :(
....although, now that i think about it, it would be possible to do if one just said DNA is always stored in 3bits, quality values always in 6bits. That's true 99.999% of the time. Then the encoding could be hard-coded in an only 1 pass through the data is needed. Perhaps i could add a --onepass parameter. Adding bits to symbols doesn't increase the compressed filesize by much at all (in fact sometimes improves it), but it will increase the uncompressed file size, which is what I focused on as for whatever reason i wanted 100 fastq files in memory at once. But a --onepass would be easy enough to put in. Hm. You always ask great questions genomax :) Thank you
I am always thinking about how normal users can use these programs. Biggest uncompressed NovaSeq data is ~300G each (R1/R2) for S1 flowcells. Since S4 flowcells (coming later this year) are supposed to put out 4-5x more data we may have uncompressed R1/R2 sequence files, each of which could be a full TB!
What about just throwing away seqs with N's? How much information would really be lost vs. space saved?
Oo, that's a tough one to answer. Certainly, Ns take up huge amounts of space, as uQ for repetitive datasets will store only the unique occurrences of the data, and a lot of "uniqueness" is caused by sequencing error. I'll test a with/without on some large files tonight to see. But on the other hand, leaving the bad data in a dataset might allow future researchers to better judge the quality of the good data. If we only ever save good data, we'll end up in trouble :) At the very least, raw data isn't hiding anything from you - I took out all the lossy modes from uQ like quality score digitisation etc, because I don't really know what sort of effect that's going to have downstream. There's already lossy FASTQ tools out there that write to FASTQ. Then again, uQ could at least store summary metrics on the reads it removed in the header of the file, which would be better than the current invisible delete FASTQ currently has. I'm conflicted. Will decide based on the data tonight :)
dsrc (DNA Sequence Reads Compressor), a very high performance one, could be added for the comparison.
Thanks shenwei - added :)
Heng Li had mentioned fqzcomp from this paper. If you are getting set up to do comparisons.
fqzcomp was already in the comparison, but I haven't tried fastqz yet! :)
More tools :P
So far it appears uQ results in FASTQ files smaller than all of the above tools, although doing a strict comparison is not easy as some tools have different use-cases and bonus features. But for just smallest file size, I think uQ + vanilla LZMA is the smallest. I'm sure one could tune the LZMA settings to get even better compression, but at these ratios where we're talking about a hand full of Megabytes, so who cares :) Mainly I just wanted to show that if FASTQ was uQ to begin with, then all of the publications regarding FASTQ compression wouldn't have been necessary to begin with.
Hi John,
Have you considered making a version that does both binary translation and compression? For example, what I would consider to be particularly useful is something like...
uq -in file.fq -out compressed.fq.uq -compressor gzip
or
uq -in file.fq -out compressed.fq.uq -compressor lzma
...or whatever. Where the compressed.fq.uq stores the codec in its header so you can just run something like:
uq -d -in compressed.fq.uq -out original.fq
...and get the results you are expecting. Assuming, of course, the codec is either in the command line or present in the uQ package. I think those additions might generally make it easier to adopt into pipelines (and, if you're interested, I'd be happy to add support into BBTools if it can act like a traditional stream compressor like gzip, regardless of what it does under the hood).
I'm also wondering... assume that most fastq files can be stored in 3 bits per base. Have you tried packing them as 6 bits (2 bases) per byte instead of 2.666 bases per byte? I think most compressors are byte-oriented and won't understand information that spans bytes with different bit-alignment (so, ACGT would not share any compression with AACGT, for example, despite the shared sequence). I might be completely wrong, of course, so I'd like to hear your opinion.
Hey Brian! sorry for the 4 day delay, i have been frantically writing up this tool in a small section in my thesis, and i thought it would perhaps be best if I sent you the link to those pages so you can totally get it, since i know the README doesn't cover the specifics much, and the code is possibly the worst i've ever written (and you'll know why when you read the section).
It's a few pages but its got a lot of pictures. I will post it in like an hour or something. i'm writing this post really to force myself to get something out there. would also apprechiate feedback if you have the time - like if i say anything that isn't correct or something. that's all i ask :) I think that's MIT licence actually, haha.
Is uQ pair-aware? Or do I have to re-pair the files after converting them back to fastq?
This question made me download
uQ
. No mention of paired-end anything anywhere so it is likely not PE aware.@John: Can you add what python version is needed for
uQ
along with how to include/install CFFI module in the README?Hello both! The inclusion of the CFFI dependency from the latest github push has made me rethink how i'm going to recommend people get dependencies in the future. I'm going to write up a guide right now for the installation of CFFI/pypy as genomax recommends.
To answer your question h.mon, it is not truely pair-wise aware, because it is not fragment aware. It is, just like FASTQ, only read-aware, so it has all of the same limitations as FASTQ (I actually built uQ to sort of highlight that point - you can make a great encoding, but if your model is fundamentally wrong your format will not perform very well in real world examples).
On the otherhand, QNAME data in uQ is handled differently from FASTQ. In FASTQ it's a string, delimited by a newline. In uQ it's assumed all the QNAME data forms a non-sparse table, with some random delimiters. Most "big data" from SRA or Illumina/etc is like this. So the columns of data in all the QNAMEs are parsed, data is re-encoded, and put into a fixed-length array. Depending on whether you use --raw or not depends if that array is sorted by the sort of the file (original QNAME order), or sorted on binary encoding with a separate index array to retrieve the original order (see numpy.unique for details).
Bottom line, if you know how your QNAME encodes fragment info (same QNAME, /1 or /2 on the end, etc) you can go from read-DNA-sequence to mate-DNA-sequence pretty quickly without any special indexing, just binary searches. You can make this easier or harder to do depending on how you set up the uQ file. If you use "--sort QNAME --raw QNAME" when making the file for example, a single binary search in the QNAME table would find the indices of the QNAMEs, which can be used to pull out data from the DNA and QUAL tables. But even without --sort/--raw tricks, you could figure a way to do it quickly with just the tables and the sort order you have. It will never be as efficient as a fragment-centric format could, but it's a lot better than what you'll get with FASTQ. Resorting a 40Gb FASTQ file takes about 8 seconds in uQ, so it doesn't really matter how the data was originally encoded. You can start off by coercing the data as part of your app, which won't take long. That's the power of structs/numpy :)
The same techniques could be used in reverse to find QNAMEs that have a given DNA sequence. This could be used to map a single unique DNA sequence from the DNA table, then apply the mapping position(s) to all QNAMEs that use that DNA sequence (make a new column called start position, say). There's lots of trick you can do, once all your data is in a fixed-length table. And in total it takes 1/3rd to 1/4er of the space as FASTQ in RAM, decompressed.