Hi there folks, I would like to share a little utility that I built just a couple of days ago. Its main purpose is to convert a bunch of Illumina fastq files into a single sqlite3 database without pain. I am new to bioinformatics and not really sure whether this way of processing reads makes sense for biologists, but this code actually helped my friend build a custom processing pipeline on top of a sqlite3 database. Any feedback would be appreciated, thank you)
Questions will be summarized and answered right here.
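For readers who want to picture what the tool does, here is a minimal sketch of the general idea - reading a FASTQ four lines at a time and loading it into SQLite. This is not the actual brew/weld code; the table name, column names and file name are made up for illustration.

```python
import sqlite3
from itertools import islice

def fastq_records(path):
    """Yield (read_id, sequence, quality) tuples from a FASTQ file."""
    with open(path) as fh:
        while True:
            chunk = list(islice(fh, 4))          # @id, sequence, '+', quality
            if len(chunk) < 4:
                break
            yield chunk[0].strip().lstrip("@"), chunk[1].strip(), chunk[3].strip()

def load_fastq(db_path, fastq_path, table="reads"):
    con = sqlite3.connect(db_path)
    con.execute(f"CREATE TABLE IF NOT EXISTS {table} "
                "(read_id TEXT, sequence TEXT, quality TEXT)")
    with con:  # a single transaction keeps the bulk insert fast
        con.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)",
                        fastq_records(fastq_path))
    con.close()

load_fastq("reads.db", "forward.fastq")  # hypothetical file name
```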
I don't mean to rain on your parade, but I see two problems:
Having fastq reads in a sqlite database doesn't make much sense from a workflow perspective. There are dozens of tools that operate on standard fastq files, and really, those fastqs are just an intermediate step for storing data until it's aligned into a bam file.
Namespace collision: If you're on MacOSX, there is already a very popular program called 'brew' for installing packages (see this)
So kudos to you for writing some bioinformatics code and releasing it publicly! Next time, maybe examine the problem space a bit more before spending too much time on code.
Thank you Chris for your feedback, really appreciated. The main purpose of brew is to create a backend for custom filtering and processing, without relying on any existing utilities. Brew yields the data in an easily accessible form, so you can write cleaner code to manipulate it. The closest analogy I can imagine is the RAW format produced by high-end cameras: it contains the signal from the sensor as-is, in an easy-to-manipulate form. About the name collision - you are right, I'll fix it.
It seems that what your tool does is flatten three files into one, like so:
Id Read1 Read2 Barcode
So why not just put this info into a single flat file and get higher performance for both storing and querying it? What benefit does SQLite offer over that?
Why would you even create three tables with foreign keys etc. when all the information would fit into one table?
Excellent question) The idea was to store lots of fastqs in the same db and have complex cross-links between them. Also, the idea of SQL VIEWs is very promising - instead of "hardcoded" tables you can have code-driven virtual ones, which is a very convenient way of interpreting the data and does not consume additional space.
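To make the VIEW idea concrete, here is a rough sketch assuming per-file tables named forward_reads, reverse_reads and barcode_reads already exist (hypothetical names, each with read_id and sequence columns): the "paired" table is defined once in SQL and computed on demand, so it costs no extra storage.

```python
import sqlite3

con = sqlite3.connect("reads.db")
con.executescript("""
CREATE VIEW IF NOT EXISTS paired_reads AS
SELECT f.read_id  AS read_id,
       f.sequence AS forward_seq,
       r.sequence AS reverse_seq,
       b.sequence AS barcode
FROM forward_reads f
JOIN reverse_reads r USING (read_id)
JOIN barcode_reads b USING (read_id);
""")

# The view is queried like any table, but rows are assembled on the fly.
for row in con.execute("SELECT * FROM paired_reads LIMIT 5"):
    print(row)
```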
How complex do you expect the connections among FASTQs generated for different experiments to be? Views are a nice concept, but what they save in space they consume in compute. Normalization (the RDBMS concept) applied to FASTQ is an additional burden that leads to very low gains - the ROI is not that good. I'm still trying to understand what need SQL-izing FASTQ could possibly serve.
You use fastq but you don't keep the base quality data? Why's that?
Since you're writing in Python, you could use this up2bit format to store the DNA as INTs rather than VARCHARs :)
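For anyone unfamiliar with the suggestion, here is a toy illustration of the general 2-bit idea - not the up2bit format itself, whose exact layout I won't try to reproduce here. Note it only handles A/C/G/T, which is exactly the N problem raised further down the thread.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq):
    """Encode an ACGT-only sequence as an integer, two bits per base."""
    value = 0
    for i, base in enumerate(seq):
        value |= CODE[base] << (2 * i)      # raises KeyError on N
    return value, len(seq)

def unpack(value, length):
    return "".join(BASES[(value >> (2 * i)) & 3] for i in range(length))

packed, n = pack("GATTACA")
assert unpack(packed, n) == "GATTACA"
```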
SQLite is really bad for anything parallel. If you want to store data in a verbose format for the benefits of SQL, you might as well go all-out and use something like Postgres for the parallel read/write, better indexing, remote access, etc. SQLite databases are usually significantly smaller than Postgres, though.
Instead of recording a barcode for each read and JOINing, it would be better to make a table for each unique fastqFile-barcode combination. A search for reads with barcode X would be performed by only accessing table names ending with -X.
It would probably be better to put R1 and R2 in the same table too, so you don't double up on "read_identifier"s unnecessarily; plus they're typically accessed at the same time anyway, so that's half the lookups.
Your friend is doing something strange... you guys should have a talk about it.
I also agree with Chris on this that 'brew' is already taken in my brain - why not FASQL or something -_-;
You can parse lots of data from fastq files, including the quality strings.
Hmm, that is an interesting idea. Since you have only four characters in the alphabet, two bits would be sufficient to encode each one.
Surely Postgres would be faster, but I need a "no server" solution. Also, a sqlite db can be sent to another person or shared somehow. About read performance - that is up to the application: if you don't write to the db, you can actually have several read workers in parallel.
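A small sketch of the "several read workers" point, assuming the hypothetical reads table from earlier: each worker opens its own read-only connection (a file: URI with mode=ro), which is safe as long as nothing is writing.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

DB = "reads.db"

def count_with_prefix(prefix):
    # One read-only connection per worker; SQLite allows many concurrent readers.
    con = sqlite3.connect(f"file:{DB}?mode=ro", uri=True)
    try:
        (n,) = con.execute("SELECT COUNT(*) FROM reads WHERE sequence LIKE ?",
                           (prefix + "%",)).fetchone()
        return prefix, n
    finally:
        con.close()

with ThreadPoolExecutor(max_workers=4) as pool:
    for prefix, n in pool.map(count_with_prefix, ["AAAA", "CCCC", "GGGG", "TTTT"]):
        print(prefix, n)
```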
This utility doesn't make any assumptions about your data, so there is no such thing as a barcode in terms of brew. All data is presented in the same table format; interpretation belongs to the application level.
Again, the idea was not to make any assumptions about the data itself; any presentation logic should be described at the "code" level - SQL or some ORM.
We are interested in arbitrary data manipulation by hand-written code; that is why we need decent storage.
Any kind of filtering by predicate would be much faster in the database than in your application code.
JOINs obviously.
Fast VIEWs that you can stack on top of one another.
Besides, you don't need to think of this db as an SQL db. Connect to it from Python with some ORM, like SQLAlchemy, and you get an object layer mapped to the db data.
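A sketch of the SQLAlchemy route (assuming SQLAlchemy 1.4+ and the same hypothetical reads table): map it to a class once and then filter reads as Python objects instead of writing SQL by hand.

```python
from sqlalchemy import Column, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Read(Base):
    __tablename__ = "reads"
    read_id = Column(String, primary_key=True)   # the ORM needs a primary key
    sequence = Column(String)
    quality = Column(String)

engine = create_engine("sqlite:///reads.db")
with Session(engine) as session:
    matches = (session.query(Read)
                      .filter(Read.sequence.like("ACGT%"))
                      .limit(10)
                      .all())
    for r in matches:
        print(r.read_id, r.sequence[:20])
```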
I've seen a lot of filtering involving BAM files, very few involving FASTQ. SQL is like a medical drug patent - the first pill costs millions of dollars, the second pill costs exponentially less. In other words: creating the SQLite database is expensive and the first query of any kind is expensive, but as indexes are built, subsequent queries get faster. With well-established FASTQ indexing algorithms, plus the fact that FASTQ is plain text while SQLite is not, I don't see the immediate benefit.
You had a whole lot of great questions and criticisms that I would like to summarize. For me, the main reason to use a database is to avoid data mapping at the application level - try mapping forward reads to reverse reads to barcodes in memory in your Python, C++ or Matlab code. SQLite in this case is an optimized storage and query engine; it should not know anything about the nature of your data (that is why all files are represented as the same kind of table), but it must be able to join/filter your data as fast as possible.
Why not a single table?
Because you can have an arbitrary number of converted files in your db (from different experiments, for example), and you don't even have to add all of them at the same time. Making a single table out of the files is equivalent to making certain assumptions about the structure of the data, and one of the key ideas was to avoid that. Data interpretation should be done at the application level - with SQL's virtual tables known as VIEWs, or with an object-relational mapper in your favorite programming language.
Why sqlite3?
It is rather fast, it doesn't need a server, it accepts SQL as its query language, it can be manipulated via ORMs, plus you can send or share the database with a friend.
Why not a flat file?
If you use a flat format - a simple text file - you are extremely limited in processing capabilities: to find a particular record you would have to scan the whole file. sqlite uses b-trees and sophisticated caching techniques, so it is way faster (see the sketch after this summary).
Name collision.
Renamed to weld.
Why not use the up2bit format to store sequence information?
There are two reasons:
two bits are not enough to encode a five-character alphabet - A, T, G, C and N (the representation of "any" nucleotide); and
filtering sequences by substring would not be native in this case, and therefore less efficient.
What kinds of tasks can be done efficiently in sqlite?
Any kind of filtering by predicate - for example demultiplexing (sketched below).
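Here is the sketch mentioned above - demultiplexing as a predicate query over a hypothetical reads table that stores a barcode column. Building the index once turns barcode lookups into b-tree searches instead of full scans.

```python
import sqlite3

con = sqlite3.connect("reads.db")
con.execute("CREATE INDEX IF NOT EXISTS idx_reads_barcode ON reads (barcode)")

def demultiplex(barcode):
    """Return (read_id, sequence) pairs carrying the given barcode."""
    return con.execute("SELECT read_id, sequence FROM reads WHERE barcode = ?",
                       (barcode,)).fetchall()

sample_a = demultiplex("ATCACG")   # the barcode value is made up
print(len(sample_a), "reads carry barcode ATCACG")
```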
Fastq is often gzip'd (several times smaller than your DB) and can be streamed efficiently (we rarely load a whole fastq into RAM). Your DB is mainly useful for pulling a few reads by their names. For most other applications, it can't compete with streaming on performance and simplicity. I would suggest you learn the current practice before defending your approach. People in this thread all know SQL/SQLite fairly well. We don't use it because it is often not as good as our current approach.
Ok, we appreciate that you are attempting to improve on a process and write a tool to handle FASTQ formats. Don't take the following (and the criticism in general) as discouragement.
It is just that you are solving a problem by first oversimplifying it - hence the many opinions and somewhat of a push-back. For example, you say how sqlite is fast and has a query language - well, it is fast when it operates on indexed columns, but you are not actually doing that. Your barcode lookup, for example, is slower than grepping through a file. You could of course add an index, but then your single-insert speeds would be terrifyingly slow. Now you'd need to solve it as a bulk insert problem. All doable, and that would have value. What has less value is thinking that just putting some data into a database immediately solves a large suite of problems that people typically encounter.
Now that I have looked at your program, it is more of an Illumina read metadata storage system. The tool seems to parse out and store a lot of information from the Illumina read naming scheme itself. That may have more value from a QC point of view, perhaps, and would be something most tools don't actually do, hence it could be built into something more valuable.
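To illustrate what "parsing the Illumina read naming scheme" buys you, here is a sketch assuming the post-Casava-1.8 header layout; the example header itself is made up.

```python
def parse_illumina_name(header):
    """Split an '@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index' header."""
    name, _, description = header.lstrip("@").partition(" ")
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    read_num, filtered, control, index = description.split(":")
    return {
        "instrument": instrument, "run": int(run), "flowcell": flowcell,
        "lane": int(lane), "tile": int(tile), "x": int(x), "y": int(y),
        "read": int(read_num), "filtered": filtered == "Y",
        "control": int(control), "index": index,
    }

info = parse_illumina_name("@M00123:45:000000000-A1B2C:1:1101:15589:1389 1:N:0:ATCACG")
print(info["lane"], info["tile"], info["index"])
```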
Ok, how would you grep forward.fastq, reverse.fastq and barcode.fastq to actually obtain a file containing triplets like forward-reverse-barcode? Show me the code, Sir) Or maybe awk or sed can do that somehow - please enlighten me, do not hesitate. Performance is not THAT big a deal here; just clone the code and try it - the build process actually takes a while, but it is worth it. The problem of bulk inserts is already solved, see the code. With weld (yes, the utility is now named weld) you can parse any information declared by the standard, but the resulting db is huge. Summary: the db structure, even without indexes, is designed for high performance - at least way higher than grep can offer in tasks where you need not only to find something but also to map things to each other.
I have not seen people put barcodes in a separate file. Forward and reverse reads conventionally have the same number of lines: the i-th read in forward.fa is paired with the i-th read in reverse.fa. You just read the two files in parallel. You may also use paste, like paste forward.fa reverse.fa | grep -A3 ^@read_name, or create an interleaved fastq stream. If your forward.fa and reverse.fa are not properly paired, the upstream process is doing something wrong.
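In that spirit, here is a sketch of the triplet extraction asked for above, extended to the barcode file: since the three FASTQs (hypothetically forward.fastq, reverse.fastq and barcode.fastq) are kept in the same order, iterating over them in parallel yields id/forward/reverse/barcode tuples directly.

```python
from itertools import islice

def records(path):
    """Yield (read_id, sequence) from a FASTQ file, four lines per record."""
    with open(path) as fh:
        while True:
            block = list(islice(fh, 4))
            if len(block) < 4:
                break
            yield block[0].split()[0].lstrip("@"), block[1].strip()

def triplets(fwd, rev, bc):
    # Files are synchronized, so the Nth record of each belongs to the same read.
    for (rid, f), (_, r), (_, b) in zip(records(fwd), records(rev), records(bc)):
        yield rid, f, r, b

for rid, f, r, b in islice(triplets("forward.fastq", "reverse.fastq", "barcode.fastq"), 3):
    print(rid, f[:10], r[:10], b)
```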
Mmmmm... the data I was given is for some reason always represented as three separate files - forward, reverse and barcode. Essentially, putting them together was the problem. The official documentation claims that output files should follow the name pattern <sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits)>.fastq.gz. In my case I got 2 sequence files with 6 barcodes, and the barcodes themselves in a separate file. Should I consult with my data provider?
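As an aside, matching that documented file-name pattern is straightforward; the regex and example name below are only illustrative.

```python
import re

PATTERN = re.compile(
    r"(?P<sample>.+)_(?P<barcode>[ACGTN+-]+)"
    r"_L(?P<lane>\d{3})_R(?P<read>\d)_(?P<set>\d{3})\.fastq\.gz$")

m = PATTERN.match("SampleA_ATCACG_L001_R1_001.fastq.gz")   # made-up file name
if m:
    print(m.group("sample"), m.group("barcode"),
          int(m.group("lane")), m.group("read"))
```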
You only get three files when the experimenter makes use of nonstandard barcodes. In the vast majority of situations, when standard barcodes are used, the instrument already separates the samples into files labeled by sample (barcode), and the barcode files themselves are not even distributed to the end user.
When the barcodes are more special and custom-made (for example, some people use longer, shorter or uneven-length barcodes), at the end of the BCL conversion you will get three files - and these should all be in the same order.
As Heng has already mentioned, the files are always ordered the same way, hence getting the Nth pair simply means iterating over all input files in parallel. Your example use case of getting triplets from each file can be done by chaining a few command-line tools, or in about 10 lines of Python if you want to write it readably (probably a one-liner in Perl ;-) )
This discussion, though, is enlightening in making one think about what could be stored about sequencing reads if they were longer, like MinION reads, where there could be a lot more information about each read. FASTQ databases like this (going beyond simple file storage, of course) are not on our radar when the reads are short and come in immense quantities. Once the meta-information in sequencing reads becomes richer and more valuable, databases might be viable. I would suggest looking into that. In the case of MinION they even have an HDF format to start with. Amusingly, each read is stored as a separate database! It would be nice to be able to query for, say, high-quality reads, or reads over a certain length, etc.
Personally, I think SQL for raw data is kind of a bad idea - it's much better suited for processed categorical data and/or highly relational data.
Having said that -- I'm going crazy right now with writing BAM files over and over and over.
Want to mark duplicates on a 100GB BAM? You'll have to write another 100GB BAM.
Want to realign indels on a 100GB BAM? You'll have to write another 100GB BAM.
And pretty much any other BAM operation that involves modifying the data. I think to follow the GATK best practices on 50 samples, I've written probably a petabyte of data to disk so far. Obviously I've had to write over old data, because I don't have a petabyte hard drive. The point is, if the reads were stored in a structured database, I could modify individual reads without having to copy all the other data too.
Just a thought, for maybe your next project :)
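For what it's worth, the "modify individual reads in place" idea looks something like this in a database - a sketch assuming reads live in a hypothetical SQLite table: an UPDATE touches only the affected rows instead of rewriting a 100GB file.

```python
import sqlite3

con = sqlite3.connect("reads.db")
# Run once: add a duplicate flag column to the existing table.
con.execute("ALTER TABLE reads ADD COLUMN is_duplicate INTEGER DEFAULT 0")

dup_ids = ["read_00042", "read_00317"]          # made-up read identifiers
with con:                                       # single transaction
    con.executemany("UPDATE reads SET is_duplicate = 1 WHERE read_id = ?",
                    [(rid,) for rid in dup_ids])
```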
If you are considering putting pileup (wig/bed) data into SQL, I did that about two years ago and the results were pretty lackluster. Here, each column is a sample and each row is a base.
Of course, grabbing signal from a specific region over multiple files is really fast - about 2x faster for 10 columns than from 10 individual bigwigs, and the gains increase the more samples you grab data from at once.
But the size of the data is really significant. In the chart below, the c3 format is essentially a standard constant-step wig (stored in a binary format), and v2 is a variable-step binary format similar to BigWig but without the indexing. As you can see, SQL doesn't really offer a size improvement on constant-step files until you get way up into the 50s - and it will never be anywhere near as small as the variable-step formats.
So what you gain in speed, you lose 10-fold in size. You might be thinking "oh, well that's a fair trade-off..?"
Well no, because if you load 50 v2 files into memory, they fit just fine - and they can be accessed 1000x faster than SQL.
So the moral of the story is - SQL is great for compatibility, and great for when data is being constantly updated -- but truthfully it's the tiny binary formats that win all the races. This is why, after all, FASTQ, BAM, BigWig, etc. all exist in the first place.
BTW, with pyBigWig (and libBigWig), you can easily create variable-step bigWig files that lack zoom levels. They still have an index, but the file size is cut down substantially. Note that IGV et al. will then break, because they require zoom levels... but if you're just interested in interval:value associations then that's not an issue.
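A sketch of that pyBigWig usage, as I understand the API (chromosome sizes and values are made up): passing maxZooms=0 when writing the header skips the zoom levels.

```python
import pyBigWig

bw = pyBigWig.open("intervals.bw", "w")
bw.addHeader([("chr1", 1_000_000)], maxZooms=0)        # no zoom levels written
bw.addEntries(["chr1", "chr1"], [100, 500],            # bedGraph-like entries
              ends=[200, 650], values=[1.5, 3.0])
bw.close()
```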
I would say writing BAMs multiple times is the problem with the best practice. MarkDuplicates can be streamed on unsorted BAM. IndelRealigner is useless for GATK-HC and can also be streamed on sorted BAM. Variant calling can be streamed. BQSR collection can be streamed pre-sorting and then applied along with sorting (I actually prefer not to apply BQSR). The only thing that can't be streamed is sorting. In all, if you do it right, you only need to write temporary files once, for sorting. Reimplementing the whole pipeline in the right way can be several times faster. The best practice is outdated given improved raw data, new algorithms and computer hardware.
On mutable BAM, there is BioHDF. It supports alignment editing while achieving a size comparable to BAM. GAP5 also has an alignment format supporting editing and compression. It has better performance than BioHDF.
In your SQL, are you merging all samples into one table? That is cheating ;-) I bet if you design a multi-sample wig, the access time will be better than SQL's, probably by far. How to achieve variable sizes will be an issue, though.
Oh - that's interesting. I never thought about re-writing their pipeline because, you know... "best practice" :P
But it's a good point - I don't really see why I couldn't have streamed everything up until that alignment command + merge, and then all the indel stuff and beyond. I saw Pierre say he even processes chromosomes individually to get better parallelization, so it seems like there are lots of tricks that people are using ;-)
Someone should make a blog post called "Better Best Practices" with the shell script to just go FASTQs in, BAM + gVCF out :P Having said that, the Best Practices are nice because they walk you through what each command does.
I know nothing about GAP5 or BioHDF. I remember reading the BioHDF paper and thinking it was neat - but without widespread adoption I clearly partitioned it into "interesting but not yet important" and then forgot all about it. I really wish people wouldn't publish new formats without giving at least 1 practical example/pipeline. It feels like such a shame. Just because a format is superior, doesn't really mean anything if you can't do anything with it :-/
And yes, I totally cheated by having the SQL store only one POS column while the binary wigs/bigwigs/etc. store it once per file (but that was exactly what I was trying to show in this slide/chart). The "c" and "v" binary formats do actually support multiple sample columns (but no indexing, since they are supposed to be kept in memory and bisected). For the constant-interval format this obviously saves a lot of space (a 4-byte column per sample), whereas for the variable-interval format, joining two datasets pretty much always resulted in a file bigger than the sum of the individual files, which I found interesting - even for ChIP tracks where peaks overlap across multiple samples. I suppose the background noise accounts for 90% of the data unless you bin the signal intensity...
I think Brad Chapman's whole blog could be retitled, "better best practices for variant calling" :)
Though tools don't support it, one could in theory modify uncompressed BAM files in place. Indel realignment would be somewhat difficult, but that's not needed with the haplotype caller anyway.
Uncompressed BAM as in SAM? I think you're right though - some fields in the BAM are always a certain number of bits long (like the flag column) so it should be modifiable I guess. I don't really know much about how the BAM is encoded. "import hts" :P
Maybe those other formats (CRAM etc.) can do it? It would be a nice selling point if some tools could do in-place modification.
And you're right about Haplotype Caller not even needing any of this stuff anymore. I found that out just now after doing all the processing -_-; Still, the perfectionist in me would feel uncomfortable having base quality scores that aren't perfect even if they're never used, hahah! ( or maybe that's the time-waster in me :~| )
An uncompressed BAM file is a BAM file that's not compressed, not a SAM file :)
> but I need a "no server" solution. Also, a sqlite db can be sent to another person or shared somehow.
You're looking for HDF5: https://en.wikipedia.org/wiki/Hierarchical_Data_Format
What kind of analysis requires fastqs in a sqlite3 db?