I'm looking for a system to store human NGS data and metadata, and to retrieve data. We have a storage server with a proper distributed filesystem (Isilon OneFS).
There are some other posts discussing this topic, for example:
- A: How Do You Store And Share Your Bioinformatics Data? (Fasta, Fastq, Sff, Etc.)
- A: Using Hdf5 To Store Bio-Data
- Storage Solutions For Genomic Research Centers
But I wanted to make a new post because (1) those posts are several years old, and I imagine practices are different today, and (2) they discuss file formats and distributed file systems a lot, while I'm more interested in ways to access data.
I would like to have a system, preferably with a GUI (browser is also fine), where I can search for an individual (pseudonym ID), and retrieve their data:
- Raw NGS data (FASTQ)
- Aligned reads (BAM)
- Variants (VCF)
- Metadata, for example whether the individual is part of a trio, whether the individual was sequenced more than once, how the individual was sequenced, etc.
I also want to be able to retrieve data (VCF or BAM or whatever is specified) from a list of individual IDs.
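To make the lookup described above concrete, here is a minimal sketch of a file catalog, assuming a small SQLite database that maps pseudonym IDs to file paths on the storage server. The table names, column names, IDs and paths are all hypothetical; this is only one simple way to do it, not a recommendation of a specific product.

```python
import sqlite3

def build_catalog(conn):
    """Create two hypothetical tables: one for individuals, one for their files."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS individual (
            pseudonym_id TEXT PRIMARY KEY,
            trio_id      TEXT,   -- NULL if not part of a trio
            platform     TEXT    -- how the individual was sequenced
        );
        CREATE TABLE IF NOT EXISTS data_file (
            pseudonym_id TEXT NOT NULL REFERENCES individual(pseudonym_id),
            file_type    TEXT NOT NULL CHECK (file_type IN ('FASTQ', 'BAM', 'VCF')),
            path         TEXT NOT NULL,
            run_id       TEXT    -- distinguishes repeated sequencing runs
        );
    """)

def files_for(conn, ids, file_type):
    """Return (pseudonym_id, path) rows for a list of IDs and one file type."""
    ids = list(ids)
    if not ids:
        return []
    # One "?" placeholder per ID keeps the query parameterized.
    placeholders = ",".join("?" for _ in ids)
    query = (
        "SELECT pseudonym_id, path FROM data_file "
        f"WHERE file_type = ? AND pseudonym_id IN ({placeholders}) "
        "ORDER BY pseudonym_id, path"
    )
    return conn.execute(query, [file_type, *ids]).fetchall()

# Demo with made-up individuals and made-up paths.
conn = sqlite3.connect(":memory:")
build_catalog(conn)
conn.executemany(
    "INSERT INTO individual VALUES (?, ?, ?)",
    [("P001", "TRIO1", "WGS"), ("P002", "TRIO1", "WES"), ("P003", None, "WGS")],
)
conn.executemany(
    "INSERT INTO data_file VALUES (?, ?, ?, ?)",
    [
        ("P001", "VCF", "/ifs/ngs/P001/run1.vcf.gz", "run1"),
        ("P001", "BAM", "/ifs/ngs/P001/run1.bam", "run1"),
        ("P002", "VCF", "/ifs/ngs/P002/run1.vcf.gz", "run1"),
        ("P003", "VCF", "/ifs/ngs/P003/run1.vcf.gz", "run1"),
    ],
)
conn.commit()

print(files_for(conn, ["P001", "P002"], "VCF"))
```

The large files stay where they are on the file system; only their locations and the metadata live in the database, so a GUI or browser front end would just need to query these two tables.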
Some nice-to-haves:
- Retrieve variants from individual lists in specified gene(s), loci or type of variation.
- Incorporating genome browsers such as ExAC.
- Or a different kind of genome browser like IGV.
- Familial relationships, for example as in Family Genome Browser (FGB)
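As an illustration of the first nice-to-have (variants from a list of individuals within a locus), here is a rough sketch that scans the lines of an uncompressed multi-sample VCF in plain Python. The function name, sample IDs and coordinates are made up, and the overlap test looks only at POS, which is a simplification. For indexed, compressed VCFs, tools such as bcftools or tabix are the usual choice.

```python
def variants_in_region(vcf_lines, sample_ids, chrom, start, end):
    """Yield (chrom, pos, ref, alt, {sample: genotype}) for records whose POS
    lies in chrom:start-end (1-based, inclusive) and where at least one of the
    requested samples carries a non-reference allele."""
    columns = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Map each column name (including sample IDs) to its index.
            columns = {name: i for i, name in enumerate(fields)}
            continue
        if columns is None:
            raise ValueError("VCF header line (#CHROM ...) not found")
        pos = int(fields[1])
        if fields[0] != chrom or not (start <= pos <= end):
            continue
        gt_index = fields[8].split(":").index("GT")
        genotypes = {}
        for sample in sample_ids:
            gt = fields[columns[sample]].split(":")[gt_index]
            alleles = gt.replace("|", "/").split("/")
            # Keep the sample if any allele is neither reference nor missing.
            if any(allele not in ("0", ".") for allele in alleles):
                genotypes[sample] = gt
        if genotypes:
            yield fields[0], pos, fields[3], fields[4], genotypes

def _row(*fields):
    """Join fields with tabs, as required by the VCF format."""
    return "\t".join(fields)

# Made-up three-sample VCF content for demonstration.
demo_vcf = [
    "##fileformat=VCFv4.2",
    _row("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
         "FORMAT", "P001", "P002", "P003"),
    _row("17", "1500", ".", "G", "A", "50", "PASS", ".", "GT", "0/1", "0/0", "1/1"),
    _row("17", "1800", ".", "C", "T", "50", "PASS", ".", "GT", "0/0", "0/0", "0/1"),
    _row("17", "2500", ".", "T", "C", "50", "PASS", ".", "GT", "1/1", "0/0", "0/0"),
]

hits = list(variants_in_region(demo_vcf, ["P001", "P002"], "17", 1000, 2000))
print(hits)
```

Filtering by gene would need a separate gene-to-coordinates lookup, which is not shown here.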
Some examples of software I am unsure of:
Any input on this topic would be greatly appreciated.
Are you willing to purchase or looking for a free solution?
Purchasing is an option.
While a system of this type sounds simple, getting freeware or an off-the-shelf commercial solution to fit your internal business practices can easily become a huge pain in the you know what. Most times this is because locals are unwilling to change their business practices, or unable to map existing practices onto a ready-made solution. This is guaranteed to cause pain for many unless you have plenty of resources (i.e. developers) to throw at this.
Looking at your user profile you seem to be at an institution that is in this for the long term. So if you have internal developer resources, then putting a solution together that fits your needs (keeping very simple/realistic goals, which is extremely important) may prove to be the best solution.
Also take a look at this old thread: Is there a Lims that doesn't suck? Issues mentioned in that thread (unfortunately) remain current. But it does have useful information about various packages.
We are sort of building everything from the ground up, so I don't know how much we have to adapt to existing practices.
We don't have a lot of resources, so we can't expect to develop a complex system for ourselves.
If there is no suitable system we can buy, then perhaps we need to consider developing our own. But most likely I will have to do it myself, so it will have to be very simple.
Hello, any updates on this? I am currently exploring options for managing NGS data in our lab, and the lab doesn't mind paying for good software that can effectively manage, track, and retrieve data from local storage.
I think you should separate raw data from accessible information. The raw data can be stored, for instance, as BAM files on a slow file system. The accessible data would be the SNPs, coverage, etc. It should be pre-computed, or computed on request, and then loaded into the database. In my opinion there is no reason to keep BAM files readily accessible. You do have to narrow your analysis results down to pre-defined questions, or allow some time to generate the relevant data, but the saving in fast storage is huge. If you are looking for a commercial solution you can check out SQREAM; they have (or at least had) dedicated solutions for systems like the one you described. Good luck.
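A minimal sketch of that split, assuming the pre-computed calls are loaded into a small SQLite table while the BAM files stay on slow storage. The table layout, function names and example rows are hypothetical, and a real system would likely also store coverage and annotation.

```python
import sqlite3

def load_calls(db, calls):
    """Load pre-computed (pseudonym_id, chrom, pos, ref, alt, genotype) rows."""
    db.execute("""
        CREATE TABLE IF NOT EXISTS variant_call (
            pseudonym_id TEXT    NOT NULL,
            chrom        TEXT    NOT NULL,
            pos          INTEGER NOT NULL,
            ref          TEXT    NOT NULL,
            alt          TEXT    NOT NULL,
            genotype     TEXT    NOT NULL
        )
    """)
    # An index on the locus keeps region queries from scanning the whole table.
    db.execute(
        "CREATE INDEX IF NOT EXISTS idx_call_locus ON variant_call (chrom, pos)"
    )
    db.executemany("INSERT INTO variant_call VALUES (?, ?, ?, ?, ?, ?)", calls)
    db.commit()

def carriers(db, chrom, start, end):
    """Return the calls in chrom:start-end (inclusive) without touching any BAM or VCF."""
    rows = db.execute(
        "SELECT pseudonym_id, pos, ref, alt, genotype FROM variant_call "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? "
        "ORDER BY pos, pseudonym_id",
        (chrom, start, end),
    )
    return rows.fetchall()

# Made-up calls standing in for the output of a variant-calling pipeline.
db = sqlite3.connect(":memory:")
load_calls(db, [
    ("P001", "17", 1500, "G", "A", "0/1"),
    ("P003", "17", 1500, "G", "A", "1/1"),
    ("P001", "17", 2500, "T", "C", "1/1"),
])

print(carriers(db, "17", 1000, 2000))
```

The trade-off is the one described above: questions that the table can answer are fast, while anything else means going back to the raw files and waiting.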
Good point, accessing sequence formats such as FASTQ and BAM will be rare, but not non-existent.
PathOS has some of the functionality you are looking for. You can search for patients (and maybe their metadata?), VCFs are displayed, and IGV is incorporated for displaying aligned reads. Their paper here.
Also, Molgenis and their NGS modules might be of use.
Thanks, Molgenis is exactly the kind of thing I need. It seems to have very advanced data management features, and is also geared towards biobanks.
However, it seems very complex, and being an open source and most likely government funded project, I'm not sure I can expect much in terms of stability and long-term support.
That is a given for pretty much all software. That is one of the reasons one is expected to pay for the value-add that a supporting entity guarantees, even though the software itself may be free.