How to manage a growing library of genome sequences?
8.7 years ago

In our lab, members routinely generate NGS data, and we also receive sequences from other labs. The sequences are stored on a NAS server. However, there is no centralized scheme for storing them, so over time many sequences become "orphaned" - it's not clear which experiment they're from, when they were generated, or who generated them.

A low-tech approach (keeping a spreadsheet with the metadata) seems unenforceable and error-prone. It would be nice to have a tool that integrates with the NAS's filesystem and stores metadata about the sequences. Is there such a tool? And how do you manage your sequence libraries?

NGS • 2.2k views
8.7 years ago

I haven't implemented this in my lab yet, but I plan to create an RDF(?)-based database of the files on my server.

Instead of using the file path as the key, I would use the SHA-1 checksum of the file.
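For example, a content-based key could be computed like this (a minimal Python sketch; chunked reading keeps memory use constant on large fastq/BAM files):

```python
import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Unlike a path, the checksum survives renames and moves, so the same file can be recognized wherever it ends up on the NAS.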

A similar idea was to write a program that scans the NGS files and produces an XML file, which is a good starting point for getting the current state of my files: https://github.com/lindenb/jvarkit/wiki/NgsFilesScanner

<?xml version="1.0" encoding="UTF-8"?>
<ngs-files>
  (...)
 <vcf timestamp="1398643093000" file="/commun/data/projects/path/Samples/S2/S2.varscan.annotations.vcf.gz" filename="S2.varscan.annotations.vcf.gz" modified="Mon Apr 28 01:58:13 CEST 2014" size="21053412">
    <samples>
      <sample>S2</sample>
    </samples>
  </vcf>
</ngs-files>
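NgsFilesScanner itself is a Java tool, but the scanning idea can be sketched in a few lines of Python (a hypothetical re-implementation producing a similar layout; extracting sample names from the VCF header is omitted):

```python
import os
import time
import xml.etree.ElementTree as ET

def scan_ngs_files(root_dir):
    """Walk root_dir and build an XML inventory of VCF files,
    mimicking the <ngs-files> layout above."""
    root = ET.Element("ngs-files")
    for dirpath, _dirs, files in os.walk(root_dir):
        for name in files:
            if name.endswith((".vcf", ".vcf.gz")):
                path = os.path.join(dirpath, name)
                st = os.stat(path)
                vcf = ET.SubElement(root, "vcf", {
                    "timestamp": str(int(st.st_mtime * 1000)),
                    "file": os.path.abspath(path),
                    "filename": name,
                    "modified": time.ctime(st.st_mtime),
                    "size": str(st.st_size),
                })
                # a real scanner would parse the #CHROM line for sample names
                ET.SubElement(vcf, "samples")
    return ET.tostring(root, encoding="unicode")
```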
8.7 years ago

Hi - I think that's a good question, and this is my (current) solution.

I've set up a database (PostgreSQL + Django) with, among other things, a "fastqfile" table with a "filename" column (primary key), an "md5sum" column, and a "library_id" column with a foreign key to a "libraries" table holding information about each library.

This way, when I get new fastq file(s), I'm forced to assign each fastq a parent library, which in turn has to be described in the database. It works OK-ish: things stay organized, and I don't get orphan files or name collisions. The main issue, maybe, is that when I get a new file I need to spend some time on the bureaucratic work of complying with the database requirements instead of going straight to the data analysis, but I guess this is somewhat inevitable.
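A minimal sketch of that schema, using SQLite rather than PostgreSQL + Django for brevity (the table and column names follow the description above; everything else is an assumption):

```python
import sqlite3

# Hypothetical schema: every fastq must point at a registered library.
SCHEMA = """
CREATE TABLE libraries (
    library_id  TEXT PRIMARY KEY,
    description TEXT
);
CREATE TABLE fastqfile (
    filename   TEXT PRIMARY KEY,
    md5sum     TEXT NOT NULL,
    library_id TEXT NOT NULL REFERENCES libraries(library_id)
);
"""

def register_fastq(conn, filename, md5sum, library_id):
    """Insert a fastq record; fails if the parent library is not registered."""
    conn.execute(
        "INSERT INTO fastqfile (filename, md5sum, library_id) VALUES (?, ?, ?)",
        (filename, md5sum, library_id),
    )
```

With foreign keys enforced (in SQLite this needs `PRAGMA foreign_keys = ON` per connection), registering a fastq against an unknown library raises an error, which is exactly the no-orphans, no-collisions guarantee described above.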


As Peter Cock suggested:

A nice idea would be to put your reads into an unsorted BAM instead of a standard FASTQ: one can add any amount of metadata in the BAM header.
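For illustration, such metadata can live in the standard @RG (read group) and @CO (free-text comment) header records of a SAM/BAM file; the values below are hypothetical:

```
@HD	VN:1.6	SO:unsorted
@RG	ID:run42.lane1	SM:S2	LB:lib1	PL:ILLUMINA	DT:2014-04-28	CN:our_lab
@CO	project:example_project contact:jdoe@example.org received_from:other_lab
```

The @RG fields (sample, library, platform, date, sequencing centre) cover much of the provenance that goes missing with bare FASTQ files, and @CO lines can carry anything else.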


See also the reply from the GATK developers confirming that they use unmapped SAM/BAM in production and recommend it over FASTQ in their Best Practices documentation.


I agree. I haven't done this myself yet, as I am just getting things up and running for both my own lab and a clinical NGS service. I'm definitely planning on using unsorted SAM/BAM for long-term retention of the "raw" sequence data rather than the original FASTQ files; being able to encode the metadata into the file itself is the main reason, although I will also be running a samples database.


Hi guys. Yes, I agree that fastq could be replaced by bam. I'm not doing it, since I don't want to add an extra step to the data processing (i.e. more "bureaucracy") and some tools expect fastq anyway, but I would happily have sequencers spit out bams instead of fastqs. I think at the Sanger Institute they produce bam instead of fastq by default. In any case, as Dan Gaston says, I would run a database.

8.7 years ago
DG 7.3k

I like database solutions myself, but, as with spreadsheets, you have to find a way to enforce them and not just allow people to drop files on the server. One way to do it, depending on your NAS set-up, would be to enforce file submission via a web interface rather than the command line, so that the only way to get data onto the server is through the web, and you're required to fill out the appropriate metadata before submission. That can become quite cumbersome, though. I think many of the off-the-shelf genomic data management solutions, like Arvados, are really geared towards sequencing-centre-scale storage and deployment on compute clusters, etc.
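The single-enforced-entry-point idea can be sketched as a gatekeeper function that a web upload handler would call before anything lands on the NAS (all names here are hypothetical; a real system would also record the metadata in a database):

```python
import os
import shutil

# Fields every submission must supply before a file is accepted.
REQUIRED_FIELDS = {"experiment", "date_generated", "generated_by"}

def submit_sequence(src_path, store_dir, metadata):
    """Copy a sequence file into the store only if all required metadata
    fields are present; write the metadata to a sidecar file next to it."""
    missing = REQUIRED_FIELDS - metadata.keys()
    if missing:
        raise ValueError("missing metadata: %s" % ", ".join(sorted(missing)))
    os.makedirs(store_dir, exist_ok=True)
    dest = os.path.join(store_dir, os.path.basename(src_path))
    shutil.copy2(src_path, dest)
    with open(dest + ".meta", "w") as fh:
        for key in sorted(metadata):
            fh.write("%s\t%s\n" % (key, metadata[key]))
    return dest
```

If this function is the only write path to the store, files can no longer arrive without an experiment, a date, and an owner attached - the three things the question says go missing.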

