In our lab, members routinely generate NGS data, and we also receive sequences from other labs. The sequences are stored on a NAS server. However, there is no centralized scheme for storing them, so over time many sequences become "orphaned": it's not clear which experiment they belong to, when they were generated, or who generated them.
A low-tech approach (keeping a spreadsheet of the metadata) seems unenforceable and error-prone. It would be nice to have a tool that integrates with the NAS's file system and stores metadata about the sequences. Is there such a tool? And how do you manage your sequence libraries?
As Peter Cock suggested:
A nice idea would be to put your reads into an unsorted BAM instead of a standard FASTQ: you can store any amount of metadata in the BAM header, e.g. in @RG read-group fields or free-text @CO comment lines.
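For example, with pysam (assuming it's installed), something like the sketch below writes reads into an unmapped BAM whose header carries the run metadata. The sample name, date, and @CO fields are made-up placeholders, not a prescribed scheme:

    # Minimal sketch: write reads to an unsorted, unmapped BAM whose
    # header holds the experiment metadata. All tag values are placeholders.
    import pysam

    header = {
        "HD": {"VN": "1.6", "SO": "unsorted"},
        "RG": [{"ID": "run42", "SM": "sampleA", "DT": "2013-05-01",
                "PL": "ILLUMINA", "CN": "our_lab"}],
        # Free-text @CO lines can hold anything else, e.g. who generated the run.
        "CO": ["experiment:knockout_timecourse", "generated_by:jdoe"],
    }

    with pysam.AlignmentFile("reads.unmapped.bam", "wb", header=header) as out:
        # Stand-in for a real FASTQ parser: one hard-coded read.
        for name, seq, qual in [("read1", "ACGT", "IIII")]:
            a = pysam.AlignedSegment()
            a.query_name = name
            a.query_sequence = seq  # set sequence before qualities
            a.query_qualities = pysam.qualitystring_to_array(qual)
            a.flag = 4               # 0x4: segment unmapped
            a.reference_id = -1      # no reference sequence
            a.reference_start = -1
            a.set_tag("RG", "run42") # link each read to the read group above
            out.write(a)

Reading the metadata back is just a matter of opening the file with pysam.AlignmentFile("reads.unmapped.bam", "rb", check_sq=False) and inspecting its .header (check_sq=False is needed because an unmapped BAM has no @SQ lines).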
See also the reply from the GATK developers confirming that they use unmapped SAM/BAM in production and recommend it over FASTQ in their Best Practices documentation.
I agree. I haven't done this myself yet, as I'm just getting things up and running for both my own lab and a clinical NGS service, but I'm definitely planning on keeping unsorted SAM/BAM for long-term retention of the "raw" sequence data instead of the original FASTQ files. Being able to encode the metadata in the file itself is the main reason. I will also be running a samples database, though.
Hi guys. Yes, I agree that FASTQ could be replaced by BAM. I'm not doing it, since I don't want to add an extra step to the data processing (i.e. more "bureaucracy") and some tools expect FASTQ anyway, but I would happily have sequencers spit out BAMs instead of FASTQ. I think the Sanger Institute produces BAM instead of FASTQ by default. In any case, as Dan Gaston says, I would run a database.
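For the database part, even plain SQLite is enough to start with. A minimal sketch using only the Python standard library; the table layout and field names are just illustrative, not a recommended schema:

    # Minimal sketch of a "samples database": one table mapping each file
    # on the NAS to its metadata. Column names here are illustrative.
    import sqlite3

    con = sqlite3.connect("sequences.db")
    con.execute("""
        CREATE TABLE IF NOT EXISTS sequence_files (
            path         TEXT PRIMARY KEY,  -- location on the NAS
            sample       TEXT NOT NULL,
            experiment   TEXT,
            generated_on TEXT,              -- ISO date
            generated_by TEXT,
            md5          TEXT               -- checksum to spot moved/duplicated files
        )
    """)
    con.execute(
        "INSERT OR REPLACE INTO sequence_files VALUES (?, ?, ?, ?, ?, ?)",
        ("/nas/runs/run42/reads.unmapped.bam", "sampleA",
         "knockout_timecourse", "2013-05-01", "jdoe",
         "d41d8cd98f00b204e9800998ecf8427e"),
    )
    con.commit()

With something like this in place, "orphaned" files are simply NAS paths that have no row in the table, which a periodic scan can flag.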