Question

Forum: How do you organize sequence data (FASTQs)?

7.1 years ago

Daniel E Cook ▴ 280

I work in a lab that has done a lot of sequencing. We've sequenced many samples over time, on different platforms (Illumina, PacBio), with different read lengths (50-150bp), different library preparation methods (tagmentation, sonication), different sequencing centers, and different types of sequencing (RNA-Seq, WGS, small RNA, etc.).

At this point, we have terabytes of sequencing data and it's becoming unwieldy. I have a variety of questions about how to manage sequence data and how to use it.

How do you organize your sequencing data?
What file structure do you use?
How do you retain metadata regarding the sequencing data?
How do you back up your sequence data?
How do FASTQs get fed into your bioinformatic pipelines?

It would be great to get a discussion going in this regard. I haven't seen too many questions on this subject (please direct me if I am mistaken!), but I imagine it is a problem many have to deal with.

Thanks!

RNA-Seq sequencing • 3.0k views

Updated 24 months ago by Ram 45k • written 7.1 years ago by Daniel E Cook ▴ 280

This discussion can be broadly split into two categories.

Individual labs (like @Daniel's)
Core facilities

Requirements for these two are going to be very different. So it may be best to include in your answer which category you are referring to.

7.1 years ago by GenoMax 151k

How do you back up your sequence data?

Use NCBI SRA as a backup (in addition to tape, disks, cloud, etc.)? Let NCBI take care of it. Even unpublished data can be uploaded under an embargo until the publication is out.

7.1 years ago by GenoMax 151k

See also this old post: "How Do You Manage Your Files & Directories For Your Projects?" (8 years ago?!)

7.1 years ago by Pierre Lindenbaum 166k

score 1 · Answer 1 · 2018-04-20

How do you organize your sequencing data?

Renaming the files works best for me. For example:

Raw file: ABC_XXX_R1.fq.gz
Renamed : Project_001_Sample_001_Organism_abc_R1.fq.gz

The name could be shortened, with a metadata or README.txt file accompanying the reads in the same folder.
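A minimal sketch of that renaming scheme in Python. The function names and the zero-padded three-digit numbering are assumptions read off the single example above, not a fixed convention.

```python
from pathlib import Path


def standard_name(project: int, sample: int, organism: str, read: int) -> str:
    """Build a name in the Project/Sample/Organism/read pattern shown above."""
    if read not in (1, 2):
        raise ValueError("read must be 1 or 2")
    return (
        f"Project_{project:03d}_Sample_{sample:03d}"
        f"_Organism_{organism}_R{read}.fq.gz"
    )


def rename_fastq(raw: Path, project: int, sample: int,
                 organism: str, read: int) -> Path:
    """Rename one raw FASTQ within its folder and return the new path.

    Refuses to overwrite an existing file, since raw reads are
    expensive to regenerate.
    """
    target = raw.with_name(standard_name(project, sample, organism, read))
    if target.exists():
        raise FileExistsError(target)
    raw.rename(target)
    return target


print(standard_name(1, 1, "abc", 1))
# Project_001_Sample_001_Organism_abc_R1.fq.gz
```

It is probably worth recording the original vendor file name in the README as well, so the mapping can be traced back later.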

What file structure do you use?

A very basic one:

├───project_01_Apr_2018
│       Project_001_Sample_001_Organism_abc_R1.fq.gz
│       Project_001_Sample_001_Organism_abc_R2.fq.gz
│       ReadMe.txt
│
└───project_02_May_2018
        Project_002_Sample_002_Organism_acc_R1.fq.gz
        Project_002_Sample_002_Organism_acc_R2.fq.gz
        ReadMe.txt
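With a flat layout like this, finding the R1/R2 mates for each sample is a short script. This is only a sketch: the function name `paired_reads` is made up, and it assumes every read file ends in `_R1.fq.gz` or `_R2.fq.gz` as in the tree above.

```python
from collections import defaultdict
from pathlib import Path


def paired_reads(project_dir: Path) -> dict[str, dict[str, Path]]:
    """Group the *_R1.fq.gz / *_R2.fq.gz files of one project folder by sample prefix."""
    pairs: dict[str, dict[str, Path]] = defaultdict(dict)
    for fq in sorted(project_dir.glob("*.fq.gz")):
        stem = fq.name[: -len(".fq.gz")]
        # Everything before the last underscore is the sample prefix,
        # the part after it is the read label (R1 or R2).
        prefix, _, read = stem.rpartition("_")
        if read in ("R1", "R2"):
            pairs[prefix][read] = fq
    return dict(pairs)
```

Calling `paired_reads(Path("project_01_Apr_2018"))` would return one entry keyed by `Project_001_Sample_001_Organism_abc`, holding the `R1` and `R2` paths; the ReadMe.txt is ignored.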

How do you retain metadata regarding the sequencing data?

By storing information such as:

vendor
date of data generation
organism
application (WGS, 16S)
platform
chemistry
data size
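One way to keep those fields machine-readable is a small sidecar file next to the reads. The sketch below uses JSON and a file called `metadata.json`; both are assumptions made for illustration (a plain README.txt works too), and the field names are just the list above turned into identifiers.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class RunMetadata:
    vendor: str
    date_generated: str   # ISO date string, e.g. "2018-04-20"
    organism: str
    application: str      # e.g. "WGS", "16S"
    platform: str
    chemistry: str
    data_size_bytes: int


def write_metadata(project_dir: Path, meta: RunMetadata) -> Path:
    """Write the record as metadata.json inside the project folder."""
    out = project_dir / "metadata.json"
    out.write_text(json.dumps(asdict(meta), indent=2) + "\n")
    return out


def read_metadata(project_dir: Path) -> RunMetadata:
    """Load the record back from metadata.json."""
    return RunMetadata(**json.loads((project_dir / "metadata.json").read_text()))
```

A fixed set of fields makes it easy to collect the files from all project folders into one table later.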

How do you back up your sequence data?

External hard disk. Project management software works best.

How do FASTQs get fed into your bioinformatic pipelines?

Pre-processing the data
Genome assembly
This really depends on the case

score 1 · Answer 2 · 2018-04-20

7.1 years ago

Pierre Lindenbaum 166k

How do you retain metadata regarding the sequencing data?

Don't store FASTQ; use uBAM (unaligned BAM) and store whatever you want in the BAM header and read groups.

* FASTQ must die! Long live SAM/BAM! *

https://blastedbio.blogspot.fr/2011/10/fastq-must-die-long-live-sambam.html
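To make the read-group idea concrete, here is a rough sketch that builds a SAM `@RG` header line in plain Python. The tags used (ID, SM, PL, LB, CN, DT) are defined in the SAM specification; the helper name and all the example values are hypothetical, and attaching the line to an actual uBAM is left to whatever tool you use for the conversion.

```python
def read_group_line(rg_id: str, sample: str, platform: str, **extra: str) -> str:
    """Build a tab-separated SAM @RG header line.

    ID is the mandatory read-group identifier; SM (sample) and PL (platform)
    are standard tags. Further two-letter tags such as LB (library),
    CN (sequencing center) or DT (run date) can be passed as keyword arguments.
    """
    fields = {"ID": rg_id, "SM": sample, "PL": platform, **extra}
    for tag, value in fields.items():
        if len(tag) != 2:
            raise ValueError(f"SAM header tags are two characters: {tag!r}")
        if "\t" in value or "\n" in value:
            raise ValueError(f"value for {tag} must not contain tabs or newlines")
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields.items())


line = read_group_line(
    "run1.lane1", "sample1", "ILLUMINA",
    LB="tagmentation_lib1", CN="center_A", DT="2018-04-20",
)
print(line.replace("\t", " | "))
# @RG | ID:run1.lane1 | SM:sample1 | PL:ILLUMINA | LB:tagmentation_lib1 | CN:center_A | DT:2018-04-20
```

Library prep method, sequencing center and run date then travel with the reads instead of living in a separate spreadsheet.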

score 1 · Answer 3 · 2018-04-20

We don't really store FASTQ anymore. Reads, aligned or unaligned, are stored as sorted BAM. In the rare case that we need the FASTQ again, we just convert the BAM back.

How do you organize your sequencing data?

We store all genomics-related data in AWS S3, grouped by a unique ID for each project, e.g. projects/genomics/GE0001/sample1/sample1.sorted.bam, where GE stands for Genomic Projects.
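A small helper keeps keys in that layout consistent. This is a sketch only: the function name is made up, and the rule that a project ID is "GE" plus four digits is an assumption based on the single example GE0001.

```python
import re
from pathlib import PurePosixPath

# Assumed ID format: "GE" followed by four digits, as in GE0001.
PROJECT_ID = re.compile(r"GE\d{4}")


def s3_key(project_id: str, sample: str, filename: str,
           root: str = "projects/genomics") -> str:
    """Return the object key <root>/<project_id>/<sample>/<filename>."""
    if not PROJECT_ID.fullmatch(project_id):
        raise ValueError(f"unexpected project ID: {project_id!r}")
    return str(PurePosixPath(root, project_id, sample, filename))


print(s3_key("GE0001", "sample1", "sample1.sorted.bam"))
# projects/genomics/GE0001/sample1/sample1.sorted.bam
```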

What file structure do you use?

We primarily use Snakemake with wildcards to build DAGs, so it's natural to simply append an extension to indicate what the file is for. For instance, sample1.star.coding.altevents.a5.disease.txt indicates that the data was processed with STAR, filtered for coding regions, and contains alternative 5' splice site (5'ss) events with disease annotations.
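Because each processing step is just another dot-separated suffix, the history of a file can be read back from its name. A minimal sketch, assuming sample names themselves contain no dots (the function name is hypothetical):

```python
def parse_output_name(filename: str) -> dict:
    """Split <sample>.<step>...<step>.<ext> into its parts.

    Assumes the sample name contains no dots, so the first field is the
    sample, the last is the file extension, and everything in between is
    the chain of processing steps in the order they were applied.
    """
    parts = filename.split(".")
    if len(parts) < 2:
        raise ValueError(f"no extension in {filename!r}")
    sample, *steps, ext = parts
    return {"sample": sample, "steps": steps, "ext": ext}


print(parse_output_name("sample1.star.coding.altevents.a5.disease.txt"))
# {'sample': 'sample1', 'steps': ['star', 'coding', 'altevents', 'a5', 'disease'], 'ext': 'txt'}
```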

How do you retain metadata regarding the sequencing data?

For the most part, plain old YAML, located in each genomic project directory.

How do you back up your sequence data?

We use S3 versioning. For internal data, we also keep copies of the FASTQ or BAM files on a local hard drive.

How do FASTQs get fed into your bioinformatic pipelines?

We run Snakemake in an AWS Batch environment. Data are stored permanently in S3, staged in to temporary storage while the workflow executes, and staged out to S3 when done.
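The stage-in / stage-out pattern can be sketched without any AWS calls. In the example below a local directory stands in for the S3 bucket, and the `staged` helper is hypothetical; in the real setup the copies would be S3 downloads and uploads handled by the workflow engine.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def staged(store: Path, inputs: list[str]) -> Iterator[Path]:
    """Stage inputs into scratch space, yield it, then stage new files back.

    `store` plays the role of permanent storage. If the body of the `with`
    block raises, nothing is staged out; the scratch directory is removed
    either way.
    """
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        for name in inputs:                       # stage in
            shutil.copy2(store / name, work / name)
        yield work
        for path in work.iterdir():               # stage out new files only
            if path.is_file() and path.name not in inputs:
                shutil.copy2(path, store / path.name)
```

Used as `with staged(store, ["sample1.sorted.bam"]) as work: ...`, any file the workflow writes into `work` ends up in the store once the block exits cleanly, and the inputs are never copied back.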