Question

Q: Data files naming schemes

2

5.9 years ago

Darked89 4.7k

I am getting piles of FASTQ files whose names are either generic (R12345.r1.fq) or plainly confusing (170811p1pt.r1.fq).

While storing the md5, project name etc. helps a bit, I feel that without a massive-scale rename I will not be able to make sense of the results, or even get the results in the first place.

Do you guys require that such files carry some labels (dna, rna, net), followed if needed by, say, wgs, exo, tg1 etc.? Wet lab people are multiplexing samples and dumping sequencing folders with rather spartan Excel metadata. No LIMS, no consistent naming schemes.

I am renaming everything to stay sane (keeping CSV files with old_name, new_name, flowcell, machine_id, number of reads, run_date).
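
A minimal sketch of that kind of CSV-driven rename, for illustration only. It assumes a manifest with just the old_name and new_name columns; the function names, the log layout and the status values are made up for this example and are not taken from any existing tool.

```python
import csv
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def rename_from_manifest(manifest_csv, data_dir, log_csv, dry_run=True):
    """Rename files in data_dir as listed in manifest_csv (old_name, new_name).

    Each row is logged to log_csv with the file's MD5 and a status, so the
    mapping can be audited or reversed later. Existing targets are never
    overwritten (hypothetical policy chosen for this sketch).
    """
    data_dir = Path(data_dir)
    with open(manifest_csv, newline="") as src, open(log_csv, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["old_name", "new_name", "md5", "status"])
        for row in csv.DictReader(src):
            old = data_dir / row["old_name"]
            new = data_dir / row["new_name"]
            if not old.exists():
                writer.writerow([old.name, new.name, "", "missing"])
                continue
            if new.exists():
                writer.writerow([old.name, new.name, "", "target_exists"])
                continue
            checksum = md5sum(old)  # computed before the move, for the audit trail
            if not dry_run:
                old.rename(new)
            status = "planned" if dry_run else "renamed"
            writer.writerow([old.name, new.name, checksum, status])
```

Running it first with dry_run=True gives a log to eyeball before anything is actually moved.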

I will be grateful for suggestions on how to improve this. Going from CSV to a DB with a frontend is the obvious one.

sequencing • 1.4k views

ADD COMMENT • link updated 5.9 years ago by Pierre Lindenbaum 166k • written 5.9 years ago by Darked89 4.7k

0

A naming scheme that works universally is difficult to implement. If you deal with tens of thousands of samples for a large consortium project, then short of a LIMS/DB nothing will work.

One of the issues we deal with in a core facility is people naming their samples Sample_101, Sample_201 etc. While this makes perfect sense to them (a code, if you will), it obviously causes issues on the core's end. A unique, automatically generated identifier (which does not need to be human-readable) is one way of avoiding this. The names can also be translated on the fly: store the file under any name you want, and your users will see the name they are familiar with on the front end. This only works if they access the results you produce indirectly (via a portal, for example).
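
A rough sketch of the generated-ID-plus-translation idea, assuming the problem is different submitters reusing the same label. The SampleRegistry class, its method names and the file-name pattern are hypothetical; a real setup would keep the mapping in a database rather than in memory.

```python
import uuid


class SampleRegistry:
    """Maps user-supplied sample labels to generated, collision-free IDs."""

    def __init__(self):
        # sample_id -> (submitter, user_label); in-memory only for this sketch
        self._by_id = {}

    def register(self, submitter, user_label):
        """Create a new opaque ID for a sample and remember who called it what."""
        sample_id = uuid.uuid4().hex
        self._by_id[sample_id] = (submitter, user_label)
        return sample_id

    def storage_name(self, sample_id, read, ext="fq.gz"):
        """File name used on disk; never contains the user's label."""
        return f"{sample_id}.r{read}.{ext}"

    def display_name(self, sample_id):
        """Label shown to the user on the front end."""
        _submitter, user_label = self._by_id[sample_id]
        return user_label


registry = SampleRegistry()
id_a = registry.register("lab_a", "Sample_101")
id_b = registry.register("lab_b", "Sample_101")  # same label, different sample
print(registry.storage_name(id_a, 1), "->", registry.display_name(id_a))
print(registry.storage_name(id_b, 1), "->", registry.display_name(id_b))
```

Both labs keep seeing Sample_101, while the files on disk can never collide.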

If more than just you needs to access or work on the data, then implementing a proper tracking system will pay dividends in the long term, even after you leave.

ADD REPLY • link 5.9 years ago by GenoMax 153k

score 0 · Answer 1 · 2019-10-10

0

5.9 years ago

Pierre Lindenbaum 166k

For FASTQ, you can store everything as a uBAM (unmapped BAM) file. There you can put a description of the samples, the date, the project, etc. in the SAM header: https://gatkforums.broadinstitute.org/gatk/discussion/5990/what-is-ubam-and-why-is-it-better-than-fastq-for-storing-unmapped-sequence-data
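
To illustrate what such header metadata can look like, here is a small sketch that builds an @RG (read group) header line as plain text. The tags used (ID, SM, LB, PL, DS, DT) are standard SAM read-group tags; the helper name and the example values are invented for illustration, and how the line gets into the BAM depends on the tool you use to write it.

```python
def read_group_line(rg_id, sample, library=None, platform=None,
                    description=None, run_date=None):
    """Build a SAM @RG header line from sample metadata.

    Only ID is mandatory in a read group; tags left as None are skipped.
    """
    fields = [
        ("ID", rg_id),
        ("SM", sample),
        ("LB", library),
        ("PL", platform),      # e.g. ILLUMINA
        ("DS", description),   # free text: project, assay, notes
        ("DT", run_date),      # ISO 8601 date or date/time
    ]
    parts = []
    for tag, value in fields:
        if value is None:
            continue
        value = str(value)
        if "\t" in value or "\n" in value:
            raise ValueError(f"{tag} value must not contain tabs or newlines")
        parts.append(f"{tag}:{value}")
    return "@RG\t" + "\t".join(parts)


print(read_group_line("fc1.lane1", "s01", platform="ILLUMINA",
                      description="projA dna wgs", run_date="2017-08-11"))
```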

ADD COMMENT • link 5.9 years ago by Pierre Lindenbaum 166k

0

I have tried that route, but got stuck at the downstream data processing. Meaning: unless one implements the entire Broad pipeline, I still have to parse, say, the uBAM's RG info myself. Also, being on a slow network, I tend to shrink the data with clumpify from BBMap and pigz/pbzip2; I need to check whether uBAMs are of comparable size. Last but not least: the renaming, if done right, permits a brain-dead mv -i pattern.fq.gz destination/

That takes less mental energy than moving files according to this or that file list or RG group. Vanilla users can do it and check that things are going OK.
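
For reference, the RG parsing mentioned above is roughly this much work. The sketch assumes the header has already been dumped as text (samtools view -H prints it); the function name is hypothetical.

```python
def parse_read_groups(header_text):
    """Return one dict per @RG line, keyed by the two-letter SAM tag."""
    groups = []
    for line in header_text.splitlines():
        if not line.startswith("@RG\t"):
            continue
        fields = line.split("\t")[1:]
        # Split on the first colon only: values such as DT may contain colons.
        groups.append(dict(f.split(":", 1) for f in fields if ":" in f))
    return groups


header = (
    "@HD\tVN:1.6\tSO:unsorted\n"
    "@RG\tID:fc1.lane1\tSM:s01\tDT:2017-08-11T10:00:00\n"
    "@RG\tID:fc1.lane2\tSM:s02\tDS:projA dna wgs\n"
)
for rg in parse_read_groups(header):
    print(rg["ID"], rg["SM"])
```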

Btw, what is the proper way to use bwa/STAR with uBAMs as input?

ADD REPLY • link 5.9 years ago by Darked89 4.7k