5.8 years ago
James Ashmore
★
3.5k
Given a directory containing multiple FASTQ files, I would like to retrieve the run information from each file (e.g. flowcell, lane number, index, etc.) and export the data to a CSV file. Before I write a script to do this myself, is anyone aware of any software that does this already?
Writing a one-liner that does this should take less time than it did to write your post. This is homework, isn't it?
Edit: considering your rep, maybe it isn't.
This isn't homework. There seems to be a surprising number of edge-cases depending on which version of CASAVA was used to convert from BCL to FASTQ format.
How is the required information encoded in the FASTQ headers, and is it encoded there at all? I am not sure there is a standard format for FASTQ headers. The information might be available as metadata from your sequencing provider, or it might be encoded in the file name. If you could provide an example of your filenames and headers, someone might be able to help you with a quick sed|grep|awk script.
While there isn't really a standard, read names from Illumina machines have a more or less common format as long as the provider hasn't changed them. I guess it would be useful to extract things like the machine model and ID and the flow cell ID, then match those up to databases of what the strings mean (i.e. identify file one as coming from a HiSeq 2500 and file two as coming from a NovaSeq, etc.).
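To make the "more or less common format" concrete, here is a rough sketch that pulls the run fields out of a single read header. It assumes two layouts only: the CASAVA 1.8+ style (`instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`) and the older style (`instrument:lane:tile:x:y#index/read`). The function name `parse_illumina_header` is made up for illustration, and headers rewritten by a provider will not match.

```shell
# Hypothetical helper: parse one Illumina read header into CSV:
# instrument,run,flowcell,lane,index
# Fields that the older layout does not carry are left empty.
parse_illumina_header() {
    printf '%s\n' "$1" | awk '
    {
        sub(/^@/, "", $0)                 # drop the leading "@"
        if (index($0, " ") > 0) {
            # Assumed CASAVA 1.8+ layout:
            # instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
            split($0, halves, " ")
            split(halves[1], a, ":")
            n = split(halves[2], b, ":")
            print a[1] "," a[2] "," a[3] "," a[4] "," b[n]
        } else {
            # Assumed older layout: instrument:lane:tile:x:y#index/read
            split($0, a, ":")
            idx = ""
            if (match($0, "#[^/]*")) idx = substr($0, RSTART + 1, RLENGTH - 1)
            print a[1] ",,," a[2] "," idx
        }
    }'
}
```

Calling `parse_illumina_header '@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG'` should give one CSV row with the instrument, run number, flowcell, lane and index. Mapping the instrument ID to a machine model would still need a separate lookup table, which this sketch does not attempt.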
Sounds like it might be quite useful, but I've not seen a tool that does it before.
In my case I want to create read group information as explained by GATK by automating the extraction of run information and creating the read groups for each run/sample.
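For the read-group use case, something along these lines might be a starting point. It is only a sketch: it assumes CASAVA 1.8+ style headers, reads just the first header of the file, and follows the common convention of `ID` = flowcell.lane and `PU` = flowcell.lane.barcode. The sample name is not in the header, so it is passed in by hand, and the library (`LB`) tag is left out because it cannot be derived from the read name. The function name `make_read_group` is hypothetical.

```shell
# Hypothetical helper: build a tab-separated @RG line from the first
# header of a FASTQ file.
# Usage: make_read_group <fastq> <sample_name>
make_read_group() {
    head -n 1 "$1" | awk -v sample="$2" '
    {
        sub(/^@/, "", $1)                  # $1 = instrument:run:flowcell:lane:tile:x:y
        split($1, a, ":")
        n = split($2, b, ":")              # $2 = read:filtered:control:index
        flowcell = a[3]; lane = a[4]; barcode = b[n]
        # Fields are separated by real tab characters, as in a SAM header
        printf "@RG\tID:%s.%s\tPU:%s.%s.%s\tSM:%s\tPL:ILLUMINA\n", flowcell, lane, flowcell, lane, barcode, sample
    }'
}
```

The output uses real tabs; some aligners expect the tabs written as an escaped `\t` on the command line instead, so the string may need converting before it is passed on.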
I think Illumina has a standardized format
So basically it's something like this (maybe not exactly; I don't know whether one file can contain more than one lane, tile, index, or whatever):
find /some/place -maxdepth 1 -name "*.fq" | xargs -I {} awk -v n="{}" 'BEGIN{FS=":";OFS=","}NR==1{print n,$2,$3..}' {}
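A slightly fuller version of the one-liner above, written out as a function, could look like the sketch below. It assumes uncompressed files ending in `.fq` or `.fastq`, CASAVA 1.8+ style headers, and that the first read is representative of the whole file (which will not hold if lanes were merged). The function name `fastq_run_info` and the column names are invented for the example.

```shell
# Hypothetical helper: write one CSV row per FASTQ file in a directory,
# using only the first header line of each file.
# Usage: fastq_run_info <directory>
fastq_run_info() {
    echo "file,instrument,run,flowcell,lane,index"
    find "$1" -maxdepth 1 -type f \( -name '*.fq' -o -name '*.fastq' \) | sort |
    while IFS= read -r f; do
        head -n 1 "$f" | awk -v file="$f" '
        {
            sub(/^@/, "", $1)              # $1 = instrument:run:flowcell:lane:tile:x:y
            split($1, a, ":")
            n = split($2, b, ":")          # $2 = read:filtered:control:index
            print file "," a[1] "," a[2] "," a[3] "," a[4] "," b[n]
        }'
    done
}
```

Running `fastq_run_info /some/place > run_info.csv` would then collect everything into one file. Using `head -n 1` keeps awk from reading each FASTQ file to the end just to print its first record.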