Moving from previous versions of Illumina's CASAVA pipeline to CASAVA 1.8, there are a number of changes in terms of input, output and file structure.
While some of these new 'features' are meant to make life easier for post-pipeline analysis (like the new project / sample directory structure), other less-heralded changes look like they were done, in my opinion, to make life easier for Illumina and not Illumina's customers.
My question is for those groups that have moved up to CASAVA 1.8 (or are looking at doing so): what kind of post-processing do you do to the data files generated by this pipeline? Have you found any 'gotchas' in terms of changes new to 1.8? Are you automating the running of this pipeline and post-run functionality?
From a previous BioStar question, and from what I've heard, most people don't just use the stock CASAVA pipeline for all their analysis. But I have not seen much about how people are wrangling the pipeline's awkward and ever-changing folder structure and output formats.
Specifically, from what I've seen, there are a few new changes in 1.8 that demand some post-run attention:
- The fastq files for a specific lane / index are broken up based on file size and so must be concatenated back together.
- As the directory structure for how these fastq files are organized is now provided by the user, care needs to be taken to ensure you are concatenating the right files (in the right order?).
- The reads that don't pass filter are still present in the unaligned fastq files.
- As I understand it, this was not the case previously - and so post-run analysis now needs to be aware of and deal with these bad reads, or they need to be removed before post-run analysis begins.
- The CASAVA documentation suggests using zcat and grep to exclude filtered reads. Here's the bash script they suggest:

    cd /path/to/project/sample
    mkdir filtered
    for fastq in *.fastq.gz ; do
        zcat $fastq | grep -A 4 '^@.* [^:]*:N:[^:]*:' > filtered/$fastq
    done
But this will produce invalid output (at least with our version of grep). Some problems with this code include:

- -A prints additional trailing lines, so you want 3 trailing lines in addition to the first matched line (a 4-line fastq record), not 4.
- -A also adds a "--" separator between non-contiguous groups of matches, so those lines need to be removed.
- You are redirecting the output to a file named [something].fastq.gz - but because you just expanded it with zcat it is no longer a .gz file! This causes tools that use the file name to determine operations (like fastqc) to fail.
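For what it's worth, here is a sketch of how we work around those three problems (the trailing-line count, the "--" separators, and the stale .gz suffix). The function name and directory layout are mine, not CASAVA's, and this assumes GNU grep and standard 4-line fastq records:

```shell
filter_pf_reads() {
    # Run inside a sample directory containing *.fastq.gz files.
    # Keeps only reads flagged ':N:' (passed filter) in the 1.8 header,
    # prints the 3 lines *after* each matched header (4 lines total),
    # drops grep's '--' group separators, and re-gzips the result so
    # the .fastq.gz suffix stays truthful.
    mkdir -p filtered
    for fastq in *.fastq.gz ; do
        zcat "$fastq" \
            | grep -A 3 '^@.* [^:]*:N:[^:]*:' \
            | grep -v '^--$' \
            | gzip > "filtered/$fastq"
    done
}
```

Recent GNU grep also has a --no-group-separator option that avoids the second grep, but it is not available everywhere.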
I highlight this change to show the errors that can be introduced with this kind of post-processing, which is one of the reasons I'm asking if others are running into these kinds of issues.
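One cheap sanity check against exactly this kind of error is to count the pass-filter and filtered reads before and after a cleanup step and make sure the numbers add up. A sketch, relying on the Y/N is_filtered field in the 1.8 read header (the function name is mine):

```shell
count_reads() {
    # Prints "<pass-filter> <filtered>" counts for one fastq.gz file.
    # In 1.8 headers the second whitespace field is
    # "read:is_filtered:control:index", e.g. "1:N:0:ACAGTG".
    zcat "$1" | awk 'NR % 4 == 1 {
        split($2, f, ":")
        if (f[2] == "N") pass++; else filt++
    } END { print pass + 0, filt + 0 }'
}
```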
So what are people doing about this? Are proprietary LIMS systems already on the ball and all these kinks are worked out? Are these changes too trivial to be mentioned publicly? Is no one else moving to CASAVA 1.8 yet?
Thanks
Jim
+1, I was just about to ask a similar question ;-)
As an aside, here is a broad overview of the current automated pipeline we have with 1.8:

1) Create config.txt and SampleSheet.csv from LIMS data
2) Convert raw unaligned reads to Fastq format and demultiplex lanes
3) Perform alignment to reference genome with ELAND
4) Aggregate and rename unaligned reads
5) Remove reads that do not pass filter
6) Analyze unaligned reads using fastqc
7) Aggregate and rename export files
8) Distribute data to project directories
9) Distribute stats and quality control analysis to permanent storage
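For step 4, the chunk aggregation boils down to concatenating the numbered pieces in order. Because the chunk numbers are zero-padded (_001, _002, ...), the shell's lexicographic glob expansion already yields numeric order, and gzip files can be concatenated directly. A sketch (sample naming is illustrative):

```shell
concat_chunks() {
    # $1 is the common prefix, e.g. SampleY_ACAGTG_L001_R1. The glob
    # expands in lexicographic (= numeric, thanks to zero-padding) order,
    # and concatenated gzip members decompress as one continuous stream.
    cat "$1"_[0-9][0-9][0-9].fastq.gz > "$1.fastq.gz"
}
```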