I am wondering what the most widely used method/program/tool is to automate launching bcl2fastq as soon as a run finishes on the Illumina machine. I am about to start writing my own custom shell script for this, but before I do that I want to know whether ready-made solutions are already available.
The run folder nomenclature is as follows:
My idea would be like this:
Keep looking for a folder containing the "Instrument Serial No." in its name.
Every 15 minutes:
    Look inside the run log file for something which says "sequencing complete".
    If YES:
        Launch bcl2fastq
    Else:
        PASS
If someone has a better idea which could be implemented, please do let me know.
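The polling idea in this question could be sketched roughly as below. This is only an illustration under assumptions: the function names, the log file name RunLog.txt, the exact "sequencing complete" wording and the .demux_started marker are all hypothetical placeholders, since the real log name and message depend on the instrument.

```shell
# Sketch of the polling idea. All file names below are hypothetical placeholders.

# Print run folders directly under $1 whose name contains the instrument serial $2.
find_run_dirs() {
    local base=$1 serial=$2
    find "$base" -mindepth 1 -maxdepth 1 -type d -name "*${serial}*" | sort
}

# Succeed if the (hypothetical) log file $1 mentions that sequencing is complete.
log_says_complete() {
    grep -qi 'sequencing complete' "$1" 2>/dev/null
}

# Check every 15 minutes and call the given command once per finished run folder.
# Usage: poll_runs /path/to/output SERIAL some_demux_command
poll_runs() {
    local base=$1 serial=$2
    shift 2
    while true; do
        find_run_dirs "$base" "$serial" | while IFS= read -r run; do
            # .demux_started records that this run was already handed off.
            if log_says_complete "$run/RunLog.txt" && [ ! -e "$run/.demux_started" ]; then
                touch "$run/.demux_started"
                "$@" "$run"
            fi
        done
        sleep 900
    done
}
```

The marker file is there so that a finished run is not launched again on the next 15-minute pass.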
Coming back to my own question with an answer. After careful consideration and talking to the Illumina people, it turned out that the creation of CopyComplete.txt (NovaSeq) is a good trigger to start bcl2fastq. Here is a bit more information on the relevant files:
SequenceComplete.txt indicates that the sequencing run has finished.
CopyComplete.txt is created by the Universal Copy Service (UCS) when all files have been copied to their destinations and run completion signal has been triggered.
The RTAComplete.txt file indicates that the images generated by the system have been converted to base calls. The base calls are stored as .bcl files, which can then be used as input for bcl2fastq to produce the fastq files.
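As a rough illustration of using CopyComplete.txt as the trigger, a sketch could look like the one below. The function names and the 10-minute polling interval are my own assumptions, not anything prescribed by Illumina.

```shell
# Succeed only if every marker file listed after the run folder exists in it.
# Usage: markers_present /path/to/run CopyComplete.txt [more markers...]
markers_present() {
    local run=$1 marker
    shift
    for marker in "$@"; do
        [ -f "$run/$marker" ] || return 1
    done
    return 0
}

# Block until CopyComplete.txt appears, then start bcl2fastq on the run folder.
# The 600-second polling interval is an arbitrary choice.
wait_and_demux() {
    local run=$1
    until markers_present "$run" CopyComplete.txt; do
        sleep 600
    done
    bcl2fastq -R "$run"
}
```

Passing several marker names to markers_present (for example RTAComplete.txt and CopyComplete.txt) makes the check stricter if that suits your setup.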
I wrote a more generalized shell script back in the day to do the same. It just runs stat on the run directory, which displays the status of a file or filesystem. I used stat rather than looking for a file because locally we transfer our data from the sequencer to an HPC filesystem, and sometimes files like RTAComplete.txt would show up before all the files were done being copied over. The script was generalized as follows:
#!/bin/bash
# Usage: ./dir_checker.sh /path/to/run/ [extra bcl2fastq options]
RUN_PATH=${1%/}; shift      # strip a trailing slash; remaining args go to bcl2fastq
cd "$RUN_PATH" || exit 1
RUN_PATH=$PWD               # make the path absolute so it still resolves after the cd
OLD_STAT="initial"
while true; do
    NEW_STAT=$(stat -t "$RUN_PATH")
    if [ "$OLD_STAT" != "$NEW_STAT" ]; then
        echo 'Directory is still updating'
        sleep 2h
        OLD_STAT=$NEW_STAT
        echo 'Checking again.'
    else
        echo 'Directory is done updating. Move on.'
        break
    fi
done
echo 'The Loop Has Been Left.'
bcl2fastq -R "$RUN_PATH" -r 6 -w 6 -p 8 "$@"
It runs a status check on the run directory and stores the results, then checks every 2 hours. If the stat does not change, that means the directory is done updating and you can kick off bcl2fastq. Otherwise, you stay in the while loop.
To kick it off you just do ./dir_checker.sh /path/to/run/ and you may want to pass it to the background because it will occupy a terminal until it is done.
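One caveat worth keeping in mind: the metadata of a directory only changes when entries are created, removed or renamed directly inside it, so files still being written deeper in the tree will not show up in stat on the top-level run folder. A possible variant, sketched here under the assumption that GNU find and md5sum are available, fingerprints every file under the run folder instead. The function names are made up for this example.

```shell
# Hash the path, size and modification time of every file under $1.
# Requires GNU find (-printf) and md5sum.
tree_fingerprint() {
    find "$1" -type f -printf '%p %s %T@\n' | LC_ALL=C sort | md5sum | cut -d' ' -f1
}

# Return once two consecutive fingerprints, taken $2 seconds apart, are identical.
# The default interval of 7200 seconds matches the 2-hour sleep in the script above.
wait_until_quiet() {
    local dir=$1 interval=${2:-7200} old new
    old=$(tree_fingerprint "$dir")
    while true; do
        sleep "$interval"
        new=$(tree_fingerprint "$dir")
        [ "$new" = "$old" ] && return 0
        old=$new
    done
}
```

Walking the whole tree can be slow on a large run folder, so this trades speed for a stricter check.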
You might also consider rolling a solution using inotify or similar. There are lots of bindings for it now, e.g. https://github.com/dsoprea/PyInotify
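For a shell-only take on the same idea, a sketch using inotifywait from the inotify-tools package might look like this. The function names are hypothetical, and the choice of CopyComplete.txt as the file to react to is an assumption borrowed from the other answers in this thread.

```shell
# Succeed if the path in $1 is the completion marker we are waiting for.
is_completion_marker() {
    [ "$(basename "$1")" = "CopyComplete.txt" ]
}

# Watch the sequencer output root $1 and run the given command on each
# run folder in which the marker appears.
# Usage: watch_runs /path/to/output some_demux_command
watch_runs() {
    local root=$1
    shift
    inotifywait -m -r -q -e create -e moved_to --format '%w%f' "$root" |
    while IFS= read -r path; do
        if is_completion_marker "$path"; then
            "$@" "$(dirname "$path")"
        fi
    done
}
```

Two things to check before relying on this: inotify does not report changes made by other machines on a network filesystem such as NFS, and a recursive watch on run folders with many subdirectories can use a lot of watches.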
Look for one of these files in the FC folder: RTAComplete.txt (HiSeq/MiSeq), SequenceComplete.txt, RTAComplete.txt/CopyComplete.txt (NovaSeq). They signify completion of the sequencing. That would be your signal to start processing/copying.
With MiSeq, SampleSheet.csv would have been provided at run start, so it should be available there. With other sequencers you will need to inject the right SampleSheet.csv into the folder, or source it from your LIMS/other location, to start the analysis.
Note: Some sequencers continue to write data to other directories (HiSeq 4000, possibly NovaSeq) even when these files are seen. So to be safe, add another hour before you start copying the data out or analyzing it.
CopyComplete.txt and SequenceComplete.txt files are generated on the external location that we have provided but not on the local hard disk installed in the sequencer. Is that expected? Both locations have the RTAComplete.txt files though.
If you are going to work from the external storage location then yes. We also do something similar.
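The extra hour of safety margin and the sample sheet requirement could be checked along these lines. This is a sketch only: the function names are made up, RTAComplete.txt is used as the example marker, and it assumes GNU stat and date.

```shell
# Succeed if file $1 exists and was last modified at least $2 minutes ago (default 60).
# Uses GNU stat: -c %Y prints the modification time as seconds since the epoch.
older_than() {
    local file=$1 minutes=${2:-60} age
    [ -f "$file" ] || return 1
    age=$(( $(date +%s) - $(stat -c %Y "$file") ))
    [ "$age" -ge $(( minutes * 60 )) ]
}

# Succeed if run folder $1 has an RTAComplete.txt that is at least an hour old
# and a non-empty SampleSheet.csv.
ready_for_analysis() {
    local run=$1
    older_than "$run/RTAComplete.txt" 60 && [ -s "$run/SampleSheet.csv" ]
}
```

A cron job or polling loop could call ready_for_analysis on each run folder and only start copying or demultiplexing when it succeeds.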
I am in the same situation, and was thinking about a similar solution. So curious about the replies that you get on this post.
It is good to know how you could automate running bcl2fastq, but I have encountered a few reasons why you may still need to run some base calling via the command line:
1) You have multiple library types (such as single-barcode, dual-barcode, 10X samples, custom UMI libraries, etc.). This means you may have to run bcl2fastq more than once, and/or run cellranger mkfastq instead of bcl2fastq.
2) Your original barcode information is not correct. With mixed library types, I think there is a decent chance that I have had to change a barcode after an initial step of base calling with bcl2fastq (before a second step to return user results). However, maybe this can vary between individuals. This tends to happen less often if you don't mix barcode types (such as a rapid run), but I think it can also happen more often if you have >50-100 samples of a given type in a run.
For example, I have withdrawn at least 1 record from the SRA because it was actually a mix of samples from different labs (one sample had the wrong barcode, and was mixed into the sample that had the right barcode).
3) You might realize that you need to change some base calling parameters, or you might prefer to use non-default parameters (such as not allowing any barcode mismatches).
It is not directly relevant to the automation question, but I have a discussion about possible QC flags (while the solution / explanation can vary, this might indicate a need to slow down and process fewer samples more carefully, but I am mostly putting some ideas out there to discuss):
Calling Single-Barcode Samples from Mixed Runs as Dual-Barcode Samples | Possible Illumina Run QC Flags?
This is where I mention that I use a non-default setting of allowing 0 barcode mismatches. However, to be clear, I am not advocating allowing more mismatches (or changing parameters to artificially increase the number of reads provided for a sample) - in contrast, I am trying to better understand when runs and/or lanes need to be thrown out due to quality concerns.
@Charles: This is not an answer to the original question. You should consider moving this to a comment on the original post.
Edit: Adding some more thoughts.
While you bring up valid exceptions, it would be reasonable to expect that someone trying to automate bcl2fastq runs will have back-end infrastructure (e.g. a LIMS) that is used to track samples and orchestrate the management and analysis.
We do run a similar system and yes, there are errors at times, but they can generally be dealt with after the automated analysis runs. Exceptions like cellranger demux runs could also be programmed, if you do enough of them to warrant the additional work that would be needed to account for them.
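To illustrate how such exceptions might be programmed, here is a minimal sketch that picks the demultiplexing tool per sample sheet. The library type label "10x", the function name and the DRY_RUN switch are hypothetical, and the options shown are the ones I understand bcl2fastq and cellranger mkfastq to accept, so check them against your installed versions.

```shell
# Demultiplex one sample sheet with the tool that fits its library type.
# $1 = run folder, $2 = sample sheet, $3 = output dir, $4 = library type label
# Set DRY_RUN=1 to print the command instead of running it.
demux() {
    local run=$1 sheet=$2 out=$3 type=$4
    case $type in
        10x)
            ${DRY_RUN:+echo} cellranger mkfastq --id="$(basename "$out")" \
                --run="$run" --samplesheet="$sheet" --output-dir="$out"
            ;;
        *)
            # 0 barcode mismatches is the non-default setting mentioned above.
            ${DRY_RUN:+echo} bcl2fastq --runfolder-dir "$run" --sample-sheet "$sheet" \
                --output-dir "$out" --barcode-mismatches 0
            ;;
    esac
}
```

A mixed run would then be handled by calling demux once per sample sheet, with the type for each sheet coming from the LIMS or a small lookup file.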
Thank you for the suggestion - I have accordingly converted the answer to a comment.