Hi everyone, and thanks in advance for your kind support.
I am a biologist (a geneticist, more specifically) working in a molecular diagnostics lab, and I have been tasked with building a bioinformatics pipeline for an Illumina TruSeq Custom Amplicon panel designed for preconception carrier screening.
The story started a year ago, when I was a complete newbie with no background in bioinformatics. I've gained a little experience since then, I guess, but I haven't had much time to devote to the task. Only about 3 months ago did building the pipeline become my main work - until then, I was mainly working in a wet lab, trying to learn bioinformatics in between experiments and in my free time.
By now, I've learned the bare minimum of Linux and Python (not to mention Galaxy), and have dived into the (often contradictory) recommendations from learning resources, forums, benchmarking articles and guidelines on building the pipeline and interpreting results. I have an unfinished draft of the analysis pipeline that needs fine-tuning, plus a few steps still to be added using software I have not yet tested.
But it's hard for me to assess my own progress - I'm a self-learner with no tutor or fellow learners to compare myself against. I'd like to hear the stories of other self-learners: how long did it take you to build a working pipeline, where did you seek help, how did you organize your everyday work (and how did you manage not to give up)?
Any thoughts are welcome.
Since you have mentioned "molecular diagnostics" in your post, there are special considerations you should pay particular attention to (especially if this is for human samples). You may have a "linear pipeline" where a set of sequence files goes in at one end and a VCF file, or some other defined result, comes out at the other.
You would need to pay attention to strict validation of the components of the pipeline (use a "modules" system for software version management if you are on a shared compute system). Ensure that you have controls built in that guarantee reproducible results each time the pipeline is run. This is an important consideration if the operating system or programs (e.g. Java) are automatically updated by systems administrators outside your control. SOP documentation will be needed so that anyone else using your pipeline understands how it works (and can actually make it work in your absence). Appropriate logging of what actually happened should not be overlooked, since you will need that information on record. If a step generates an error/warning it needs to be flagged so a human (you) can take a look at it; you may want to have the pipeline stop should that happen.
I am not sure if you have thought about these things while building the pipeline. These would be essential to consider for long term stability/regulatory compliance.
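To give an idea of what I mean by logging and stopping on errors, here is a rough sketch in Python of a step wrapper. The samtools commands, file names and log directory are just placeholders, not your actual pipeline:

```
#!/usr/bin/env python3
"""Sketch of a pipeline step wrapper: run a command, keep its output
on record, and stop the pipeline if the step fails.
Commands and file names are placeholders."""

import subprocess
import sys
from datetime import datetime
from pathlib import Path

LOG_DIR = Path("logs")   # in practice, one directory per run is better
LOG_DIR.mkdir(exist_ok=True)

def run_step(name, cmd):
    """Run one pipeline step, save stdout/stderr, halt on any error."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    out_file = LOG_DIR / f"{name}_{stamp}.stdout"
    err_file = LOG_DIR / f"{name}_{stamp}.stderr"

    result = subprocess.run(cmd, capture_output=True, text=True)
    out_file.write_text(result.stdout)
    err_file.write_text(result.stderr)

    if result.returncode != 0:
        # Flag the failure and stop so a human can inspect the logs.
        print(f"[{name}] FAILED (exit {result.returncode}); see {err_file}",
              file=sys.stderr)
        sys.exit(1)
    print(f"[{name}] finished OK")

if __name__ == "__main__":
    # Placeholder steps; substitute your real commands.
    run_step("sort_bam", ["samtools", "sort", "-o", "sample.sorted.bam", "sample.bam"])
    run_step("index_bam", ["samtools", "index", "sample.sorted.bam"])
```

The point is only that every step leaves a record and the run halts at the first problem instead of silently carrying on.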
Thank you for your time! Thanks for pointing me to Environment Modules; I'll try to figure out how it works. I suppose my systems administrator could help with logging. I have a naive question: what are the built-in controls and how do I use them?
Your system admin will need to set the modules up. Any pipeline you set up may need to keep using the defined versions of the software packages it was validated with (again, this is only critical if you are doing human diagnostics), since changing anything in the pipeline may require re-validation, updates to SOP documentation - you get the idea.
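As a sketch of what a version check before each run could look like (the tool names and version numbers below are only examples, not the versions you validated with):

```
"""Sketch: verify that the tools on the PATH match the versions the
pipeline was validated with, before any samples are processed.
Tools and version strings are examples only."""

import subprocess
import sys

# Versions the pipeline was validated with (example values).
VALIDATED = {
    "samtools": "1.9",
    "bcftools": "1.9",
}

def reported_version(tool):
    """Return the first line of `tool --version` for comparison."""
    try:
        out = subprocess.run([tool, "--version"], capture_output=True, text=True)
    except FileNotFoundError:
        return ""
    return out.stdout.splitlines()[0] if out.stdout else ""

def check_versions():
    problems = []
    for tool, expected in VALIDATED.items():
        line = reported_version(tool)
        if expected not in line:
            problems.append(f"{tool}: expected {expected}, found '{line}'")
    if problems:
        sys.exit("Version check failed:\n" + "\n".join(problems))
    print("All tool versions match the validated set.")

if __name__ == "__main__":
    check_versions()
```

Writing the reported versions into the run log as well gives you a record of exactly what was used for each run.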
You can log data at multiple levels. Some of your programs may produce their own logs, and you can capture the stdout/stderr streams to files yourself (if you are not using a job scheduler that already does this for you).
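If you script the pipeline in Python, the standard logging module can also keep a per-run log file alongside the on-screen output; a minimal sketch (the file name is a placeholder):

```
"""Sketch of run-level logging: messages go both to the console and to
a per-run log file, so there is a record of what actually happened."""

import logging
from datetime import datetime

run_id = datetime.now().strftime("%Y%m%d_%H%M%S")

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[
        logging.FileHandler(f"pipeline_{run_id}.log"),  # on-disk record of the run
        logging.StreamHandler(),                        # also show progress on screen
    ],
)

log = logging.getLogger("pipeline")
log.info("Pipeline started, run id %s", run_id)
log.warning("Example of a warning that a human should review")
```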
You can standardize on a set of control/test samples that should produce identical (expected) results each time you make any change to the underlying pipeline. If the wet-bench people run internal controls alongside the samples, those could be used instead.
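A very simple regression check against a control sample could look like this in Python (file names are placeholders, and the exact-match comparison on CHROM/POS/REF/ALT is deliberately naive):

```
"""Sketch: after changing anything in the pipeline, re-run the control
sample and check that the called variants match the expected set.
File names are placeholders; the VCF is assumed to be plain text."""

import sys

def load_variants_from_vcf(path):
    """Collect (CHROM, POS, REF, ALT) tuples from a plain-text VCF."""
    variants = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
            variants.add((chrom, pos, ref, alt))
    return variants

def load_expected(path):
    """Expected control variants: CHROM, POS, REF, ALT per tab-separated line."""
    expected = set()
    with open(path) as handle:
        for line in handle:
            if line.strip() and not line.startswith("#"):
                chrom, pos, ref, alt = line.rstrip("\n").split("\t")[:4]
                expected.add((chrom, pos, ref, alt))
    return expected

if __name__ == "__main__":
    called = load_variants_from_vcf("control_sample.vcf")
    expected = load_expected("expected_controls.tsv")
    missing = expected - called
    extra = called - expected
    if missing or extra:
        sys.exit(f"Control check FAILED: {len(missing)} expected variants missing, "
                 f"{len(extra)} unexpected variants called.")
    print("Control sample reproduced the expected variants.")
```

Anything more sophisticated (genotype comparison, normalization of indel representation) can be layered on later; the key is that the check runs automatically and fails loudly.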