Preface: I have only been working as a computational biologist for the last 5 months. Prior to this I was a Master's student in a Biology program, and my thesis work was ecology-based, looking at infection rates of a disease vector, so take everything I say with a grain of salt. Everything here is based on things I wish I had known or done before starting my current position.
Introduction
I frequent /r/bioinformatics and Biostars quite often, and one of the most common types of question is some variation of 'How do I get started with Bioinformatics / Computational Biology?' The answer is not as simple as the question. Bioinformatics is a broad term that encompasses several fields, from genomics to algorithm development and software creation. There is no step-by-step guide that prepares you for all fields of bioinformatics. It is therefore important to understand that this 'guide' (which some might see more as a blog) is a highly focused piece based on my experience moving from a completely wet-lab background to an NGS-heavy microbiology lab.
What is bioinformatics?
Bioinformatics is defined as "the science of collecting and analyzing complex biological data such as genetic codes". Wikipedia defines bioinformatics as an 'interdisciplinary field that develops methods and software tools for understanding biological data' that combines 'computer science, statistics, mathematics and engineering' in order to interpret biological data.
Loosely, bioinformaticians can be split into two groups of people: those who use command line tools and statistics properly to interpret data, and those who build the command line tools and incorporate the appropriate algorithms so that others can analyze their data.
I can't be a 'Bioinformatician', I don't know how to program!
It is very likely that the first thing you think of when you think about the field is 'programming', and that knowing how to program is a requirement for entering it. But this is NOT true! While many bioinformaticians and computational biologists know programming to a degree, it is not necessary to know how to program to succeed in Bioinformatics.
However, I encourage everyone looking at getting into this field to accept that it is probably best to learn some basic programming. There are still many analyses that have no ready-made tools, and knowing how to automate repetitive tasks is beyond helpful.
How do you know that 'Bioinformatics' is for you?
This isn't something I can answer for you. Often I'll see a question like this where the OP gives a brief history of their education and a short description of what they would like to study, then hopes that someone will tell them 'yes, you should study bioinformatics' or 'you would definitely enjoy bioinformatics'.
The truth is that Bioinformatics is constantly evolving and growing. If you enjoy solving problems, and you don't mind sitting in front of a computer 8 hours a day then you've come to the right place.
I am a wet-lab scientist / undergraduate student / Ph.D student ... What should I work on to get started in Bioinformatics?
Questions like this are easy to answer now. When I first started ... I was thrown into the deep end. I came into the lab on day one and was tasked with re-creating the data from a paper. I had no idea what ChIP-seq, RNA-seq, or even RStudio was. So the first skill you should work on before becoming a bioinformatician? Patience. You will be learning a lot of stuff on your own ... 90% of what you learn will be self-taught if you aren't part of a lab that has an experienced computational biologist around to help.
You will constantly be tasked with problems that you do not know how to handle or solve. Everything that you do will require reading literature and detailed documentation, and constant attention. You must learn to be patient ... with yourself and with everyone else. It's very unlikely that you will learn everything you need to in a couple of months, much less a week. So you must learn to accept the failures and delays without becoming annoyed or anxious.
So what COMPUTER SKILLZ should you work on before starting to work in the field? Again the following is completely based on my current genomics research that heavily relies on NGS data.
Learn the Unix command line. You will be using basic commands such as cd, cut, and sort ALL THE TIME, so it's important to master the terminal. Take a look at some of these guides: Link, Link, Link, Link.
Learn how to Bash script. When I first started analyzing datasets, I was fine with having to manually input all the file names and parameters into their respective programs until my data was processed. After a month or two of this, it became a chore, and learning how to automate trivial things such as bedgraph generation using Bedtools became really important to my mental well-being. Remember: we want our work to be high quality, reproducible, robust, and as lazy as it can be. Here are some guides for learning basic Bash scripting: Link, Link, Link, Link.
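As a rough sketch of that kind of automation, here it is in Python rather than Bash (the idea is the same in either): loop over the input files and build one command per sample instead of typing each one by hand. The sample file names and the bedgraphs output folder are made up for illustration, the BAM files are assumed to be coordinate-sorted already, and the function only builds the bedtools genomecov command lines without running anything.

```python
from pathlib import PurePosixPath
import shlex


def bedgraph_commands(bam_files, out_dir="bedgraphs"):
    """Build one `bedtools genomecov` shell command per BAM file (dry run only)."""
    commands = []
    for bam in sorted(bam_files):
        bam_path = PurePosixPath(bam)
        # sample1.bam -> bedgraphs/sample1.bedgraph (output folder is an assumption)
        out_path = PurePosixPath(out_dir) / (bam_path.stem + ".bedgraph")
        # -ibam reads a BAM file; -bg reports coverage in bedGraph format.
        command = "bedtools genomecov -ibam {} -bg > {}".format(
            shlex.quote(str(bam_path)), shlex.quote(str(out_path))
        )
        commands.append(command)
    return commands


if __name__ == "__main__":
    # Hypothetical sample names, used only to show the generated commands.
    for line in bedgraph_commands(["sample2.bam", "sample1.bam"]):
        print(line)
```

Printing the commands and checking them before actually running anything is a cheap way to catch a wrong file name.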
Learn a bit of GNU Make. This is one of those things I wish I had learned earlier for genomics data. I have hundreds of samples, and I would run each of them by hand (not even in parallel) until I learned about pipeline creation. GNU Make isn't the best or easiest tool for this, but many other pipeline tools borrow its core ideas in one form or another. It's easier to learn Latin if you know Spanish. Once you get a bit of GNU Make under your belt, look into more sophisticated pipeline software such as Snakemake or Ruffus.
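To give a feel for the core idea behind Make, here is a minimal Python sketch of the check it applies to every target: rebuild an output only if it is missing or older than one of its inputs. This is not how Make itself is implemented, just the timestamp rule it is known for. The file names are hypothetical, and the timestamps are set by hand so the demo behaves the same on every run.

```python
import os
import tempfile


def needs_rebuild(target, dependencies):
    """Return True if `target` is missing or older than any of its dependencies."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


def demo():
    """Walk through the three cases using throwaway files in a temp folder."""
    with tempfile.TemporaryDirectory() as tmp:
        bam = os.path.join(tmp, "sample.bam")            # hypothetical input
        bedgraph = os.path.join(tmp, "sample.bedgraph")  # hypothetical output
        open(bam, "w").close()

        # Case 1: the output does not exist yet, so it must be built.
        missing = needs_rebuild(bedgraph, [bam])

        # Case 2: the output is newer than its input, so nothing to do.
        open(bedgraph, "w").close()
        os.utime(bam, (1000, 1000))
        os.utime(bedgraph, (2000, 2000))
        up_to_date = needs_rebuild(bedgraph, [bam])

        # Case 3: the input changed after the output was made, so rebuild.
        os.utime(bam, (3000, 3000))
        stale = needs_rebuild(bedgraph, [bam])

        return missing, up_to_date, stale


if __name__ == "__main__":
    print(demo())
```

With hundreds of samples, a check like this is what lets a pipeline skip the samples that are already done and rerun only the ones whose inputs changed.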
Learn how to use a terminal text editor: Vim / vi / nano / your preference. Gedit and graphical text editors like Atom are fine, but I really dislike having to go back and forth, constantly opening and closing editors. I use Vim. Most of these editors have little quirks, so it's important to get used to them before you use them for work purposes.
Learn a programming language. The actual language you decide to learn is completely up to you. ANY language will do fine for Bioinformatics. The only differences between them are how commonly they are used in Bioinformatics and how easy they are for newcomers to learn. The most widely suggested language for a new Bioinformatician is Python, and for good reason. It is easy to learn, has numerous libraries, and is elegant and powerful. I learned R as a first language through DataCamp, but I am currently learning Python through TeamTreehouse. These websites will not make you a good programmer. They will teach you syntax, common functions, and very basic programming logic. Once you've learned the syntax of a language, I recommend you go on GitHub, pull up the source code for a couple of really well-regarded software packages, and read through it.
Learn R. Why R? R allows the creation of publication-ready images via ggplot2, gplots, etc. And it has a HUGE collection of bioinformatics packages under Bioconductor for almost anything you may need. Need to create heatmaps? Cool, we got a package. Need to annotate peaks? Cool, we got a package. Need to analyze GRO-seq data? Cool, we got a package. Need to 'insert whatever'? Cool, we probably got a package.
Learn to use Git and GitHub. This allows you to keep all your work under version control, all the time. Just do it. Learn to create README files inside project folders that explain what you've done, and why. Also learn to comment all your code. It may seem silly to spend a large amount of time writing down what you've done while you're working on your project. But trust me ... in a few months, when you need to remember what parameters you set for that one tool for a publication, you will be thanking yourself.
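One possible way to make that note-taking habit painless is to generate the README entry from the same parameters you pass to the tool. The sketch below is only an illustration: the tool name, version, and parameter names are hypothetical placeholders, and the entry layout is just one format that happens to be easy to read back later.

```python
from datetime import date


def format_run_record(tool, version, parameters, why, run_date=None):
    """Return a plain-text README entry describing one analysis step."""
    run_date = run_date or date.today().isoformat()
    lines = [
        "{}  {} {}".format(run_date, tool, version),
        "  why: {}".format(why),
    ]
    # Sort the parameter names so entries are easy to compare between runs.
    for name in sorted(parameters):
        lines.append("  {} = {}".format(name, parameters[name]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical tool and parameters, for illustration only.
    entry = format_run_record(
        tool="peak_caller",
        version="1.0",
        parameters={"q_value": 0.01, "genome_size": "hs"},
        why="call peaks for the main figure",
    )
    print(entry)
```

Appending each entry to a README in the project folder and committing it alongside the code keeps the parameters and the reasoning in the same history.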
Conclusion
Most importantly, have fun. Bioinformatics is a field where you never stop learning, so make sure you're constantly reading papers on new tools and methodology, check up on Biostars frequently, and share your knowledge as you gain it.
Some of you may be wondering why the title says this is a guide for 'Computational Biology' but discusses 'Bioinformatics'. I am of the opinion that Computational Biology focuses more on the analysis of data (which is what I do), while Bioinformaticians are generally those who develop the tools. The two terms are probably interchangeable to a degree. As mentioned by someone in the comments:
There is an important distinction between bioinformatics and computational biology. "those who build the command line tools and incorporate the appropriate algorithms so that others can analyze their data." I'd call the latter computational biologists.
There is an important distinction between bioinformatics and computational biology. "those who build the command line tools and incorporate the appropriate algorithms so that others can analyze their data." I'd call the latter computational biologists. Nice post.
Agreed. I have quoted you in the original post to make the distinction more clear than what it was previously.
I don't think there are any strict definitions of what is bioinformatics and what is computational biology, and if anything (unless I have misunderstood your definition), I think those that write the algorithms are the computational biologists. As one example, Steven Salzberg's lab is called Computational Biology and Genomics,