I will be giving introductory courses for Linux, bash, Python, R, a bit of Perl, among others, in the coming semesters. The courses are aimed at graduate students doing genomics or population genetics research projects.
I will start the courses with a general overview of what Linux is about, but then I need to be able to stress why it could be important to learn Linux and programming skills so that the students gain more independence when doing their genomics analyses.
So, I would greatly appreciate your input on the following question:
> Why should someone doing a genomics project ever want to learn Linux?
I have quite a few philosophical, technical and practical justifications in mind, but I would like to know what your opinion is. You can also tell me why you think it is non-essential if this is your opinion.
Unix text tools and editors (vim, emacs, etc.): one often works with text files and always has to peek at them a little (head, tail), mangle them (sort, cut, paste), etc.
Software development tools (gcc, g++, python, perl) are simple to install and use. On Linux they are all installed and configured with one click.
Multiple versions of a program can be installed by the user himself and switched on/off by sourcing some scripts, without administrator rights. On Windows I always had to change the path in a very, very small text field that took about 4 clicks to reach.
A lot of good scientific software is written in a non-portable way for Linux/Unix (almost all short read aligners, samtools). This makes it necessary to use Unix for genomics.
X windows: work on a powerful server and have the GUI on your thin client
+1 for installation/setup: package managers are maybe the reason why Linux is so much easier to use. I don't agree with the bug-ridden bloatware X windows though ;-)
X windows might be bloated, but I read that last bullet as more of a network model/multi-user capability. e.g. since UNIX machines are built with multi-user capability in mind, you can log in to large or small machines, one or many (assuming you have access to them), and accomplish things as needed. I do most of my work remotely through shell windows. I use giant computers even though they're not sitting on my desk. My jobs continue running after I disconnect.
Hi Brent. Toying with huge files is certainly one reason I think Linux is far superior (at least to W!nd0w5). I'll get a few examples of that type set up (data extraction, counting sequences...) to show them right away what POWER is about :P
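A minimal sketch of the "counting sequences" kind of example mentioned above, assuming the data are in FASTA format (each record starts with a header line beginning with `>`). The function name `count_fasta_records` is made up for illustration. Because it reads one line at a time, the same code should work on a file far larger than memory.

```python
from typing import Iterable


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count FASTA records by counting header lines (lines starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


# Tiny in-memory stand-in for a large file. With a real file you would pass
# the open file handle instead, which is consumed one line at a time.
example = [">seq1", "ACGT", "GGCC", ">seq2", "TTAA"]
print(count_fasta_records(example))
```

With a real file this would be called as `count_fasta_records(open_handle)` inside a `with open(...)` block. On the command line, `grep -c '^>'` on the same file does the equivalent count.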
> Why should someone doing a genomics project ever want to learn Linux?
Put simply, using anything else hinders your research and provides competitors using UNIX a distinct advantage. Without question, the best tools available in this field are open source tools that are largely written for POSIX systems. Yes, you can adapt these tools to Windows environments with Cygwin/VMWare, but part of being a scientist is knowing what the best equipment is for the experiment at hand.
because "put in a database the 10 first ordered sequences from the 100 last records about rotavirus at NCBI, but not containing the word VP7 " is as simple as:
The way I would formulate this is that Unix-like systems were designed to operate via action words that can be chained into 'sentences', whereas graphical operating systems like Windows present actions as fixed tasks that are easy to discover (right click shows them all) but cannot be easily shared, repeated, modified or chained into more complex tasks.
Data analysis in general (and bioinformatics in particular) are domains where we need to express our goals in very detailed and nuanced ways, and we need the type of functionality that a GUI-based system lacks.
The statements above apply in general to other GUI vs command line discussions as well.
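The 'words into sentences' idea can be sketched in Python with generators, where each small function does one thing and the output of one feeds the next, much like commands joined by pipes in a shell. This is only an illustration under assumptions: the function names and the sample records are hypothetical, and the input is assumed to be FASTA-style lines.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def keep_headers(lines: Iterable[str]) -> Iterator[str]:
    """'Word' 1: keep only FASTA header lines."""
    return (line for line in lines if line.startswith(">"))


def drop_matching(lines: Iterable[str], word: str) -> Iterator[str]:
    """'Word' 2: drop every line that contains the given word."""
    return (line for line in lines if word not in line)


def first_sorted_headers(lines: Iterable[str], exclude: str, n: int) -> List[str]:
    """The 'sentence': filter headers, drop matches, sort, keep the first n."""
    filtered = drop_matching(keep_headers(lines), exclude)
    return list(islice(sorted(filtered), n))


# Hypothetical sample records, made up for this sketch.
records = [">rota3 VP4", "ACGT", ">rota1 VP7", "GGCC", ">rota2 NSP1", "TTAA"]
print(first_sorted_headers(records, exclude="VP7", n=10))
```

Each step can be swapped, reordered or reused in another 'sentence' without touching the others, which is the property a fixed GUI action lacks.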
Thank you @Istvan. I like the 'words and sentences' analogy. I think I'll incorporate this to make the students understand part of the 'UNIX way'. Cheers
Following on from @Ido Tamir's list, knowing Linux allows a genomicist to:
develop a transferable skill set that sets you apart from a wet-only biologist (having *NIX skills on your CV is an asset in the post-genomic research world)
better understand how computers and operating systems actually work
ability to run bioinformatics resources on your own machine (BLAST, GALAXY, etc)
ability to access ready-made bioinformatics computing environments (e.g. Bio-Linux)
ability to do reproducible research (BASH, R, TAVERNA, etc.)
ability to perform analyses on computer clusters (important for big/long computational jobs)
ability to access cloud computing resources (increasingly important for groups without access to HPC infrastructure)
Many thanks @Casey, I really like your take on this. I'll try to emphasize that there are skills that the courses will bring to them and that can be transferred/applied in their future research career. Cheers
One reason people use Linux is that there is an abundance of programs and libraries for bioinformatics written for Linux, like the EMBOSS suite and BLAST. Linux gives the user complete control over their system, and thus makes it easy to extend existing software for new uses. Another benefit of using Linux is access to Bash, a very powerful command-line shell that can be used to create pipelines of multiple programs and their outputs. However, using Linux is not essential, as any Unix-based system (e.g. OS X) will operate in a similar way.
Thank you, these are indeed important reasons! For OS X and others, I guess learning Linux is an advantage then. The material the students assimilate is going to be directly transferable to their Mac boxes and they will have learned about UNIX/Linux on the way!
Linux-based systems are the operating systems of choice when it comes to remote computing. You can easily give commands via ssh. Remote file systems can be mounted via sshfs/samba/ftp so that they appear as local drives. Forwarding the X server allows you to continue your work from any (Linux) computer.
These points are even more important for distributed computing. Most cluster software is Linux-centred and I believe developing tools for MPI (for instance) is best done in a target environment, namely Linux.
Furthermore, the remote access also works in an offline way. For sure you can remember the last time you gave some advice to your Windows-using family member. The typical desperate telephone hotline: Click here, click there, click on Options - oh, there is no field named Options. What is the last entry? Quit? No, the other one. Configure?... And so on, and so forth. In Linux you would just do a ./whatever -o thathelps
PS: I love all the small tools for Linux which make life so much easier (grep, cat, text processing, file conversions, batch jobs and piping, ...). Starting in Linux was quite hard at the beginning, but very soon my productivity started to be much higher than in Windows.
You may also mention how difficult it is to write high-performance programs for Windows. Surely it can be done, but it is sort of a nightmare, especially for C programmers (C++ is better supported in Windows). In addition, a few years ago, some core library routines, such as memory management, were substandard in comparison to the Linux equivalents. This is why most high-performance programs only work in Linux.
It is interesting how no one points out how much of an active choice it is to decide to learn Linux, compared to trying to stick to the old OS you were born with. There are of course some essential scientific reasons to be using it, and these have already been laid out here. But discovering the open-source world is, for me, one of the most important rewards in learning Linux. Additionally, the open-source model clearly corresponds to the scientific approach: sharing results and methods for others to build on, in order for them to develop their own results and methods.
Discovering this might lead you to:
develop the reproducibility of your experiments (automatic pipelines with all your custom parameters)
contribute to and develop open-source tools for the community
contribute to the spreading of open-source tools in the community
learn how to take the best advantage of your hardware by controlling your OS
I will end by saying that taking the time to learn Linux has been by far one of my best 'career' choices.
Bioinformatics algorithms are often run on server farms ("in the cloud") for high-MIPS processing. Writing applications to run on such servers is easier on a POSIX system.
I would recommend your students use OS/X because :
it is the world's most popular end-user Unix system and has the highest ease of use especially for use by non-computer scientists.
OS/X is written on BSD, the most recognized unix kernel in professional server environments.
Because OS/X is a commercial mass-market unix, the most frequent problem of open source Unix (Linux or BSD) is avoided: the system hardware is 100% supported by the software without any device driver problems.
more of their bio peers will be using OS/X (check the counts in the audience when at a conference).
there is no such thing as "running Linux", only "running a GNU/Linux Distribution" and the choice of distribution is a big decision itself with limitations and fragmented user bases of those environments. Arch vs Gentoo vs Ubuntu vs Redhat Enterprise vs Fedora vs SUSE vs just forget it.
Even after choosing the GNU/Linux distribution, there is still the big decision of which desktop environment to choose, and the fragmented user bases of those environments. Gnome vs KDE vs XFCE vs etc etc etc just forget it.
I am for sure going to get negative votes for the above opinion (and likely some might rail -- incorrectly -- about the financial costs of "free" vs "paid" unix)
Even though I disagree that OSX is the best POSIX system for a bioinformatics working environment, I've upvoted this for your guts to push the merits of OSX. While I do think OSX is a good laptop environment, I don't think it is the best bioinformatics environment for beginners or workstations/servers. This is because installation and use of many bioinformatics tools requires custom compilations or work-arounds that are ultimately just a waste of time. Take for example the need to install the developer tools just to get gcc working....
Hi @Jonathan. Thanks for your thoughts! I don't think I would personally recommend OS/X to anybody getting introduced to bioinformatics. A very practical reason for this is that I won't force them to buy a Mac for a few courses. I'll suggest they use Ubuntu. It's free, easy to try without installing, easy to install, supports an incredibly long list of hardware and has so many interesting packages ready to install. Anybody wishing to explore the UNIX path further, including BSD or OS/X, can do that easily. I have seen too many Mac users around me who fight with their Macs to install software.
"Because that's the tool you're giving them."
Figure out for yourself whether you're teaching about bioinformatics tools or if you're doing Linux advocacy.
I'm a Unix user myself (OpenBSD in my case), but I would never put a Unix box in the hands of someone who is more proficient with Windows than with Unix, unless there was some other reason for them to be using Unix (tools, funding, ability to collaborate etc.).
Also, I wonder if there's anything called "Linux programming skills"? Perl, Python, C, C++ and most other languages can be programmed on most types of operating systems.
Do you have to justify the topic and methods you choose? I would try to avoid starting the prototypical flame war in the lecture. By comparison, in a lecture on protein structure you would not have to explain why you present Photosystem I but not an RNA polymerase... I would say: you decide, you are the boss, and basta.
@cjt The courses are about bioinformatics, but I am going to spend a lot of time on teaching Linux and everything will be done in Linux afterward. I feel compelled to justify such a choice since it does require quite an effort on the part of the students. Cheers
Linux specifically, or UNIX more broadly?
@Andreas I'm not trying to do Linux advocacy, except that it seems the only powerful enough option to me... I also did not talk about 'Linux programming skill'. I mentioned 'Linux AND programming skill' Cheers
@Casey. Let's say that 'UNIX compatible systems' are great because the UNIX philosophy was great, but if I expect people to be able to use a UNIX compatible system quickly, I'm sure going to go for Linux using an Ubuntu distro. My objective is very practical. I'm not trying to turn anybody into a hardcore UNIX geek, only to give them powerful and flexible tools and teach them how to use them for advancing their projects.
Ah, sorry for mis-reading.
If it is the only viable option that you can see, it implies that you already know how to justify it.
@Andreas. Well, more or less. As I mentioned, I do see reasons, but when you fall in love with Linux, the big reason that matters in the end is loving to do your job every day just because I can use bash/Python/Perl etc. :) I can't expect them to understand that feeling from the start. I also don't want my argument to sound like that. That is the reason why I ask for the resourcefulness of this forum :)
Sjeez and all these answers and comments are from the same people that complain that a basic R question is not bioinformatics (sorry couldn't resist).
Thanks a lot people! I have to give the right answer to somebody so I give it to the most popular answer. But keep the suggestions coming! Cheers