I have a new PhD student just starting a project on evolutionary comparative genomics. This will involve interaction with Ensembl, analysis of introns, exons, gene orthology, rate and pattern of substitution, that sort of thing. I have always thought highly of Bioperl (and much less highly of Biopython) mostly because of the enormous quantity of code available at Bioperl and the larger user base.
The student (who can write in either language) has suggested that the way to go is with python, partly because of the capabilities of PyCogent. I don't want the PhD to be spent reinventing the wheel, but after initial reluctance I'm starting to be convinced by python!
Specifically:
(1) Which would you choose if it was your PhD? (2) Any issues (other than code availability) I'm overlooking regarding a comparative genomics PhD using Perl vs Python? (3) Does anyone want to bravely predict the future of perl/python bioinformatics and comparative genomics, say 3 years from now- status quo or outright victor?
There can be no victors in this - the mere fact of one approach potentially ending up more popular than the other would not imply that it had won.
The matter of choice is a matter of matching your own way of thinking. Programming is pure thought translated into code - if you master one programming approach you'll be able to easily switch to a different one.
I agree too. Apologies, my language was sloppy. Victory was a very poor word to describe "much larger and more active user community". I want my student to deal with questions, not just re-write existing modules. A large and active user community helps enormously, as Biostar shows regularly.
If your PhD student is fluent in Python, then my personal experience is that all of what you describe can be done using Python. This is what I have done, while admittedly re-inventing the wheel sometimes. However, Python coding is so natural and fast that I have not spent more than the equivalent of a few days coding.
It is undeniable that Python is catching up, although some areas still seem to be much better served by Perl. As a Python fan now learning Perl, I think what will decide which language is more popular in the future is the contest between Perl's broader base of code and Python's more readable and 'logical' (to my mind at least) code. I'll bet on Python becoming more popular simply because, once you have learned it, you can express yourself naturally without straining your mind. I believe this will matter most to the people described as 'non-typical programmers', a group that includes most science students and researchers, who have usually had no formal, solid programming training.
I would also consider synteny - conserved gene order - when embarking on a comparative genomics project. Thus, one may need to look at the languages in which tools that examine syntenic relationships are written. Comparative genomics is not simply looking at base pair by base pair comparisons, but may/can involve gene by gene or genomic length by genomic length comparisons.
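To make "conserved gene order" concrete, here is a minimal sketch in plain Python. It is only an illustration under simplifying assumptions, not how any particular synteny tool works: gene orders are plain lists, orthology is a one-to-one dictionary, and a block requires strictly adjacent orthologs with no gaps (real tools tolerate gaps and rearrangements). The function name and the gene identifiers are made up for the example.

```python
def collinear_blocks(order_a, order_b, orthologs, min_len=2):
    """Find runs of consecutive genes in genome A whose orthologs are also
    consecutive in genome B, in the same or in the reverse orientation."""
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    blocks = []
    current = []     # genes of A in the block being built
    prev_idx = None  # position in B of the previous gene's ortholog
    step = 0         # +1 or -1 once the block orientation is known, else 0

    def close():
        # Keep the block only if it is long enough, then start afresh.
        if len(current) >= min_len:
            blocks.append(list(current))
        current.clear()

    for gene in order_a:
        idx = pos_b.get(orthologs.get(gene))
        if idx is None:
            # No ortholog in B: this gene breaks any block in progress.
            close()
            prev_idx, step = None, 0
            continue
        if prev_idx is not None:
            delta = idx - prev_idx
            if delta in (1, -1) and step in (0, delta):
                # Ortholog is adjacent in B and orientation is consistent.
                step = delta
                current.append(gene)
                prev_idx = idx
                continue
            close()
        # Start a new block with this gene.
        current.append(gene)
        prev_idx, step = idx, 0

    close()
    return blocks


# Hypothetical toy data: a4 has no ortholog, a5/a6 are inverted in genome B.
order_a = ["a1", "a2", "a3", "a4", "a5", "a6"]
order_b = ["b1", "b2", "b3", "b9", "b6", "b5"]
orthologs = {"a1": "b1", "a2": "b2", "a3": "b3", "a5": "b5", "a6": "b6"}
blocks = collinear_blocks(order_a, order_b, orthologs)
```

With this toy input the sketch reports two blocks: a1-a3 in the same orientation and a5-a6 in reverse orientation. The same logic is just as easy to write in Perl; the point is only that gene-order comparisons work on a different unit than base-by-base alignment.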
Since the student can code in both languages, they should be able to use either, depending on the task to be undertaken. For example, if a useful method exists only in the (more mature, more extensive) Bioperl library, they should use Bioperl for that task. If they require, say, a speed boost via Python, then they should choose Python. I think an important aspect of acquiring bioinformatics skills is knowing how to choose the correct tool for the job at hand, as opposed to imposing a pre-conceived idealistic view on the problem. More tools in the toolkit = a better bioinformatician.
I agree with you that re-writing tools simply because they don't exist in the "language of choice" is a complete waste of time. Such students are better suited to computer science than bioinformatics.
Good answer Neil. I had assumed that mixing languages would produce a disjointed, fragmentary approach, but maybe it could be a strength if it promotes critical assessment.
I can see advantages in using one language, e.g. consistency, maintaining the code base. Then again, I would never try to code something in Perl if it were easier to achieve using R.
Python, Perl and other languages [e.g. Ruby, R] (along with their bio* packages) have different strengths. If you can learn one, learning the others should not be too difficult. Coding in the language that you know is human nature, but can lead to overcomplicated code when a library from another language would have been easier for a specific task.
Given that the Ensembl API is in Perl, knowing some Perl would be a good start. In my experience it's great for writing a short program of fewer than 10 lines. However, if you only learn Perl you will miss out on great features from other languages such as Ruby+Rails (or Python/Django) or the packages you mentioned.
In general, always check existing code for what you need, whether that be libraries or full programs. This may seem like common sense, but I see many new programmers too focused on their own code at the expense of checking other people's.
Well, if you don't learn Perl you will miss out on great features like Moose (a more advanced object system built on the MetaObject Protocol), Catalyst (like Rails/Django but more flexible and modular) and CPAN (a huge code library which no other language can compare with).
Perl has a long history in bioscience (witness the great intro book Beginning Perl for Bioinformatics from O'Reilly) so it already has a head start on Python or Ruby. There are currently 2451 Perl modules in the Bio:: namespace on CPAN, so chances are you'll find most of what you need for your research.
As for what language feels easiest to work with, that is a matter of personal taste. Perl can be idiosyncratic, especially without a C or UNIX background. I find core Ruby very elegant and natural, but the ecosystem isn't as mature as it should be. I haven't touched Python since I learned it in Uni. I feel the "one way to do it" philosophy is a straitjacket, and having a syntax that depends on invisible control characters is a design failure.
I had the same questions early on during my PhD. I went towards Perl/Bioperl, but then got stuck with obnoxious bugs, unreadable code and the difficulty of programming the way I wanted to (e.g. easily passing hashes of arrays of hashes back and forth between functions). So I tried Ruby, and haven't been back. I feel that coding in Ruby simply requires much less thinking. While there are fewer existing tools, I think my productivity is higher.
(for certain things piping shell commands is sufficient, or R/bioconductor is the best solution)
My vote: let him play around a bit & then choose what makes sense to him. That's likely where he'll be the most productive.
I used pygr a while ago for some comparative stuff I did... and I think it's becoming more mature and much more powerful. It supports quite quick interval-based queries over large sequence/interval databases. It is of course written in Python and, if I believe certain people I know, is the future.
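For readers unsure what an "interval-based query" means here, a minimal sketch in plain Python follows. It is not pygr's API and says nothing about how pygr is implemented. It only shows the kind of question being asked: which annotated features overlap a given region? The class name, the half-open (start, end) coordinates and the exon names are assumptions made for the example.

```python
from bisect import bisect_left


class IntervalIndex:
    """Toy index over (start, end, name) features, half-open coordinates."""

    def __init__(self, intervals):
        # Sort once by start so a query can skip everything that begins
        # at or after the end of the query region.
        self._items = sorted(intervals)
        self._starts = [start for start, _, _ in self._items]

    def overlapping(self, start, end):
        """Return names of features overlapping the region [start, end)."""
        # Only features with feature_start < end can overlap the region.
        hi = bisect_left(self._starts, end)
        # Of those, keep the ones that end after the region begins.
        return [name for s, e, name in self._items[:hi] if e > start]


# Hypothetical exon coordinates on one chromosome.
exons = IntervalIndex([
    (500, 650, "exon3"),
    (100, 200, "exon1"),
    (300, 450, "exon2"),
])
hits = exons.overlapping(180, 320)
```

Here `hits` contains exon1 and exon2. This linear filter is fine for small feature sets; for genome-scale data a dedicated structure (an interval tree or a nested containment list) avoids scanning every candidate, which is the kind of performance a specialised library is meant to provide.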
Of course I agree with all of the "programming languages are tools - not life choices" arguments. I personally believe that people should learn Python first and then Perl (they will understand all the annoying "features" of Perl and appreciate some of its good points compared to Python).
Thanks, I'd heard of pygr, but I've only started reading the documentation after you mentioned it. It looks really interesting, I shall definitely have to investigate this more.
Thanks for pointing out the pygr link. It is very tempting for people like me who are new to Python and come from a Perl background. Interestingly though, the 'Documentation', 'Tutorials' and other links in the "Links" subsection of that site are not active!
I wrote something on almost the same topic a few months back. I have worked with both, and I am more inclined towards Python. By no means do I want to discourage people who can code well in Perl.
I agree! Don't force the programmer to adopt or adhere to a given language when the goal really is to achieve a working tool or algorithm.