Forum: Books and Blogs guiding programming practice in bioinformatics research?
4 months ago
JustinZhang ▴ 120

Besides Bioinformatics Data Skills (2015) by Vince Buffalo, are there any books or blogs that teach or share experience of programming for bioinformatics practice?

Background

I majored in clinical medicine and taught myself Python, R, Unix and machine learning. When it comes to writing pipelines and complex data analysis scripts, I easily get confused about coding style, unit testing, and the trade-off between coupling and decoupling. Sometimes I find myself struggling to manage functions, scripts and files, and end up spending way more time and energy on coding than I expected.

There is literally nobody around me who can help. About 99% of my colleagues work at the wet bench. The book mentioned above helped me a lot, but it is somewhat outdated and does not cover programming (I mean writing Python/R/Julia scripts and Snakemake/Nextflow files) in enough depth. I have tried reading popular bioinformatics repos on GitHub, but the lack of comments makes certain parts hard to understand. LLMs can help, but their answers are unstable and inconsistent.

Question

To improve this situation, are there any public resources I should know well? And can anyone with similar experience share their approach? Thank you all in advance.

R Python Programming • 768 views
4 months ago
DGTool ▴ 290

Personally, if it's about writing software, I don't think the resources need to be bioinformatics-specific. In general it's just like writing regular software, only in the context of biology or answering biological questions. Even outside bioinformatics, in general software engineering, the quality of people's code can vary /a lot/, as can coding styles (these can also depend on where you work). One good piece of advice I always hear is that when you comment a piece of code, the comment should say /why/ the code was added, not what the code does (e.g. not "This function reverses the sequence"). General programming resources (blogs, books, YouTube videos) that go into detail about topics like unit testing (e.g. Test-Driven Development) would be good enough, and you can then apply them when writing your own programs or scripts. (I would also say that in bioinformatics/academia there is a tendency to just get code working without making it look good, and in general there does not seem to be much of an incentive to maintain code once it is published. This can sometimes make it very difficult to understand code other people have written.)
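To make the "why, not what" advice concrete, here is a minimal Python sketch. The function, the aligner scenario in the comment, and the test names are all hypothetical and only there for illustration; the point is the difference between the two comment styles and how small a unit test can be.

```python
# Translation table built once; maps each base to its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    # A "what" comment would say: "reverse and complement the sequence".
    # That only repeats the code. A "why" comment records the reason:
    #
    # Why: our (hypothetical) aligner reports minus-strand hits in
    # plus-strand orientation, so primers must be flipped before they
    # can be compared against the reported hit sequence.
    return seq.translate(_COMPLEMENT)[::-1]


def test_known_sequence():
    # Hand-checked case: complement of ATGC is TACG, reversed is GCAT.
    assert reverse_complement("ATGC") == "GCAT"


def test_round_trip():
    # Applying the function twice should give back the input.
    assert reverse_complement(reverse_complement("GATTACA")) == "GATTACA"
```

The two `test_` functions follow the naming convention that pytest picks up automatically, but they are plain functions and can also be called directly.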

"the comments should be /why/ the code was added, not what the code does (i.e. not "This function reverses the sequence")." Top tip. Thank you!

the comments should be /why/ the code was added, not what the code does

I think comments should serve both purposes. Sometimes the code is very clear, so the comment should be about why it is there. Other times the code function is not obvious at all, and describing its functionality is essential.

4 months ago
BioinfGuru ★ 2.1k

This is a great question, and in my experience it describes the reality of learning bioinformatics (even with an MSc in bioinformatics). My answer is probably more directed at those viewing this question who are generally less experienced than you.

To answer your main question (are there any books and blogs?): there are countless. My favourites are: Biostars Handbook, Modern Statistics for Modern Biology, Computational Genomics with R, and Bioinformatics Data Skills.

Here's my main advice:

  1. What I used to do: Save many (many!) well-organised browser bookmarks for different languages, tools and datasets that help when I get stuck. I still have them and they have been very helpful.
  2. What I do now: Wherever possible, I stick to 1 language (for me it is R). It is more useful to be advanced in 1 than beginner/intermediate in several. When I find something that helps, I study it, summarise it, write the code with comments and notes, and if possible write a function for it. Then I save it for later, keeping related code in the same place, so I can reuse the snippet when I hit a similar problem. This all takes a lot of time... the first time. For example, I have 1 large function for PCA of DESeqDataSet objects that I built over time; it takes a dds object and spits out a screeplot/biplot/triplot/heatmap/correlation plot etc. Each one took time to learn, write, and test. Now it just works, I trust it, I understand the code and comments (because I wrote them), and no more time is spent. The only bookmarks I tend to save now are for textbooks or a really helpful blog post with so much in it that I haven't got round to studying it fully yet, e.g. A guide to designs and contrasts in DESeq2
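As a rough illustration of the "write a function, add notes, save it for later" habit in point 2, here is a small sketch in Python rather than R. It is not the poster's PCA function: the name `explained_variance_ratio` and the choice of a scree-plot helper are assumptions made only to show what a saved, annotated snippet might look like.

```python
from typing import List, Sequence


def explained_variance_ratio(eigenvalues: Sequence[float]) -> List[float]:
    """Fraction of total variance captured by each principal component.

    Notes to future me (the part that saves time later):
    - Input is the eigenvalues of the covariance matrix, i.e. the
      variances of the components, not their standard deviations.
    - The output is what goes on the y-axis of a scree plot.
    """
    if len(eigenvalues) == 0:
        raise ValueError("need at least one eigenvalue")
    if any(v < 0 for v in eigenvalues):
        # Variances cannot be negative; a negative value usually means
        # the wrong quantity was passed in.
        raise ValueError("eigenvalues must be non-negative")
    total = sum(eigenvalues)
    if total == 0:
        raise ValueError("total variance is zero")
    return [v / total for v in eigenvalues]


# Usage: three components with variances 2, 1 and 1.
ratios = explained_variance_ratio([2.0, 1.0, 1.0])
```

Keeping helpers like this in one module, each with a docstring that records what was learned the first time, is one way to get the "it just works and I trust it" effect described above.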

Just to highlight a few of your points:

  1. Easily confused: Been there, still there, will always be there. We are learning. Wherever possible, I avoid complex layers (Galaxy, Snakemake, Nextflow). Galaxy, IMO, will keep anyone at beginner-level bioinformatics forever. Snakemake/Nextflow have their uses (i.e. devops/opsec/collaborations), but when I am the only user they just add unnecessary and confusing complexity.
  2. Spending way more time and energy on coding than I expected: Me too. That is the nature of learning how to do anything right, and it's why I make the notes I do: next time, I'll have the code snippet already available.
  3. Nobody around me can help: That is why we are here, and also the Bioconductor Forum and Posit Community.
  4. LLMs can help, but their answers are unstable and inconsistent: Completely agree. I use ChatGPT from time to time but (and it's a big BUT) I have noticed it only tells me the general knowledge a degree graduate would be expected to have. It does NOT tell me things that someone with experience would know. For example, it will tell me how to combine multiple tissues in a differential expression analysis, but it will not warn me what that may do to my normalisation or why doing so is not appropriate for my data... unless I prompt it. If I'm not aware of the pitfalls, it won't help.

Hope that helps, Kenneth

4 months ago
filip.buric ▴ 10

One book I adapted for my courses (which are designed for the kind of level/position you seem to be at) is The Pragmatic Programmer by Andrew Hunt and David Thomas. It addresses a lot of what you mention. While it was written for software engineers, I think most of it applies perfectly well to bioinformatics development, and I would guess you know enough now to just start reading it. Maybe borrow it from the library first, though, to see if it's good for you before buying. It's a classic in the field, so they very likely have copies. There's an updated edition that came out a few years ago.

In the software industry, a lot of "wisdom" and good practice is absorbed from senior colleagues and from domain-specific workshops, besides books. There are some workshops for researchers as well, though unfortunately not many about software development practices. Notable:

  • Software Carpentry runs a number of free workshops on various topics, hosted by volunteers at universities. They also have quite good free online material, but maybe it is too basic for you now?
  • Greg Wilson has written a lot about good software development practices for science. Two of his papers that I have bookmarked are 1 and 2.

(Advertisement:)

I have started selling courses that are meant for this cross-disciplinary niche, especially researchers without formal computational education. Since half of my background is in computing, I want to adapt / translate industry practices for research. I don't have much in video form right now, however. Perhaps the most useful would be my software testing course. But The Pragmatic Programmer covers a good deal of what I talk about, so maybe check that out for free first :) The extra part I add is about testing stochastic code and I expand on Property-Based Testing a bit.
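For readers unfamiliar with the two testing ideas mentioned here, below is a hand-rolled sketch using only the Python standard library. It is not taken from the course or the book, and it does not use a property-based testing library (Hypothesis is the usual choice in Python); all function names are made up for illustration. The first check generates random inputs and asserts a property that must hold for all of them; the second tests stochastic code by injecting a seeded random number generator.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def bootstrap_mean(values, n_resamples, rng):
    """Average of the means of n_resamples bootstrap resamples.

    The random generator is passed in (not created inside) so that a
    test can supply a seeded one and get a reproducible result.
    """
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(values) for _ in values]
        means.append(sum(sample) / len(sample))
    return sum(means) / len(means)


def check_round_trip_property(n_cases=200, seed=0):
    """Property check: for any DNA string, applying reverse_complement
    twice returns the input, and the length never changes."""
    rng = random.Random(seed)
    for _ in range(n_cases):
        length = rng.randint(0, 50)
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        assert reverse_complement(reverse_complement(seq)) == seq
        assert len(reverse_complement(seq)) == len(seq)


def check_bootstrap_is_reproducible_and_bounded():
    """Stochastic code: same seed gives the same answer, and the answer
    obeys an invariant that holds whatever the random draws were."""
    values = [1.0, 2.0, 3.0, 4.0]
    first = bootstrap_mean(values, 100, random.Random(42))
    second = bootstrap_mean(values, 100, random.Random(42))
    assert first == second
    assert min(values) <= first <= max(values)


check_round_trip_property()
check_bootstrap_is_reproducible_and_bounded()
```

Neither check asserts a specific hand-computed number. They assert relationships that must hold for every input or every seed, which is what makes this style useful when the exact output is hard to predict.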
