Question

Organizing Scripts for Bioinformatics Analysis

1


4.3 years ago

Pappu ★ 2.1k

I am wondering if there is a good way to organize hundreds of scripts in Python or R written over the years, so that one does not have to write similar functions again and again. Is there a good way to keep track of them? Oftentimes it is easier to rewrite them in 10-15 min than to search for similar code written a while back.

RNA-Seq ChIP-Seq R python • 1.1k views

updated 4.3 years ago by ATpoint 87k • written 4.3 years ago by Pappu ★ 2.1k

1


Guess you have not found a solution for this yet :-) (Ref: "Organizing bioinformatics oriented python scripts", which I found under "Similar posts" in the right column.)

Reply • 4.3 years ago by GenoMax 149k

0


There might be a better solution after 5 years!

Reply • 4.3 years ago by Pappu ★ 2.1k

1


Not the simplest, but probably the best answer is:

Turn them into an installable, runnable/usable library with documentation.

You'll no doubt deduplicate a lot of the code, and find more efficient ways to achieve the same thing.
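A possible first step toward that deduplication, sketched in Python under some assumptions: the scripts live under one directory, they parse as Python 3, and similar helpers tend to share a name. The function names `index_functions` and `duplicated` are made up for this illustration; only the standard-library `ast` and `pathlib` modules are used.

```python
import ast
from collections import defaultdict
from pathlib import Path


def index_functions(script_dir):
    """Map each function name to the script paths where it is defined.

    A path appears once per definition, so a name defined twice in the
    same file lists that file twice.
    """
    index = defaultdict(list)
    for path in sorted(Path(script_dir).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip scripts that do not parse (e.g. old Python 2 code)
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                index[node.name].append(str(path))
    return dict(index)


def duplicated(index):
    """Keep only the names that are defined more than once."""
    return {name: paths for name, paths in index.items() if len(paths) > 1}
```

Calling `duplicated(index_functions(some_dir))` on a scripts folder gives a rough list of candidates to merge into the library. It only matches on names, so helpers that do the same thing under different names will still need a manual look.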

Reply • 4.3 years ago by Joe 22k

score 2 · Answer 1 · 2020-11-02

For R, I started putting code that I use repeatedly into R packages on GitHub that I can then install via remotes::install_github("ATpoint/...."). For example, this one here collects my code related to differential analysis of NGS data. The README contains a short overview of what each function in the package does, plus a simple guide to the minimum I have to install to get everything running. In this case the README lists the Bioconductor packages I have to install manually; the CRAN dependencies are then taken care of automatically, based on the DESCRIPTION file, which lists the dependencies. Since R packages let you define a help page for every function, you can simply type ?myFunction and a help page will pop up (which you of course have to write first when developing the package) explaining the function and its arguments. That way you remember how to actually use a function even months after writing it.

Sure, any serious R developer would facepalm at the amateurism of this little package, but at least I have things summarized in a more or less organized fashion, version-controlled, and easy to install on any machine without copy/pasting scripts scattered across years-old folders from previous analyses. If you then also keep the scripts that actually run the analysis (using code from the package) under version control, for example with Git, things should be fairly organized and reproducible. I have now started to organize the actual analysis code in R Markdown documents, which is quite handy. There are surely more efficient ways of doing this, but it currently works OK for me, especially given the tradeoff I have to make: I am part wetlab, part computational, and therefore cannot spend all day in front of the computer optimizing things to the maximum.

For Python you can also put things on GitHub and then build a package (or whatever the correct term in the Python world is) from it, which you can then pull via something like pip. The key in any case is to have the code version-controlled, centralized, and easy to transfer to any machine with a single command (like pip or install_github(...)).
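To make the Python side a bit more concrete, here is a minimal sketch of what such a package could look like. It is not taken from any existing repository: the package name, the `scaffold_package` helper, and the example function `counts_per_million` are all hypothetical. It assumes setuptools (version 61 or later) as the build backend and the common "src" layout, and it puts a docstring on the function so that `help()` plays roughly the role that `?myFunction` plays in R.

```python
from pathlib import Path

# Build metadata; pip reads this file when installing the package.
# Assumption: setuptools >= 61, which can discover packages under src/.
PYPROJECT = """\
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "{name}"
version = "0.1.0"
description = "Personal helper functions for NGS analysis"
requires-python = ">=3.8"
dependencies = []
"""

# Contents of the package's __init__.py: one documented helper as an example.
MODULE = '''\
"""Helper functions collected from one-off analysis scripts."""


def counts_per_million(counts):
    """Scale raw read counts so that they sum to one million.

    counts: iterable of non-negative numbers, one per feature.
    Returns a list of floats in the same order.
    """
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        raise ValueError("total count is zero")
    return [c * 1e6 / total for c in counts]
'''


def scaffold_package(root, name):
    """Write a minimal src-layout package called `name` under `root`."""
    root = Path(root)
    pkg_dir = root / "src" / name
    pkg_dir.mkdir(parents=True, exist_ok=True)
    (root / "pyproject.toml").write_text(PYPROJECT.format(name=name), encoding="utf-8")
    (pkg_dir / "__init__.py").write_text(MODULE, encoding="utf-8")
    (root / "README.md").write_text(
        f"# {name}\n\nOne line per function goes here.\n", encoding="utf-8"
    )
    return root
```

Once a folder like this is pushed to GitHub, pip can install it straight from the repository with a URL of the form `pip install git+https://github.com/<user>/<repo>.git`, and `help(counts_per_million)` then prints the docstring on any machine where the package is installed.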