Forum: Bioinformatics software distribution
10
7
7.7 years ago

Hi fellow colleagues! Happy coming weekends =)

  1. In your experience, what is the best way to distribute bioinformatics software, and what is the best way for you to get software? Do you prefer to download compiled versions from https://sourceforge.net/, or is it crucial to see a lively GitHub repository and be able to compile the code yourself?
  2. When you distribute your tools, what do you measure about their usage? How do you encourage people to actually cite your tools as papers (not just as web links)? Are there any tools to calculate how many times your software was cited as a web link in papers? Do you try to get information about the users of your software, such as email addresses, names, countries, or phone numbers?
  3. How do you solve licensing and liability issues, especially for something you did in your own spare time as a side project? How do you protect yourself, or maybe even find a way to make some money off your tools?

Maybe you know a good tutorial on this? If not, let's answer these questions in the discussion below and turn it into a tutorial for everybody interested.

UPDATE: two more questions (thanks to genomax2):

4. What do you think about making your tool available as a library or a package for Python, Ruby, Perl, R, or other languages? Is this important and needed?

5. Are there any collaborations we can join to provide our tools to be part of bigger packages and still have the ability to publish about them, have control and support, maybe a way to sell it as well?

UPDATE: two more questions (thanks to Arnaud):

6. As a tool developer and as a user, what do you think of software tools as plugins, and how can one develop a plugin for, say, samtools?

7. Is Galaxy (or similar solutions) a good way to distribute software for tool developers and for users?

Thank you,
Petr

Thank you!

software • 5.0k views
ADD COMMENT
7
7.7 years ago

What is the best way from your experience to distribute bioinformatics software and what is the best way for you to get software? Do you prefer to download compiled versions from https://sourceforge.net/, or seeing lively github repository and ability to compile yourself is crucial?

GitHub is great: both source code and compiled binaries are available there for most bioinformatics tools, at least for mine (csvtk, seqkit, taxonkit, rush ...).

GitHub provides a wiki and GitHub Pages (which I chose) where you can host the software's documentation.

Read the Docs is also a good place for hosting documentation, e.g., khmer.

When you distribute your tools what do you measure about its usage?

I distribute binary files via GitHub releases, e.g., the seqkit releases. There are tools (https://img.shields.io) that can show the download count of every file or of a whole release, e.g., Github Releases (by Asset); more examples are on the seqkit page.
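
As a rough illustration of counting downloads per release, here is a minimal sketch. It assumes you have already fetched the JSON list from the GitHub REST API releases endpoint (GET /repos/{owner}/{repo}/releases), where each release has a `tag_name` and a list of `assets` with a `download_count`. The sample data, file names, and numbers below are made up.

```python
import json


def total_downloads(releases):
    """Sum asset download counts per release tag.

    `releases` is the parsed JSON list returned by the GitHub REST API
    endpoint GET /repos/{owner}/{repo}/releases.
    """
    counts = {}
    for release in releases:
        tag = release["tag_name"]
        counts[tag] = sum(
            asset["download_count"] for asset in release.get("assets", [])
        )
    return counts


# Hypothetical sample shaped like the API response (numbers are made up).
sample = json.loads("""
[
  {"tag_name": "v0.2.0", "assets": [
      {"name": "tool_linux_amd64.tar.gz", "download_count": 120},
      {"name": "tool_windows_amd64.zip", "download_count": 60}]},
  {"tag_name": "v0.1.0", "assets": [
      {"name": "tool_linux_amd64.tar.gz", "download_count": 15}]}
]
""")

per_release = total_downloads(sample)
print(per_release)
print(sum(per_release.values()))
```

Splitting the counts by asset name instead of by tag would give a per-platform breakdown (Linux vs. Windows vs. OS X).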

I also add tools to bioconda, so people can easily install and update them via conda. Brew is also an option for OS X users. Conda can also show the download counts, e.g., downloads, detail.

How do you encourage people to really cite your tools as papers (not just weblinks).

Citation information is available on the project page and in the command-line usage message. All I can do is make my software better and hope people will cite it.

Are there any tools to calculate how many times your software was cited as a weblink in papers?

Maybe by using Google.

Do you try to get information about users of your software like email addresses, names, countries, phone numbers?

I did not and will not collect any private information of users.

How do you solve licensing and liability issue especially for something you had done on your own spare time as a side project.

It depends on the licenses of the dependencies. For me, most of the dependencies are under the MIT license, and I also use it for my projects.

How do you protect yourself or maybe even find a way to make some money off your tools?

I have never thought about making money from my open-source projects on GitHub. Since they are open source, it is hard to protect the code.

Maybe you know a good tutorial on this? If not, let's answer to this questions in the discussion below and make it as a tutorial for everybody interested.

N/A

ADD COMMENT
0

Thank you shenwei356. Could you please say whether

you prefer to download compiled versions from https://sourceforge.net/, or whether seeing a lively GitHub repository and being able to compile yourself is crucial?

How, in general, do you want to download and install the software you use for science?

Also, as a software developer and scientist, what is more important for you (especially for your career, in writing grant proposals, etc.): how many times your paper was cited, how many times your software tools were downloaded, or how many times your tools were used in a paper but cited as a webpage?

ADD REPLY
1

Source code and binary files are both OK for me, but the latter is better. Sometimes it's hard for users, especially beginners, to compile from source. So I chose to write scientific software in Go, which lets me cross-compile single binary files for Linux/Windows/OS X. Users really like this.

Is this a poll??

Lots of things are important for my career; citations come first. You can see my paper citations on Google Scholar.

I didn't count the cases of being cited as a webpage. SeqKit was downloaded 300+ times for v0.4.3 (Github Releases (by Release)), but there are no citations or webpage links for now :(. The users may just not have published their papers yet :)

ADD REPLY
0

Kind of a poll, yes. I want to align my own preferences with our community preferences, standards, and needs, so I thought it is a great place to discuss how we distribute our software, how we analyse its distribution process and how we personally like to get software at the same time (and the way we cite it).

For example, I love tools that work out of the box on Linux and macOS: no installation, no dependencies, just put it in the PATH or update the PATH and you are good to go (but I prefer to use direct links, so I have more control over the versions of each tool). I do not care about Windows, even though one of the work laptops is a Windows machine. I like to have access to all versions of a software tool just in case, and I tend to keep all the versions I have used in an archive. I hate having to provide any information about myself to download and try a tool. Moreover, I do not mind providing some help to make a tool better if I like it, but I have not seen a tool that encourages this. Most encourage citing the paper, but to be honest, I sometimes cite with a web link instead of a proper reference to a paper, especially if I was not able to find a proper paper to cite in a few minutes (maybe even a few seconds). Also, I am OK with software that checks with its web server to tell me about available updates.

To conclude, I think there is a gap between what users want and do and what software tool developers want and do in terms of software distribution. Maybe I am wrong and there is no gap. But if there is one, we can address it from both sides.

ADD REPLY
1

I do care about Windows users, who make up about 1/3 of all users of seqkit and csvtk according to the download history.

There's no need to hate everything that asks you to fill in information before downloading; the developers may just want to track users and take a poll. E.g., the SPAdes download page encourages users to fill in user information but also explicitly provides direct download links.

GitHub is a good place to communicate with the developers and to contribute or help improve the projects.

Published software that encourages citation usually provides paper links.

Software that checks for updates is common; I did this too. Using package management tools like conda and brew is also a good way to keep tools updated.

ADD REPLY
5
7.7 years ago
Charles Plessy ★ 2.9k

Here is my very biased point of view as a member of the Debian Med project.

What is the best way from your experience to distribute bioinformatics software and what is the best way for you to get software?

We distribute binary packages, with names as close as possible to the original, so that our users can install a tool with a command such as apt install name-of-the-tool.

Do you prefer to download compiled versions from https://sourceforge.net/, or seeing lively github repository and ability to compile yourself is crucial?

We download the source from the upstream developers and use it to build our binary package. Regarding how the source is made available upstream, we are not picky, but our task is much easier when it is available from a Git repository and releases are available both as a file archive and as a Git tag, with machine-predictable URLs so that we can detect new releases automagically.
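
To illustrate why predictable tags and URLs help, here is a minimal sketch (not Debian's actual tooling) that picks the newest release from a list of tag names and builds the matching archive URL. The tag scheme `v1.2.3`, the owner and repository names, and the use of GitHub's tag archive URL layout are all assumptions made for the example.

```python
import re

# Assumed tag scheme: optional "v" followed by major.minor.patch.
TAG_PATTERN = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_tag(tag):
    """Return (major, minor, patch) for tags like 'v1.2.3', else None."""
    match = TAG_PATTERN.match(tag)
    return tuple(int(part) for part in match.groups()) if match else None


def newest_release(tags):
    """Pick the highest version among tags that follow the scheme."""
    versioned = [(parse_tag(t), t) for t in tags if parse_tag(t) is not None]
    return max(versioned)[1] if versioned else None


def tarball_url(owner, repo, tag):
    """Build the archive URL for a tag, assuming GitHub's URL layout."""
    return f"https://github.com/{owner}/{repo}/archive/refs/tags/{tag}.tar.gz"


tags = ["v1.9.0", "v1.10.0", "nightly", "v1.2.3"]
latest = newest_release(tags)
print(latest)
print(tarball_url("example-owner", "example-tool", latest))
```

Comparing the numeric parts rather than the raw strings matters here: as plain text, "v1.9.0" would sort after "v1.10.0".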

When you distribute your tools what do you measure about its usage?

http://popcon.debian.org

How do you encourage people to really cite your tools as papers (not just weblinks).

We provide citation information in YAML format; see the one of the seaview package for example. Patches are welcome when the information is not up to date.

Do you try to get information about users of your software like email addresses, names, countries, phone numbers?

Apart from the popcon system, which is opt-in and uses encrypted communications, we do not track anything from our users, and we make our best efforts to make sure that the software that we redistribute does not track them either.

How do you solve licensing and liability issue especially for something you had done on your own spare time as a side project.

In brief, we strongly recommend using an existing, well-established license and never attempting to write one from scratch. More detailed guidelines are available in our upstream guide, which covers many more topics and which I encourage everybody to read!

How do you protect yourself or maybe even find a way to make some money off your tools?

In the Debian distribution, there is only Free software, and this freedom gives the authors and everybody else the right to use the software commercially and earn money from using or selling it. (Be careful that besides Debian proper, there is also the non-free section, which contains open-source software that is not Free, like cufflinks for instance, but we strongly discourage the creation of non-Free software.)

UPDATE: two more questions:

update: two more answers

What do you think about having access for your tool as a library or a package for python, ruby, perl, R, other languages? Is this important and needed?

Tools wrapped up as regular modules/libraries for the languages listed above are easier for us to package, since for each of them we have dedicated teams with considerable experience. That said, for C and C++ I would recommend refraining from distributing a library unless it is really intended to be linked against by third parties and there is a commitment to properly handle breakages of backwards compatibility.

Are there any collaborations we can join to provide our tools to be part of bigger packages and still have the ability to publish about them, have control and support, maybe a way to sell it as well?

Publishing is great, and I very much feel the deep pressure to do so, but sometimes I dream of a world where contributing a new function to an existing toolkit is regarded as positively as creating a new stand-alone tool... and then not having time to maintain it later.

ADD COMMENT
0

Thank you Charles. This is very helpful. The idea of a YAML format for citations is especially handy. I will read your licensing guide. Thank you.

ADD REPLY
4
7.7 years ago
igor 13k

Do you prefer to download compiled versions from https://sourceforge.net/, or seeing lively github repository and ability to compile yourself is crucial?

You can have compiled binaries on GitHub (and track versions, too): https://help.github.com/articles/creating-releases/

For example, samtools: https://github.com/samtools/samtools/releases

Not sure how trustworthy SourceForge is now: https://www.howtogeek.com/218764/warning-don%E2%80%99t-download-software-from-sourceforge-if-you-can-help-it/

How do you solve licensing and liability issue especially for something you had done on your own spare time as a side project.

GitHub has a nice tutorial for that: https://choosealicense.com/

ADD COMMENT
1

One annoyance with distributing releases via GitHub is that many people don't know that they have to click on the rather hidden Releases tab. I've had several interns and students get confused when I sent them a link to the project on GitHub.

ADD REPLY
1

So providing a download or installation link on the project home page is necessary, e.g.,

Table of Contents

ADD REPLY
1

Yes, the "releases" is not obvious at all. You can have a link in your README to the releases page, though. I think that's how I originally found out about it.

ADD REPLY
0

Thank you Igor,

What are your thoughts on:

When you distribute your tools what do you measure about its usage? How do you encourage people to really cite your tools as papers (not just weblinks). Are there any tools to calculate how many times your software was cited as a weblink in papers? Do you try to get information about users of your software like email addresses, names, countries, phone numbers?

ADD REPLY
1

GitHub provides some statistics, but those are only for visitors.

You can have your software tool ping a specific URL every time it is run. That's a little bit questionable from a privacy perspective, but is technically possible.

For example, GATK used to have a "phone home" feature until recently: http://gatkforums.broadinstitute.org/gatk/discussion/1250/what-is-phone-home-and-how-does-it-affect-me
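
A minimal sketch of such a ping, written to address the privacy concern by being opt-in and sending only the tool name and version. The endpoint URL, the environment variable name, and the tool name are hypothetical; this is not how GATK implemented its feature.

```python
import os
import urllib.parse
import urllib.request

PING_URL = "https://stats.example.org/ping"  # hypothetical endpoint
OPT_IN_VAR = "MYTOOL_USAGE_PING"             # hypothetical environment variable


def build_ping_url(tool, version):
    """Encode only the tool name and version; no user-identifying data."""
    query = urllib.parse.urlencode({"tool": tool, "version": version})
    return f"{PING_URL}?{query}"


def maybe_ping(tool, version, opener=urllib.request.urlopen):
    """Send a ping only if the user opted in; never let a failure break the tool."""
    if os.environ.get(OPT_IN_VAR) != "1":
        return False
    try:
        # Short timeout so an unreachable server does not slow down the run.
        opener(build_ping_url(tool, version), timeout=2)
        return True
    except OSError:
        # Network errors (including urllib's URLError) are ignored on purpose.
        return False


print(build_ping_url("mytool", "0.1.0"))
```

Making the ping opt-in and documenting exactly what is sent keeps it closer to what users in this thread say they are comfortable with.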

ADD REPLY
4
7.7 years ago
GenoMax 147k

Since this thread has already grown large, I am not going to try to multi-quote, but let me see if I can remember the things I wanted to respond to.

  • Even before I choose to download something you need to convince me to try your software in words. That can be done by a post on Biostars/SeqAnswers and/or a great ReadMe on your SF/GitHub repo. There are many packages out there that can do "X" and I am not generally looking to get a package that will do a one-off thing, unless it is something I can't find in another package.

  • If you don't support a specific OS make that clear upfront. Post a list of OS's you (or your users) have successfully tried your software on. If you are providing your software for free you are not obligated to support all OS's.

  • Make it easy for users to download/use your software. If you must use a dependency then preferably make it so it would be a compiled entity. Having to go down a rabbit-hole of x needs y, y needs z totally confuses consumers. If you are able to make binaries available for common OS's (along with source for those who work on clusters etc) that is the best option.

  • If you can't make a self-contained download entity then consider making your software available via conda/brew/apt-get etc.

  • If a grant supports your software development and the agency requires you to collect usage/stats then make it simple. I am happy to provide an email address as long as I am assured that it will be used for just that. Make it a one time thing (for updates I should only need to provide the email address (like IGV)).

  • It may be feasible to make money from your software by licensing it for commercial usage/inclusion in other commercial software packages.

  • The chances of having a publication about your software before its widespread usage are small unless it is part of a larger study (this has been a topic of discussion in the past). So you are going to have to contend with citations pointing to the software repo for a while.

  • You are going to be walking a thin line if you are developing the software on your own time while still employed. You will want to keep your ducks in a row so that you do not get in trouble with your employer down the road.

  • Ultimately the quality/utility of your software will decide how far/where it will go.

ADD COMMENT
1

Very useful comments. I'd give you 5 upvotes if I could!

ADD REPLY
0

Thank you genomax2. Very useful indeed. How do you collect information about papers that mention your software as a web link (not as a citation to the paper)?

I would love to accept your answer, but I want to continue this discussion and hear as many ideas as possible. I think we all benefit from this conversation.

I like the idea of having all tested OSs listed in the documentation upfront.

As you say, single-purpose packages are in general less useful than tools covering a broad spectrum of tasks. At the same time, you mention that software can be included as part of other commercial software. This is an interesting idea. From this I have two more questions, which I will also add to the main question:

  1. What do you think about having access for your tool as a library or a package for python, ruby, perl, R, other languages? Is this important and needed?

  2. Are there any collaborations we can join to provide our tools to be part of bigger packages and still have the ability to publish about them, have control and support, maybe a way to sell it as well?

ADD REPLY
1

You should be able to get papers that list the link for your software from Google scholar. Here are examples for FastQC and BBMap.

Some people prefer to stay in one environment (e.g. R) so for them having your software available as a package may be useful (e.g. featureCounts as part of package Rsubread). I am not sure how much additional effort is needed to do that and it becomes something you would need to keep up with on an ongoing basis.

It is tough to publish tools as standalone papers but at the same time it has been done before (TopHat, HISAT, STAR etc).

BTW: You can accept more than one answer (when you finally feel like you are satisfied).

ADD REPLY
0

I am already very satisfied; the conversation and the ideas shared here have far surpassed my expectations.

ADD REPLY
4
7.7 years ago

What is the best way from your experience to distribute bioinformatics software and what is the best way for you to get software?

I use Github for distributing source code and pre-compiled binaries and OS X installers for our BEDOPS toolkit. I use readthedocs to distribute documentation. We transitioned from Google Code. Our documentation efforts were so successful that other toolkits have copied the style and approach of our work, which is praise in its own way!

I also make use of package installers like Bioconda and Homebrew to make our software available (e.g., conda install bedops and brew install bedops). Some people are working on a Debian recipe, so maybe it will one day be available via apt-get or the like.

Do you prefer to download compiled versions from https://sourceforge.net/, or seeing lively github repository and ability to compile yourself is crucial?

GitHub allows the power user to clone a repository and compile it for him or herself, and lets the less savvy user download precompiled binaries.

Making source available is key, I think, whichever way you choose to do it, whether Sourceforge, Github, Bitbucket, etc. Closed source tools are black boxes and can't be tested and evaluated properly.

When you distribute your tools what do you measure about its usage?

Github gives limited information about downloads and makes tracking difficult, if not impossible. This is definitely a weakness, in terms of gauging demand for source code, but it's a positive for end user privacy, I suppose.

We can track access to documentation via Google Analytics, which gives a very rough idea of where repeat users are coming from. People can install tracker blockers, as is their right, so the conclusions we can take from this level of tracking are limited, at best.

How do you encourage people to really cite your tools as papers (not just weblinks).

Obviously, we can't enforce citations, but we post citation information everywhere, including online on the GitHub front page, on the front page of the documentation, and in the --help option of all command-line tools!
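
One way to put citation information into --help output can be sketched as follows. This is only an illustration in Python with argparse, not how BEDOPS does it; the tool name, the `--cite` flag, and the placeholder citation text are all invented for the example.

```python
import argparse

# Placeholder citation text; replace with the real reference for your tool.
CITATION = (
    "If you use this tool in your work, please cite:\n"
    "  Author A, Author B (YEAR). Title of the paper. Journal. doi:PLACEHOLDER"
)


def build_parser():
    parser = argparse.ArgumentParser(
        prog="mytool",
        description="Example command-line tool.",
        epilog=CITATION,  # shown at the bottom of --help
        # Keep the line breaks of the citation instead of re-wrapping them.
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument("input", help="input file")
    # Hypothetical extra flag that prints only the citation and exits.
    parser.add_argument("--cite", action="version", version=CITATION,
                        help="print citation information and exit")
    return parser


help_text = build_parser().format_help()
print(help_text)
```

Because the citation is shown every time someone checks the usage message, users do not have to hunt for the right paper.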

Are there any tools to calculate how many times your software was cited as a weblink in papers?

Google Scholar offers some limited options.

Do you try to get information about users of your software like email addresses, names, countries, phone numbers?

No, not really. Keeping track of where people come from is not very easy from Github. We do add a Google Analytics tracker for the documentation site, which offers us a very very coarse snapshot of where interest in BEDOPS is on a country-by-country level.

How do you solve licensing and liability issue especially for something you had done on your own spare time as a side project. How do you protect yourself or maybe even find a way to make some money off your tools?

We have always offered our software under open-source license terms. Obviously, we haven't been able to stop others from copying the algorithms and approaches we documented in our Bioinformatics paper, but that's just life in academia, I guess.

Some people offer their software with two licenses, one for academia and another for commercial users, which is for-fee. Or they offer support or service contracts. Those may be ways to monetize and support ongoing development efforts.

ADD COMMENT
0

Your idea of using Google Analytics on the documentation website for your tool is awesome. Very, very useful. Thank you Alex!

ADD REPLY
3
7.7 years ago

On the matter of software distribution: As a user/programmer, I find programming languages' central repositories (e.g. CPAN (Perl), CRAN (R), PyPI (Python)...) incredibly useful and convenient. As a user, I'd rather get precompiled binaries for my system if the software has too many dependencies, in particular on libraries not available from my OS's package manager, or for situations where I don't have admin rights.

As for licencing, this is a can of worms. Just as an example, there's some debate about whether the new terms of services at GitHub are compatible with some open source licences (see here).

ADD COMMENT
0

Thank you Jean-Karim

ADD REPLY
2
7.7 years ago
Arnaud Ceol ▴ 860

UPDATE: two more questions:

1) What do you think about having access for your tool as a library or a package for python, ruby, perl, R, other languages? Is this important and needed?

This is often at least really useful. It means that you can integrate it in your own software/library, without having to reimplement it. About other languages, don't forget about Java (and BioJava).

I believe that the best practice is publishing the library + the desktop/web/whatever application to use it straight forward.

2) Are there any collaborations we can join to provide our tools to be part of bigger packages and still have the ability to publish about them, have control and support, maybe a way to sell it as well?

There are several communities like this. I'm thinking, for instance, of the BioJS community https://biojs.net/ : each component is developed by a different group and published independently (there is even a channel for it on F1000). It makes it easy for programmers to search for, use, and cite the right components.

Another great way (in my opinion) to release software is to create plugins for already widely accepted software rather than creating a new stand-alone program. It makes the software more visible and easier to integrate with other functions. Cytoscape is a really good example, and each plugin can be published independently (there are many examples). The Integrated Genome Browser (IGB) is following a similar strategy.

ADD COMMENT
1

The same goes for Bioconductor.

ADD REPLY
0

Thank you, Arnaud. Very interesting. What do you mean by "web/whatever application to use straight forward"? You say there is a way to have web tools, so can we imagine, say, samtools as a web application? If so, it is an interesting idea. For applications with a small amount of data to input and output, I can imagine a web service with a RESTful API and JSON output. Do you know of examples of such tools?
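
To make the idea concrete, here is a minimal sketch of such a service using only the Python standard library. It is a toy, not an existing tool: the `/gc?seq=...` route, the port, and the choice of GC content as the computation are all assumptions for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def gc_report(sequence):
    """Return a JSON string with the length and GC fraction of a DNA sequence."""
    seq = sequence.upper()
    gc = sum(seq.count(base) for base in "GC")
    fraction = gc / len(seq) if seq else 0.0
    return json.dumps({"length": len(seq), "gc_fraction": round(fraction, 3)})


class GCHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expect requests like /gc?seq=ACGT (hypothetical route).
        params = parse_qs(urlparse(self.path).query)
        body = gc_report(params.get("seq", [""])[0]).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the service; call this explicitly, it blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), GCHandler).serve_forever()


print(gc_report("ACGTGGCC"))
```

Keeping the computation in a plain function (`gc_report`) means the same code can be shipped as a library, a command-line tool, and a web service.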

Also, this reminded me of Galaxy. I am curious why nobody mentioned it for a way to distribute software tools?

Plugins are also a very interesting idea. Is there a way to develop a plugin for samtools or FastQC, for example? How do you make sure it stays compatible with updates? I have never done anything like this. Could you please provide some guidance and explain the pros and cons? Thank you.

ADD REPLY
1

"web/whatever application to use straight forward": I just mean any application (either desktop or web based) that can be used out of the box (as opposed to a library that has to be integrated into other software).

As far as I know, it is not possible to create plugins for samtools and FastQC.

ADD REPLY
0
7.7 years ago

This is a very interesting discussion. Thank you. I find great ideas in each answer and learn a lot. As the discussion evolves, I get new questions about bioinformatics software distribution. I am adding them to the main question as well. Here are the new questions I have:

4) What do you think about having access for your tool as a library or a package for python, ruby, perl, R, other languages? Is this important and needed?

5) Are there any collaborations we can join to provide our tools to be part of bigger packages and still have the ability to publish about them, have control and support, maybe a way to sell them as well?

ADD COMMENT
1

1) It's not essential, but it would be nice.

2) There are pipelines like QIIME (http://qiime.org/) that include academic versions of commercial software such as USEARCH. Some researchers publish a research paper and then publish the workflow involved as a separate software paper.

ADD REPLY
0
7.7 years ago

Thank you, all of you, for the great comments. I have two more questions (thanks to Arnaud):

6) as a tool developer and as a user what do you think of software tools as plugins and how one can develop a plugin say for samtools?

7) is Galaxy (or similar solutions) a good way to distribute software for tool developers and for users?

Thank you

ADD COMMENT
0
7.7 years ago

Hey, wonderful biostars community! Thank you all for helping me in figuring out the best way to distribute bioinformatics software. Special thanks to shenwei356, Charles Plessy, igor, Philipp Bayer, genomax2, Alex Reynolds, Jean-Karim Heriche, Arnaud Ceol, who helped me (and hopefully you too) in understanding of bioinformatics software distribution. Thank you! Thank you! Thank you!

I will make it into the tutorial, as I promised, in the next day or two. The reason for this delay is that we at ALAPY were working nights on a fastq compressor to test our algorithms and ideas. You can check it out at <link>. Thank you for being around. I will appreciate every comment and idea about it. To discuss NGS data compression, and ALAPY Compressor in particular, I created a new topic in the tools section here. I hope that was the right place. I will continue answering questions and writing the tutorial from now on, because I really missed this for a few weeks.

ADD COMMENT
