5.9 years ago
field.cady
I work on the Semantic Scholar academic search engine, and wanted to let people know about a blog post we just put up, where we combed through 7 million biomedical papers and found the most-mentioned GitHub repos. The top one was Sickle, and 14 of the top 15 were bioinformatics tools (the 15th was Keras, a general deep learning library). Enjoy!
This work could definitely be extended to other hosting services like sourceforge.net. All we need to know is what the URLs look like, so we can find them in the papers' full text.
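To make the URL-matching idea concrete, here is a minimal sketch in Python. It is not the code behind the blog post: the regular expression, the function names (`repos_in_text`, `count_mentions`), and the choice to count each repo at most once per paper are all assumptions for illustration. Supporting another hosting service would mostly mean adding a pattern for its URL shape.

```python
import re
from collections import Counter

# Assumed URL shape: github.com/<owner>/<repo>. Real full text is messier
# (line breaks inside URLs, trailing punctuation), so treat this as a start.
GITHUB_REPO = re.compile(
    r"github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9_.-]+)",
    re.IGNORECASE,
)


def repos_in_text(text):
    """Return the set of (owner, repo) pairs mentioned in one paper's text."""
    found = set()
    for owner, repo in GITHUB_REPO.findall(text):
        repo = repo.rstrip(".")  # drop a sentence-ending period
        if repo.lower().endswith(".git"):
            repo = repo[:-4]  # treat clone URLs as the same repo
        if repo:
            found.add((owner.lower(), repo.lower()))
    return found


def count_mentions(papers):
    """Count, for each repo, how many papers mention it at least once."""
    counts = Counter()
    for text in papers:
        counts.update(repos_in_text(text))
    return counts
```

Calling `count_mentions(list_of_full_texts).most_common(15)` would then give a top-15 list. Counting per paper rather than per occurrence is a design choice here, so that one paper repeating a URL many times does not inflate the ranking.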
Also, it is DEFINITELY true that there is a problem with people referring to software by name or by publication rather than by URL. Unfortunately that's really hard to solve, because we would need prior knowledge of which software packages correspond to which GitHub repos. There are a couple of repos on this list, for example, whose associated paper has more citations than any repo on the list has mentions. Another way to look at it, though, is that this work is biased toward small-ish software packages that are unlikely to have a full-fledged paper associated with them.
It would be pretty interesting if this somehow addressed software-by-name or software-by-publication references too, since a GitHub URL is not necessarily that common when you are citing a tool (though it is quite common when you are presenting new software).