Which Bioinformatics Tools Are Written In Python
12.4 years ago
Chen Sun ★ 1.1k

Which bioinformatics tools are written in python?

I ask because programmers who are new to bioinformatics, or new to Python like me, can read the source code to see how Python is used to tackle complex bioinformatics problems, beyond the problems solved in related books such as "Beginning Python for Bioinformatics".

Thank you

python • 27k views
12.4 years ago

There are so many! To get you started:

  • Biopython: set of freely available tools for biological computation
  • PyMOL: molecular visualization system
  • PyCogent: a software library for genomic biology
  • Galaxy: an open, web-based platform for data intensive biomedical research
  • pygr: sequence and comparative genomics analyses, even with extremely large multi-genome data sets
  • Biskit: facilitates the manipulation and analysis of macromolecular structures, protein complexes, and molecular dynamics trajectories
  • Ruffus: a lightweight Python module for running computational pipelines
  • Pysam: for reading and manipulating SAM files
  • msatcommander: locates microsatellite (SSR, VNTR, etc.) repeats within FASTA-formatted sequence or consensus files
  • glu-genetics: tools to store, clean, and analyze data generated by whole-genome or candidate gene association scans
  • PySCeS: a variety of tools for the analysis of cellular systems
  • OpenAlea: modules to analyse, visualize and model the functioning and growth of plant architecture
  • ETE: assists in the automated manipulation, analysis and visualization of phylogenetic and other types of trees
  • bx-python: allows for rapid implementation of genome-scale analyses
  • RSeQC: comprehensively evaluates high-throughput sequence data, especially RNA-seq data
  • incf-omni: analysis and simulation construction of the nervous system
  • genetrack: storing, querying and visualizing genomic interval oriented data
  • chimerascan: detection of chimeric transcripts in high-throughput sequencing data

Since you're new to the field of bioinformatics, you might also be interested in:

  • ANGUS, a site built around the 2010 course on Analyzing Next-Generation Sequencing Data. It contains a number of detailed tutorials on mapping, assembly, mRNAseq, ChIP-seq, and resequencing analysis using Python.
  • this article by Peter Norvig on species barcoding

To give another example of the very valid point that Dk made: the company I work for (Applied Maths) sells a bioinformatics software suite called BioNumerics. The core of the program is written in C++, but Python is used to customize the software to specific clients' needs:

  • to create custom reports,
  • to import and export non-standard formats,
  • to automate series of actions that are executed repeatedly,
  • to perform custom calculations, etc.
12.4 years ago

Most commonly used tools are written in compiled languages like C or Java, simply because they run faster and because access to low-level memory resources is crucial when analyzing large amounts of data. When Python is used in these packages, it is usually as 'pipeline glue'.

TopHat (http://tophat.cbcb.umd.edu/) is a perfect example of that. It consists of several smaller programs written in C. Python is then used to interpret user parameters and run the smaller programs in sequence.
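To make the 'pipeline glue' idea concrete, here is a minimal sketch of a wrapper that parses user parameters and runs a series of external steps in order. It is not TopHat's actual code: the option names, `echo_step`, `build_steps` and `run_pipeline` are hypothetical, and each step simply echoes its arguments through the Python interpreter so the sketch runs without any compiled helpers installed.

```python
import argparse
import subprocess
import sys


def parse_args(argv=None):
    """Interpret the user's parameters (all option names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Toy pipeline wrapper")
    parser.add_argument("reference")
    parser.add_argument("reads")
    parser.add_argument("--threads", type=int, default=1)
    return parser.parse_args(argv)


def echo_step(*words):
    """Build a placeholder command that prints its arguments.

    A real wrapper would return the command line of a compiled helper
    program here instead.
    """
    return [sys.executable, "-c", "import sys; print(*sys.argv[1:])", *words]


def build_steps(args):
    """Translate parsed options into an ordered list of commands."""
    return [
        echo_step("index", args.reference),
        echo_step("align", args.reads, str(args.threads)),
        echo_step("report"),
    ]


def run_pipeline(steps):
    """Run each command in sequence and collect what it printed."""
    outputs = []
    for command in steps:
        # check=True raises CalledProcessError when a step exits non-zero,
        # so later steps never run on incomplete results.
        done = subprocess.run(command, check=True, capture_output=True, text=True)
        outputs.append(done.stdout.strip())
    return outputs
```

Calling `run_pipeline(build_steps(parse_args(["ref.fa", "reads.fq", "--threads", "4"])))` runs the three placeholder steps one after another. The heavy lifting stays in the external programs; Python only decides what to run and in which order.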

Interpreted languages like Python or Perl are usually used for format conversions or statistics reporting.

A good place to start for real examples is to read up on Biopython (http://biopython.org/wiki/Biopython). Their tutorials have tons of real-life examples. You can come up with small projects for yourself, like writing a script that analyzes the GC content of a FASTA file, or a script that parses a BLAST output file and filters on various criteria.
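As a starting point for the GC content project, here is a minimal sketch that uses only the standard library. The function names `read_fasta` and `gc_content` are my own, and the parser assumes a plain, well-formed FASTA file; Biopython's `Bio.SeqIO` module offers a more robust parser once you have it installed.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_content(sequence):
    """Return the fraction of G and C bases, ignoring case.

    Every character counts towards the length, including ambiguity
    codes such as N. An empty sequence gives 0.0.
    """
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Small in-memory example; with a real file use: open("my.fasta")
example = io.StringIO(">seq1\nGGCC\nAATT\n>seq2\nATAT\n")
for name, seq in read_fasta(example):
    print(name, f"{gc_content(seq):.2f}")
# prints:
# seq1 0.50
# seq2 0.00
```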


I will chime in to say QIIME (http://www.qiime.org) is another example.


I believe you're right about the speed consideration: the ability of C or C++ to access low-level memory lets one tune a program as close to the hardware as possible (one can also try assembler). But I'm sure that the way the code is written to achieve a specific task is more critical. See, for instance, Leszek's answer in this Biostar discussion: http://www.biostars.org/post/show/10353/how-to-efficiently-parse-a-huge-fastq-file/. For the purposes of this thread I would say: Python is good, but use dict() and set() types instead of lists whenever you can.
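A small sketch of why the container matters for lookups. The read identifiers below are made up for illustration, and the timings will differ from machine to machine, so no numbers are quoted here.

```python
import timeit

# Hypothetical read identifiers; in practice these would come from a FASTQ file.
all_ids = [f"read_{i}" for i in range(2000)]
wanted_list = all_ids[::2]      # every second read, kept as a list
wanted_set = set(wanted_list)   # the same identifiers, hashed


def count_wanted(ids, wanted):
    """Count how many ids occur in `wanted` (any container supporting `in`)."""
    return sum(1 for read_id in ids if read_id in wanted)


# Both containers give the same answer...
assert count_wanted(all_ids, wanted_list) == count_wanted(all_ids, wanted_set)

# ...but the list is scanned element by element on every lookup, while the
# set uses a hash lookup that takes roughly constant time on average.
list_seconds = timeit.timeit(lambda: count_wanted(all_ids, wanted_list), number=3)
set_seconds = timeit.timeit(lambda: count_wanted(all_ids, wanted_set), number=3)
print(f"list: {list_seconds:.4f}s  set: {set_seconds:.4f}s")

# A dict gives the same fast lookup when a value is attached to each key.
read_lengths = {read_id: 100 + i % 50 for i, read_id in enumerate(all_ids)}
print(read_lengths["read_7"])
```

Lists remain the right choice when order or duplicates matter; the gain shows up when the main operation is "is this item present?" or "what belongs to this key?".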


My answer is malformatted due to the transition of the website. If you read my answer together with the reformatted table in a separate answer, you will see that a proper C/C++ implementation is 4-fold faster than Leszek's script. The C++ one is slow due to a stdio synchronization issue which I only learned about recently. Also, each data structure has its own use; it just happens that dict() is the better choice in that example.


OK, I see. When I last read that answer properly, the race for the best implementation was not over yet :-) However, that still supports the point that the way the code is written is critical, whatever the programming language. That is a very good post; I like Biostar especially for that kind of thing. BTW, I'll go and compile Pierre's code right away.


On formatting: a new fix is incoming and will most likely be applied over the weekend.

12.4 years ago
Adam ★ 1.0k

The short-read mapper, Stampy, is written in Python. http://www.well.ox.ac.uk/project-stampy

12.4 years ago
Woa ★ 2.9k

I would suggest searching Google or Google Scholar for your topic of interest plus something like "python script" or "python code", e.g.:

protein structure superposition + "python script"

12.4 years ago

The Biopieces suite is written in Python and Ruby.


Very little of it is in Python yet. Most is Perl and Ruby.

10.3 years ago
sgruenwald ▴ 10

A good way to find interesting modules is to search the Python Package Index (PyPI), the repository that pip installs from:

https://pypi.python.org/pypi?%3Aaction=search&term=bio&submit=search

10.3 years ago

Our Go-Elite Gene Ontology and general gene set overrepresentation analysis tool is written in Python. It was described here.

5.0 years ago

The best way to find Python modules for bioinformatics is to use the classifier option (https://pypi.org/classifiers/) on PyPI.

https://pypi.org/search/?q=&o=&c=Topic+%3A%3A+Scientific%2FEngineering+%3A%3A+Bio-Informatics
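If you want to build such a search link for other classifiers, a small sketch follows. The parameter names `q`, `o` and `c` are simply copied from the URL above; they are not a documented API and may change, and `pypi_classifier_search_url` is a name I made up.

```python
from urllib.parse import urlencode


def pypi_classifier_search_url(classifier, query=""):
    """Build a PyPI search URL filtered by one trove classifier.

    The q/o/c parameter names mirror the URL quoted in this answer;
    treat them as an assumption, not a stable interface.
    """
    params = {"q": query, "o": "", "c": classifier}
    return "https://pypi.org/search/?" + urlencode(params)


url = pypi_classifier_search_url(
    "Topic :: Scientific/Engineering :: Bio-Informatics"
)
print(url)
```

Passing a second argument, e.g. `pypi_classifier_search_url(classifier, "fasta")`, fills in the free-text part of the query while keeping the classifier filter.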
