I am probably going to be the Nth bioinformatician to write the (N+1)th BED parser, so before I start I was wondering: what good options already exist for Python? The idea is not just a script that converts BED to some other format, but a parser that turns the interval information into an abstract structure the program can consume.
In my dreams, I see a library like this:
# This super library should be created
import genefiles
# The file has no .bed extension but the library is
# smart enough to guess and recognize what it is.
mydata = genefiles.read("/home/bob/genes")
# Now we have a variable with all the info, on which we can iterate
mydata.type
>>> 'qualitative'
mydata.chromosomes
>>> 'chr1', 'chr2', 'chr3', ...
len(mydata.chr1)
>>> 5671
mydata.chr1[1]
>>> {'start': 100, 'stop': 250, 'score': 15.6, 'strand': '+', 'name': 'NS12'}
mydata.chr1[2]
>>> {'start': 400, 'stop': 500, 'score': 0.7, 'strand': '-', 'name': 'NS45'}
# Nothing stops us from writing a list comprehension now!
# (the records above use the key 'stop', not 'end')
smallgenes = [g['name'] for g in mydata.chr1 if g['stop'] - g['start'] < 100]
>>> ['NS88', 'NS76', 'NS112']
The same would work for loading WIG, GFF, etc. files. What do you guys use?
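To make the wish a bit more concrete, here is a rough sketch of the kind of structure I mean, written by hand with only the standard library. It is not an existing package: `read_bed` and `OPTIONAL_FIELDS` are names I made up, it assumes whitespace-separated BED with the first six columns at most, and it returns a plain dict keyed by chromosome rather than the attribute access (`mydata.chr1`) from the dream above.

```python
import io
from collections import defaultdict

# Optional BED columns that follow chrom/start/stop, in file order.
# (Hypothetical subset: only name, score and strand are handled here.)
OPTIONAL_FIELDS = (("name", str), ("score", float), ("strand", str))


def read_bed(handle):
    """Parse BED lines into {chromosome: [interval dict, ...]}."""
    intervals = defaultdict(list)
    for line in handle:
        line = line.strip()
        # Skip blank lines, comments and UCSC track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        cols = line.split()
        record = {"start": int(cols[1]), "stop": int(cols[2])}
        # Attach whichever optional columns are present on this line.
        for (key, cast), value in zip(OPTIONAL_FIELDS, cols[3:]):
            record[key] = cast(value)
        intervals[cols[0]].append(record)
    return dict(intervals)


# Made-up example data standing in for a real file.
example = io.StringIO(
    "track name=demo\n"
    "chr1\t100\t250\tNS12\t15.6\t+\n"
    "chr1\t400\t450\tNS45\t0.7\t-\n"
    "chr2\t10\t500\tNS88\t3\t+\n"
)
mydata = read_bed(example)
smallgenes = [g["name"] for g in mydata["chr1"] if g["stop"] - g["start"] < 100]
print(smallgenes)  # ['NS45']
```

This does no format guessing and knows nothing about WIG or GFF, which is exactly the part I would rather not reinvent.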
Reddit cross-post discussion: http://www.reddit.com/r/bioinformatics/comments/gizhg/any_good_bedwig_parsers_in_python/