Any Good Bed/Wig Parsers In Python ?
6
11
Entering edit mode
13.7 years ago
Ingineer ▴ 120

I am probably going to be the N-th bioinformatician to write the N+1th BED parser so I was wondering, before I start, what good options exist for the python language ? The idea is not just a script that will convert BED to some other format, but a parser that would transform the interval information into an abstract structure consumable by the program.

In my dreams, I see a library like this:

# This super library should be created
import genefiles
# The file has no .bed extension but the library is
# smart enough to guess and recognize what it is.
mydata = genefiles.read("/home/bob/genes")

# Now we have a variable with all the info, on which we can iterate
mydata.type
>>> 'qualitative'
mydata.chromosomes
>>> 'chr1', 'chr2', 'chr3', ...
len(mydata.chr1)
>>> 5671
mydata.chr1[1]
>>> {'start': 100, 'stop': 250, 'score': 15.6, 'strand': '+', 'name': 'NS12'}
mydata.chr1[2]
>>> {'start': 400, 'stop': 500, 'score': 0.7, 'strand': '-', 'name': 'NS45'}

# Nothing stops us from writing a list comprehension now !
smallgenes = [g['name'] for g in mydata.chr1 if g['end']-g['start'] < 100]
>>> ['NS88', 'NS76', 'NS112']

Same would work for loading WIG/GFF etc. files. What do you guys use ?

python parsing bed wiggle • 16k views
ADD COMMENT
1
Entering edit mode
ADD REPLY
14
Entering edit mode
13.7 years ago

bx-python handles a number of these formats, including BigWig and BigBed which are nice ways to store your interval and score data compressed on disk but still have random access:

Most libraries will be iterative parsers to help be memory conscious on large files. BigWig and BigBed get the closest to your dream API without requiring the entire file be loaded in memory.

ADD COMMENT
2
Entering edit mode

Awesome, thanks for these. I submitted a patch to bx-python to handle the non-tab cases: https://bitbucket.org/james_taylor/bx-python/issue/27/allow-backup-to-space-separated-files-in There was already a flag to allow for your problematic strand case, so if you do: GenomicIntervalReader(file, fix_strand=True, allow_spaces=True) with this patch all of your files will pass. The 14_fields example still passes (instead of your expected fail), but in my opinion this behavior is more correct since people abuse the format by adding extra columns.

ADD REPLY
1
Entering edit mode

Integrated the tests into the gMiner project and added other tests, if you want to use them further or apply them to some other library you can find the latest version here: https://github.com/bbcf/gMiner/tree/master/Extras/tracks/qual/bed/should_pass

ADD REPLY
0
Entering edit mode

My ideal BED or WIG parser would of course be iterative, yes.

I looked at the BED parser code you linked to. It seams to be extremely unrobust and incompatible with the UCSC definition of the format. http://genome.ucsc.edu/FAQ/FAQformat.html#format1

I probably won't be taking this one.

ADD REPLY
0
Entering edit mode

What I mean by robust is: What if the separator is a space instead of a tab ? What if there is a comment at the end of a line ? What if the BED has only three fields ? Do you crash, raise an error, or proceed none the less ?

ADD REPLY
0
Entering edit mode

By my reading, the bx-python handles all those cases. It splits on whitespace. It only uses the initial fields so extra comments at the end of a line would be ignored. It handles 3, 4 and 5 field BED. The only addition I could suggest is capturing strand as well. If you find cases that the parser fails on, why not write up test cases and bring them up to the author to help improve the library?

ADD REPLY
0
Entering edit mode

Challenge accepted. Try my script here: http://freesv.ch/bed_tester.tar.gz Results are better than I thought. But one must realize that many tests pass just because the information is discarded by the parser.

ADD REPLY
0
Entering edit mode

Integrated the tests into the gMiner project and added other tests, if you want to use them further or apply them to some other library you can find the latest version here: https://github.com/bbcf/gMiner/tree/master/Extras/tests/bed_files

ADD REPLY
8
Entering edit mode
13.7 years ago
brentp 24k

There is also a Bed parser in BedTools which you can access from the python/cython wrapper bedtools-python

There's another python wrapper pybedtools that's mostly for calling the bedtools executables, but it does have a feature module (that could be stand-alone) for Bed, GFF, GTF, and other feature "parsing".

And, there's a Bed module here that puts the file into a numpy array and allows you to do things like your list comprehension above

   b = Bed('some.bed')
   small_genes = b[b['end'] - b['start'] < 100]
ADD COMMENT
0
Entering edit mode

Interesting tools ! The bedtools-python seams to work well but is written in c++ unfortunately, making it hard for me to modify or integrate. The pybedtools module really seams like just a wrapper. Finally, rhe numpy array is more interesting, but is not robust or iterative.

ADD REPLY
0
Entering edit mode

I'd say that pybedtools is more than "just a wrapper" if you try it out and find it insufficient, you could report to the authors. Likewise, if you report how you find the numpy array versionto be less robust, perhaps it could be improved.

ADD REPLY
7
Entering edit mode
13.3 years ago
Ryan Dale 5.0k

Shameless plug: pybedtools now does almost everything you asked (with the exception of the chromosome attribute access), and it's in Cython for speed.

For example, say we have this BED file:

$ cat a.bed
chr1    1    100    feature1    0    +
chr1    100    200    feature2    0    +
chr1    150    500    feature3    0    -
chr1    900    950    feature4    0    +

And this GFF file:

$ cat d.gff
chr1    fake    gene    50    300    .    +    .    ID=gene1
chr1    fake    mRNA    50    300    .    +    .    ID=mRNA1;Parent=gene1;
chr1    fake    CDS    75    150    .    +    .    ID=CDS1;Parent=mRNA1;
chr1    fake    CDS    200    275    .    +    .    ID=CDS2;Parent=mRNA1;
chr1    fake    rRNA    1200    1275    .    +    .    ID=rRNA1;

Then:

>>> from pybedtools import BedTool

pybedtools has auto-detection of BED/GFF/VCF based on position of string and integer fields:

>>> a = BedTool('a.bed')
>>> d = BedTool('d.gff')

>>> a.file_type
'bed'

>>> d.file_type
'gff'

BedTools are iterable, and you get an Interval for each feature. So you can do list comprehension:

>>> [i.name, i.strand) for i in a if len(i) < 150]
[('feature1', '+'), ('feature2', '+'), ('feature4', '+')]

or indexing:

>>> feature = d[0]

You have GFF attribute access (lazily parsed):

>>> feature['ID']
'gene1'

And arbitrary access of fields:

>>> d[1].fields[1]
'mRNA'

...not to mention access to all of BEDTools:

>>> not_in_d = a.intersect(d, v=True)
>>> print not_in_d
chr1    900    950    feature4    0    +
ADD COMMENT
5
Entering edit mode
13.7 years ago

At least a GFF parser is available with the newer versions of BioPython. For code examples see their wiki.

There is a question about converting BED to GFF and some tools for other conversions have also been suggested here by Brad.

ADD COMMENT
1
Entering edit mode

Michael, the GFF parser isn't yet rolled into Biopython; the code you linked to is actually older stuff that hasn't been maintained and we'll hope to replace with the new parser. The GFF code is available from: https://github.com/chapmanb/bcbb/tree/master/gff and works as described in the wiki. Thanks for mentioning it.

ADD REPLY
1
Entering edit mode

Just tried the GFF parser, works nicely for me!

ADD REPLY
0
Entering edit mode

Thanks for pointing that out, I haven't worked with it myself so far.

ADD REPLY
0
Entering edit mode

Probably because Brad put it there in the meantime ;-)

ADD REPLY
3
Entering edit mode
12.8 years ago
Xapple ▴ 30

I just finished work on a python package that can read most of the genomic file formats such as BED, WIG, GFF, BedGraph in a simple and standard way.

It's free and open-source. It will read many different kinds of formats in the same simple syntax:

import track
with track.load('tracks/rp_genes.bed') as rp:
    data = rp.read('chr3')

You install it by typing:

$ sudo easy_install track

Particular attention was brought to the documentation:

http://xapple.github.com/track/

Please tell me what you think about it !

ADD COMMENT
1
Entering edit mode

Good documentation, and easy to install. Does it support interval operations like the intersection of two BEDs? Don't see that in the documentation

ADD REPLY
0
Entering edit mode
ADD REPLY
2
Entering edit mode
13.7 years ago
Andreas ★ 2.5k

Maybe the following helps

Never tried them myself though.

Andreas

ADD COMMENT
0
Entering edit mode

Yeah, that's one more quick and dirty BED parser that is not very robust.

ADD REPLY

Login before adding your answer.

Traffic: 2434 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6