I am wondering whether there is some tool/scripts available to create a database of genomic data from a set of files? These files are mostly in bed format, but there are also results files from peaks called with MACS and gene expression data from cufflinks.
Say I need a list of genes ( and coordinates) that have a peak for a certain factor, but not another, and genes genes must be expressed. At the moment, I use custom R scripts, which do a series of merges, annotations, and filtering steps to get the final table. However, this is not very elegant or efficient, and to me it seems like a classical problem that could be solved by implementing a database.
As it is often the case in bioinformatics, it is likely that someone already came up with a solution. Trouble is, I can't find it. So before I start implementing something like that myself, does anyone know of an existing solution?
how about storing your files as bzgip+tabix and querying everything using tabix or bedtools ?
I have not thought of tabix - well, I never really used it - but you got me curious.
From what I just read about it, each bed file would still be separate and the query would be done independently right? Let's say each bed represent different factor's binding sites, and I would like to know to which of my genes is bound by a factor but not another?
Building on what Pierre suggested... since you're already using R, are you using the genomeIntervals package? It has some of the functionality of bedtools and might be easy to integrate to your existing scripts. Although, it sounds like you're interested in using SQL and Python partly for the learning experience, so in that case go for it (if you have the time)!
I am using R, but not that package.
No one has the time :) But you are right. For me this would be a nice (and useful) side project to make something more large scale in python (already in use for smaller stuff), and learn SQL in the process (it would be implemented in sqlite). As mentioned, my current approach is good enough, but it somehow feels rough.
Note that the GenomicRanges package is also a good resource and is a bit more mainstream than genomeIntervals. Your mileage may vary, though. In python, consider the bx-python library.
In R I already use GRanges, but in Python so far it has been pybedtools. Very good package btw.
There is also the pygenes package in python.