Hello -
I am trying to work with some very large BED files of ChIP-Seq results obtained from GEO (specifically, this data here). There are a few BED files that I am hoping to visualize in addition to the MACS peaks that were called. I wanted to use Bioconductor for this, but the BED files (which contain upwards of 24 million aligned reads) cause R to choke. I am just getting started with Bioconductor.
What's the best way to visualize the reads together with the peaks called by MACS? Is this something that can be done with Bioconductor?
Thank you!
How long should it take to read in the BED file? This works with no problem on smaller files, but once they get up to 500 MB or a few GB it chokes. I am on a brand-new MacBook Pro with 8 GB of memory.
It shouldn't take more than a few minutes in my experience. However, you can check where the choke point is. Do you use top on your Mac? Open a terminal (Terminal.app), type 'top' at the command line, then start your import process. By watching top you will be able to see how much memory is being consumed and whether you are running out. Type 'q' to quit top.
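If you'd rather log memory use than watch top interactively, you can poll the R process's resident set size from a script. A minimal sketch, assuming you substitute the PID of your R session (the polling interval and iteration count here are placeholders):

```shell
# Poll resident memory (RSS, in kilobytes) of a process while an import runs.
# Replace "$$" (this script's own PID) with your R session's PID,
# e.g. PID=$(pgrep -n R)
PID=$$
for i in 1 2 3; do
    ps -o rss= -p "$PID"   # prints the current RSS in KB
    sleep 1
done
```

Redirect the output to a file and you have a rough memory profile of the import over time.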
Still no luck getting this to work: all my memory gets eaten up. Granted, some of these are large files (2-3 GB). Am I missing something major here? I see how Bioconductor could be useful, but I don't quite understand how that can be the case if it can't work with large files. None of the tutorials cover this import process. Although I would hope you don't have to load entire files into memory to work with them?
I am trying to compress and index the files with tabix right now; we'll see if that helps. Thanks for your suggestions.
Update: zipping and indexing did not work.
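For reference, one common stumbling block with tabix is that it requires the BED file to be coordinate-sorted and compressed with bgzip (from htslib), not plain gzip; a regular .gz file will not index. A sketch of the workflow, using a tiny placeholder BED file in place of the real data:

```shell
# Create a tiny BED file for illustration (placeholder data).
printf 'chr1\t100\t200\nchr1\t50\t150\nchr2\t10\t20\n' > reads.bed
# tabix needs a coordinate-sorted, bgzip-compressed file (plain gzip will not work).
sort -k1,1 -k2,2n reads.bed > reads.sorted.bed
# bgzip and tabix ship with htslib; guard in case they are not installed.
if command -v bgzip >/dev/null && command -v tabix >/dev/null; then
    bgzip -f reads.sorted.bed          # -> reads.sorted.bed.gz
    tabix -p bed reads.sorted.bed.gz   # -> reads.sorted.bed.gz.tbi
fi
```

Note that indexing only speeds up range queries (fetching reads overlapping a region); it does not reduce the memory needed to load the whole file at once.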
You mention that all your memory gets eaten up. Have you examined the relationship between file size and memory? In other words, if you sample 1 million reads into a BED file (head -1000000 your.bed > new.bed) and try that, how much memory is used? Then try 2 million: how much is used? This will help you extrapolate the nature of your problem. I have noticed that with some genomes rtracklayer can consume resources and be inefficient (e.g. non-canonical genomes with thousands of contigs).

If you're simply reading in a BED file, what happens in the tests above if you just read it in as a data frame? You can always convert it to one of the efficient Bioconductor objects after you have it in the R environment. Bioconductor can work with large files, but different packages and functions are not all guaranteed to work within the same hardware footprint.
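The sampling experiment above can be scripted. A minimal sketch, using a small synthetic BED file in place of the real 24-million-read one (the file names and slice sizes are placeholders; scale the `head` count up stepwise as suggested):

```shell
# Build a synthetic BED file, then take an increasing slice with head,
# mirroring the suggestion above (head -1000000 your.bed > new.bed).
for i in $(seq 1 5000); do
    printf 'chr1\t%d\t%d\n' "$i" "$((i + 50))"
done > your.bed
head -n 1000 your.bed > new.bed    # first 1,000 reads; then try 2,000, 4,000, ...
wc -l < new.bed
```

Import each slice into R in turn (via rtracklayer's import or a plain read.table) while watching memory; if usage grows much faster than linearly with read count, that points at the import machinery rather than the data itself.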