I am quite familiar with microarray-based transcriptomics but don't have much experience with RNA-Seq.
I am interested in published transcriptomics data, found in the GEO database at the NCBI. For microarray projects, one can download the data in various formats that I can work with. However, for RNA-Seq projects, the GEO database offers only the download as "bedgraph" files. I read and understand what these are, but I am not sure how to use them for analyzing transcriptomics data.
I expected some output with gene names and expression values for the different conditions. What I get instead is a bedgraph file (one track per condition). The GEO data is not human, and the first three columns of the bedgraph file are supposed to contain position information. This is a small section of one of the files:
track type=bedGraph name="TopHat - read coverage"
C36799851 0 27 0
C36799851 27 98 1
C36800049 0 0 0
C36800049 0 1 2
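As far as I understand the format, each data line is contig, start, end, value, with 0-based, half-open coordinates, so a line covers end - start bases. A minimal Python sketch of reading such lines one at a time (so the whole file never has to sit in memory) and summarising coverage per contig; the function name `summarize_bedgraph` is my own and not part of any tool:

```python
from collections import defaultdict

def summarize_bedgraph(lines):
    """Return {contig: (covered_bases, mean_depth)} from an iterable of bedGraph lines."""
    bases = defaultdict(int)        # number of bases described per contig
    depth_sum = defaultdict(float)  # sum of (interval width * value) per contig
    for line in lines:
        line = line.strip()
        # Skip blank lines and header/comment lines such as the "track ..." line.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        contig, start, end, value = line.split()[:4]
        width = int(end) - int(start)  # half-open interval, so no +1
        bases[contig] += width
        depth_sum[contig] += width * float(value)
    return {c: (bases[c], depth_sum[c] / bases[c] if bases[c] else 0.0)
            for c in bases}

# The excerpt shown above, used as test input.
example = [
    'track type=bedGraph name="TopHat - read coverage"',
    "C36799851 0 27 0",
    "C36799851 27 98 1",
    "C36800049 0 0 0",
    "C36800049 0 1 2",
]
summary = summarize_bedgraph(example)
```

For a real file, passing an open file handle (`with open(path) as fh: summarize_bedgraph(fh)`) streams it line by line instead of loading everything at once.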
I understand that these files are meant to be displayed in the UCSC genome browser. I tried to upload the file, but got an error message about insufficient memory (the bedgraph files are huge). So, my first two questions are:
- how am I supposed to find the correct genome browser that matches the bedgraph file? I know the organism, but there might be different versions, releases, etc.
- should I use the 'upload' function, and what can I do about the memory problem?
My most important question, however, is more fundamental. Even if I manage to display multiple tracks like this in the browser, how can I make sense of these data, e.g. by searching for genes that show big expression changes between two conditions? There must be a solution without using a genome browser - maybe by mapping the positional information in the bedgraph files to the genes.
Any idea?
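To make the idea of mapping positions to genes concrete, here is a rough Python sketch. It assumes gene coordinates are available from an annotation of the same assembly, given as 0-based, half-open intervals; the gene name and coordinates below are made up for illustration, and the function names are my own. It only yields mean read coverage per gene, which is not the same as read counts and is not normalised between samples, so it is at best a crude proxy for expression.

```python
import math

def gene_mean_coverage(bedgraph_lines, genes):
    """Mean coverage per gene.

    genes: {name: (contig, start, end)}, 0-based half-open
           (hypothetical annotation, not taken from the GEO entry).
    """
    totals = {name: 0.0 for name in genes}
    for line in bedgraph_lines:
        fields = line.split()
        # Skip blank, header, and malformed lines.
        if len(fields) < 4 or fields[0] in ("track", "browser"):
            continue
        contig, start, end = fields[0], int(fields[1]), int(fields[2])
        value = float(fields[3])
        # Naive scan over all genes; fine for a sketch, slow for real data.
        for name, (g_contig, g_start, g_end) in genes.items():
            if g_contig != contig:
                continue
            overlap = min(end, g_end) - max(start, g_start)
            if overlap > 0:
                totals[name] += overlap * value
    return {name: totals[name] / (genes[name][2] - genes[name][1])
            for name in genes}

def log2_ratio(a, b, pseudo=1.0):
    """log2 fold change of b over a, with a pseudocount to avoid division by zero."""
    return math.log2((b + pseudo) / (a + pseudo))

# Made-up gene on one of the contigs from the excerpt above.
genes = {"geneA": ("C36799851", 10, 60)}
lines = [
    'track type=bedGraph name="TopHat - read coverage"',
    "C36799851 0 27 0",
    "C36799851 27 98 1",
]
coverage = gene_mean_coverage(lines, genes)
```

Running this once per condition and comparing the per-gene values with `log2_ratio` would give a first impression of which genes differ, but without replicates or proper normalisation it says nothing about significance.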
If you are interested in using this data, you should find the original FASTQ files and do the alignment and counting yourself instead of depending on these derived files.
Ok, but I assume the derived files must be good for SOMETHING. Otherwise, they wouldn't offer them for download.
Yes, for visualization on a genome browser. For anything else they are utterly useless. You really should get raw data and get raw counts from that.
Ok. Three aspects: i) as explained by Luis Nassar and the documents he links to, the GEO file as such cannot be displayed in a genome browser; it has to be converted to a 'bigwig' file and put on a publicly accessible web server (which I don't have). If the GEO database offers this file for display in a genome browser, why don't they just offer a bigwig file directly on one of their servers? ii) I tried to get the raw data, but there is no easy path from the GEO entry to a FASTQ file. Maybe there is one, but I just don't see it. iii) With the microarray-based projects, GEO offers a path to the raw data, but they also offer processed data in several formats. I am not talking about 'analyzed data', but some kind of text format that has gene names and intensities for the different conditions.