Question

GEO database - how to use their bedgraph files

0

5.2 years ago

Suicyte ▴ 10

I am well familiar with microarray-based transcriptomics but don't have much experience with RNA-Seq.

I am interested in published transcriptomics data, found in the GEO database at the NCBI. For microarray projects, one can download the data in various formats that I can work with. However, for RNA-Seq projects, the GEO database offers only the download as "bedgraph" files. I read and understand what these are, but I am not sure how to use them for analyzing transcriptomics data.

I expected some output with gene names and expression values for the different conditions. What I get instead is a set of bedgraph files (one track per condition). The GEO data is not human, and the first three columns of the bedgraph file are supposed to contain position information. This is a small section of one of the files:

track type=bedGraph name="TopHat - read coverage"
C36799851       0       27      0
C36799851       27      98      1
C36800049       0       0       0
C36800049       0       1       2

I understand that these files are meant to be displayed in the UCSC genome browser. I tried to upload the file, but got an error message about too little memory (the bedgraph files are huge). So, my first two questions are:

1. How am I supposed to find the correct genome browser that matches the bedgraph file? I know the organism, but there might be different versions, releases, etc.
2. Should I use the 'upload' function, and what can I do about the memory problem?

My most important question, however, is more fundamental. Even if I manage to display multiple tracks like this in the browser, how can I make sense of these data, e.g. by searching for genes that show big expression changes between two conditions? There must be a solution without using a genome browser - maybe by mapping the positional information in the bedgraph files to the genes.

Any idea?

RNA-Seq bedgraph • 3.2k views

ADD COMMENT • link updated 5.2 years ago by Luis Nassar ▴ 670 • written 5.2 years ago by Suicyte ▴ 10

0

If you are interested in using this data, you should find the original fastq files and do the alignment and counting yourself instead of depending on these derived files.

ADD REPLY • link 5.2 years ago by GenoMax 148k

0

Ok, but I assume the derived files must be good for SOMETHING. Otherwise, they wouldn't offer them for download.

ADD REPLY • link 5.2 years ago by Suicyte ▴ 10

0

Yes, for visualization on a genome browser. For anything else they are utterly useless. You really should get raw data and get raw counts from that.

ADD REPLY • link 5.2 years ago by ATpoint 86k

0

Ok, three aspects:

i) As explained by Luis Nassar and the documents he links to, the GEO file as such cannot be displayed in a genome browser; it has to be converted to a 'bigwig' file and put on a publicly accessible web server (which I don't have). If the GEO database offers this file for display in a genome browser, why don't they just offer a bigwig file directly on one of their servers?

ii) I tried to get the raw data, but there is no easy path from the GEO entry to some FASTQ file. Maybe there is one, but I just don't get it.

iii) With the microarray-based projects, GEO offers a path to the raw data, but they also offer processed data in several formats. I am not talking about 'analyzed data', but some kind of text format that has gene names and intensities for the different conditions.

ADD REPLY • link 5.2 years ago by Suicyte ▴ 10

score 3 · Answer 1 · 2019-10-30

Hello,

To answer your first question regarding the correct Genome Browser, you will have to check which assembly was used for the GEO data and see whether we host that assembly for your organism as a native assembly. You can check this on the organism gateway page, e.g. https://genome.ucsc.edu/cgi-bin/hgGateway. This page includes an NCBI assembly accession number, which should match the one given in GEO. If you are unsure, you can send us the assembly or GEO page and we can check.
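As a rough complement to comparing accession numbers, one can check whether the sequence names and coordinates in the bedGraph fit a candidate assembly. The sketch below assumes you have a `chrom.sizes`-style listing (name and length per line) for that assembly; the function names and the example sizes are hypothetical and only for illustration.

```python
def read_chrom_sizes(lines):
    """Parse 'name size' lines (chrom.sizes style) into a dict."""
    sizes = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            sizes[parts[0]] = int(parts[1])
    return sizes


def bedgraph_extents(lines):
    """Return {sequence name: largest end coordinate} seen in a bedGraph."""
    extents = {}
    for line in lines:
        fields = line.split()
        # Skip blank lines and 'track' / 'browser' header lines.
        if len(fields) != 4 or fields[0] in ("track", "browser"):
            continue
        chrom, end = fields[0], int(fields[2])
        extents[chrom] = max(extents.get(chrom, 0), end)
    return extents


def check_against_assembly(bedgraph_lines, chrom_sizes_lines):
    """List names missing from the assembly and names whose data overrun it."""
    sizes = read_chrom_sizes(chrom_sizes_lines)
    extents = bedgraph_extents(bedgraph_lines)
    missing = sorted(n for n in extents if n not in sizes)
    too_long = sorted(n for n, end in extents.items()
                      if n in sizes and end > sizes[n])
    return missing, too_long


if __name__ == "__main__":
    bedgraph = [
        'track type=bedGraph name="TopHat - read coverage"',
        "C36799851\t0\t27\t0",
        "C36799851\t27\t98\t1",
        "C36800049\t0\t1\t2",
    ]
    # Hypothetical sizes for a candidate assembly (not real values).
    candidate = ["C36799851\t50"]
    print(check_against_assembly(bedgraph, candidate))
```

Many missing names, or data that extend past the end of a sequence, suggest that the candidate assembly is not the one the reads were aligned to. An empty result is necessary but not sufficient evidence of a match.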

If we do not have the assembly, you have the option of creating an assembly hub (https://genome.ucsc.edu/goldenpath/help/hubQuickStartAssembly.html), though that may not be worth doing if you just want to get to the raw data.

Regarding the second question, we have a limit on the size of files that can be uploaded as custom tracks. After that limit, you would have to create a big* data track, hosted in a remote location. In the case of bedGraph (https://genome.ucsc.edu/goldenPath/help/bedgraph.html), you would convert it to bigWig. See the following page and example for more information (https://genome.ucsc.edu/goldenPath/help/bigWig.html#Ex3). Here is a help page on hosting as well (https://genome.ucsc.edu/goldenPath/help/hgTrackHubHelp.html#Hosting).
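The bigWig help page linked above describes the conversion as: remove any `track` or `browser` lines, sort the bedGraph by sequence name and start position, then run `bedGraphToBigWig in.bedGraph chrom.sizes out.bw`. The following is a minimal sketch of the cleaning and sorting step only, assuming the file fits in memory (for very large files a command-line sort is likely the better choice). The function name is hypothetical.

```python
def prepare_bedgraph(lines):
    """Drop header lines and sort data rows by (sequence name, start).

    Python compares strings by code point, which for ASCII names gives the
    same order as a byte-wise (C locale) sort on the first column.
    """
    rows = []
    for line in lines:
        fields = line.split()
        # Data rows have exactly four columns: name, start, end, value.
        if len(fields) != 4 or fields[0] in ("track", "browser"):
            continue
        chrom, start, end, value = fields
        rows.append((chrom, int(start), int(end), value))
    rows.sort(key=lambda r: (r[0], r[1]))
    return ["\t".join(str(x) for x in row) for row in rows]


if __name__ == "__main__":
    raw = [
        'track type=bedGraph name="TopHat - read coverage"',
        "C36800049\t0\t1\t2",
        "C36799851\t27\t98\t1",
        "C36799851\t0\t27\t0",
    ]
    for row in prepare_bedgraph(raw):
        print(row)
```

The sorted output can be written to a file and passed to `bedGraphToBigWig` together with a `chrom.sizes` file for the matching assembly. The resulting bigWig then needs to be hosted at a URL the Genome Browser can reach.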

Once you have the data in the Genome Browser, you can do more than just visualize it. For example, you can intersect the data with gene tracks (if they are available for that assembly) using tools such as the Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables).

As you have said, however, if you are just trying to map the data to genes there may be other more direct approaches to take.
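One such direct approach, sketched here under stated assumptions, is to average the bedGraph coverage over each gene's interval. This assumes you have gene coordinates (for example from a BED or GTF annotation) for the same assembly the bedGraph was built on; the gene name and coordinates below are made up for illustration. The result is a rough per-gene signal, not a read count, so it is better suited to exploratory comparison than to statistical testing.

```python
from collections import defaultdict


def load_bedgraph(lines):
    """Group bedGraph intervals by sequence name as (start, end, value)."""
    by_chrom = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Skip blank lines and 'track' / 'browser' header lines.
        if len(fields) != 4 or fields[0] in ("track", "browser"):
            continue
        chrom, start, end, value = fields
        by_chrom[chrom].append((int(start), int(end), float(value)))
    return by_chrom


def mean_coverage(by_chrom, chrom, gene_start, gene_end):
    """Length-weighted mean coverage over [gene_start, gene_end).

    Uses a simple linear scan; a real pipeline would use an interval
    index or a dedicated tool for files of realistic size.
    """
    length = gene_end - gene_start
    if length <= 0:
        return 0.0
    total = 0.0
    for start, end, value in by_chrom.get(chrom, []):
        overlap = min(end, gene_end) - max(start, gene_start)
        if overlap > 0:
            total += value * overlap
    return total / length


if __name__ == "__main__":
    track = load_bedgraph([
        'track type=bedGraph name="TopHat - read coverage"',
        "C36799851\t0\t27\t0",
        "C36799851\t27\t98\t1",
    ])
    # Hypothetical gene spanning the whole contig.
    print("geneA", mean_coverage(track, "C36799851", 0, 98))
```

Running this once per condition and tabulating the values per gene gives a gene-by-condition matrix similar in shape to the processed microarray tables, with the caveat that coverage depends on sequencing depth and would need normalization before conditions are compared.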

Hopefully this answers some of your questions. If you have additional questions regarding the Genome Browser, the best way to reach us is to email genome@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genome-www@soe.ucsc.edu.

We do periodically check Biostars, and tagging your post with UCSC helps us find it.