Question

Nih Roadmap Epigenomics Project Data Listings

11.0 years ago

rig ▴ 20

Hi All, I'm just starting an MSc in computational biology after completing a BSc in computer science, so my questions might seem trivial or just utter nonsense. Anyhow, what I want to do is look at the entire human genome and ask some questions (hopefully good ones :) ). The data will be taken from the following sites:

I'm encountering difficulties from the get-go. As I have learned, the data comes in various formats: SRA, BAM, BED, WIG. I thought that each of the files was a different encoding of the extracted DNA for my experiment of interest. Well... where is the DNA? To be more specific, I am having trouble grasping the different file formats:

BAM is said to hold sequence alignment information, but as I understand it the information is extracted from a single source and not compared to another, so what gives?
BED holds a list of features (am I correct to understand features as genes?) and their locations on some chromosome. First of all, I want all the genes that are active, not just those on a specific chromosome, so how do I obtain that? Secondly, can I assume that the represented features are the active genes for the cell type?
I have no idea what the rest of the formats do, or how they suit my goals.

So, how can I obtain a complete map of the active genes of a certain cell type (such as fetal brain cells)?

Thanks,

genome sra bam analysis • 2.9k views

updated 11.0 years ago by gammyknee ▴ 210 • written 11.0 years ago by rig ▴ 20

score 2 · Accepted Answer · 2013-11-28

I don't usually work with data from model genomes (such as the human genome), but here's a rough explanation of what you have.

These data types can usually be classified as sequence, alignment, or annotation.

Firstly, raw sequences (from the genome sequencer) are usually stored in FASTQ files, which give each called base along with its corresponding quality score. SRA is a compressed sequence format used for storage in large public archives such as NCBI's. Have a look at the NCBI SRA Toolkit to convert these SRA files to FASTQ (single-end or paired-end sequences).

A BAM file is an alignment file that holds all the sequences that have been aligned to a particular reference genome (in your case it's probably the human genome). It's a compressed file format, so it is sometimes used to store unaligned sequences as well. Have a look at BAM-to-FASTQ conversion tools.

BED files hold information (annotation) about the particular sequence that you have, or the particular genome that you want to analyse. Each feature may be a gene, but could also be something like a repetitive element or a non-coding RNA (there are plenty of things in a genome other than genes).
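To make the BED description a bit more concrete, here is a minimal Python sketch that reads BED-style lines and summarises them. It assumes the simplest layout (chromosome, start, end, optional name, whitespace-separated); the example records and the name `parse_bed` are made up for illustration and are not taken from any Roadmap file. Note that the chromosome is just the first column, so a single BED file can list features on every chromosome.

```python
from collections import Counter
from io import StringIO


def parse_bed(handle):
    """Yield (chrom, start, end, name) tuples; name is None if the column is absent."""
    for line in handle:
        line = line.strip()
        # Skip blank lines, comments, and UCSC 'track'/'browser' header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else None
        yield chrom, start, end, name


# Hypothetical records, for illustration only.
EXAMPLE_BED = """\
track name=example
chr1\t1000\t2000\tfeatureA
chr1\t5000\t5500\tfeatureB
chr2\t300\t900\tfeatureC
"""

features = list(parse_bed(StringIO(EXAMPLE_BED)))

# Number of features on each chromosome.
per_chrom = Counter(chrom for chrom, _, _, _ in features)

# BED intervals are 0-based and half-open, so a feature's length is end - start.
total_bp = sum(end - start for _, start, end, _ in features)

print(per_chrom)
print(total_bp)
```

For a real file you would pass an open file handle instead of the `StringIO` wrapper. Whether a feature means "active gene" depends entirely on how that particular BED file was produced, so check the description of each track before interpreting it.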

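WIG (wiggle) is a format that I've never used before, but according to the UCSC Genome Browser documentation it holds dense, continuous-valued data along the genome (such as signal or coverage tracks) rather than discrete features: http://genome.ucsc.edu/goldenPath/help/wiggle.html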
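For a feel of what a wiggle file contains, here is a rough Python sketch of a reader, written from the format description at the UCSC link above. It handles the two declaration styles described there (`fixedStep` and `variableStep`) and reports each value with 1-based, inclusive coordinates. The function name `parse_wig` and the example values are assumptions for illustration only, and a real project would more likely use an existing library than hand-rolled parsing.

```python
from io import StringIO


def parse_wig(handle):
    """Yield (chrom, start, end, value) using 1-based, inclusive coordinates."""
    mode = chrom = None
    pos = step = 0
    span = 1
    for line in handle:
        line = line.strip()
        # Skip blank lines, comments, and 'track'/'browser' header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        if line.startswith(("fixedStep", "variableStep")):
            # Declaration line, e.g. "fixedStep chrom=chr3 start=400601 step=100".
            parts = line.split()
            mode = parts[0]
            opts = dict(p.split("=", 1) for p in parts[1:])
            chrom = opts["chrom"]
            span = int(opts.get("span", 1))  # span is optional and defaults to 1
            if mode == "fixedStep":
                pos = int(opts["start"])
                step = int(opts["step"])
            continue
        if mode == "fixedStep":
            # One value per line; positions advance by a constant step.
            yield chrom, pos, pos + span - 1, float(line)
            pos += step
        elif mode == "variableStep":
            # Each line carries its own start position and a value.
            start, value = line.split()
            yield chrom, int(start), int(start) + span - 1, float(value)


# Hypothetical data, for illustration only.
EXAMPLE_WIG = """\
fixedStep chrom=chr3 start=400601 step=100 span=5
11
22
33
variableStep chrom=chr19 span=150
49304701 10.0
49304901 12.5
"""

records = list(parse_wig(StringIO(EXAMPLE_WIG)))
for record in records:
    print(record)
```

Unlike BED, each record here is a number attached to a stretch of the genome (for example a read-coverage or signal value), which is why WIG is better thought of as a signal track than as a list of features.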