Question

How to calculate gene expression levels from ENCODE RNA-seq experiments for protein-coding genes in GENCODE

11.2 years ago

Michael Z • 0

I am currently using the protein-coding genes from GENCODE as a stable source of gene annotations. However, when I try to find gene expression data for these genes, I get confused.

For each cell line, such as K562, there are RNA-seq data that include a bigWig file called "Transcription of K562 cells from ENCODE". It appears to give the expression level on some scale, but I cannot find detailed information about how it was calculated.

Forgive me if this is a simple question; I am completely new to RNA-seq. Should I start from the alignment BAM files for each replicate of the RNA-seq experiment, count how many reads fall within each gene body, and divide by the total number of reads in the replicate to get the RPKM?

Or can I simply take the values from the bigWig file, sum those that fall within a gene body, and apply some extra normalization?

Thanks!

rna-seq gene-expression • 7.6k views

score 1 · Answer 1 · 2013-10-28

You can't use bigWig files to do counting. There's no way to figure out how many reads generated the pileup at a particular position, so you will need to use the mapped data. However, note that ENCODE's mappings used TopHat 1.0.14, which had some important bugs. One of them caused reads to be mapped to pseudogenes instead of across splice junctions, even when the spliced alignment was better. Map the reads to the genome yourself using the latest version of TopHat or, alternatively, STAR.
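On the RPKM part of the question: RPKM normalizes the read count by gene length as well as by library size, so dividing by the total number of reads alone is not enough. A minimal sketch of the arithmetic, assuming you already have a per-gene read count from the mapped data (the function name and the numbers are made up for illustration):

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads.

    RPKM = read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)
         = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2 kb gene in a library of 20 million mapped reads.
example = rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=20_000_000)
print(example)  # 12.5
```

For paired-end data the same formula is usually applied to fragments rather than reads, which gives FPKM.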

score 1 · Answer 2 · 2013-10-28

Dario is right. The easiest approach is to download (wget) the K562 FASTQ files from ENCODE, fetch the genome sequence and index it with bowtie2, then run tophat2 on the reads. In the output you will see an accepted_hits.bam file. Next, run Cufflinks on that file, providing a GTF file that contains your protein-coding genes from GENCODE. If you are working with replicates, run Cuffcompare on the results. The final GTF file will then contain FPKM values for each gene in the dataset.
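The steps above could be sketched roughly as follows. This only builds and prints the command lines rather than running them; all file names, the index prefix, the output directories, and the thread count are placeholders, and the options shown (`-p`, `-G`, `-o`) should be checked against the versions of the tools you have installed.

```python
import shlex


def build_pipeline(fastq, genome_fasta, gtf,
                   index_prefix="genome_index",
                   tophat_dir="tophat_out",
                   cufflinks_dir="cufflinks_out",
                   threads=4):
    """Return the index, align, and quantify commands as argument lists."""
    return [
        # 1. Index the genome sequence for bowtie2.
        ["bowtie2-build", genome_fasta, index_prefix],
        # 2. Spliced alignment with tophat2, guided by the annotation.
        ["tophat2", "-p", str(threads), "-G", gtf, "-o", tophat_dir,
         index_prefix, fastq],
        # 3. Quantify the annotated genes only (-G) from the alignments.
        ["cufflinks", "-p", str(threads), "-G", gtf, "-o", cufflinks_dir,
         f"{tophat_dir}/accepted_hits.bam"],
    ]


# Placeholder file names; substitute your own downloads.
commands = build_pipeline("K562_rep1.fastq", "genome.fa",
                          "gencode.protein_coding.gtf")
for cmd in commands:
    print(shlex.join(cmd))
```

To actually run the steps, each list can be passed to `subprocess.run(cmd, check=True)`. Cufflinks also writes per-gene FPKM values to a `genes.fpkm_tracking` file in its output directory, which may be easier to parse than the GTF.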

score 0 · Answer 3 · 2013-10-30

11.2 years ago

Michael Z • 0

Thanks Dario and Manu, I will try your methods!
