Question

gene expression analysis using RNA seq data

1

Entering edit mode

6.2 years ago

maryak ▴ 20

I am new with RNA seq. I want to identify Cancerous genes using RNA seq gene expression data but unable to understand what does the values in FPKM normalized data indicate e.g does zero means that gene is not expressed and larger values means gene is highly expressed? Will it be ok to use FPKM normalized RNA seq data for differential gene expression analysis? Plus the data I am using is log2 transformed...

Thanking in anticipation..

RNA-Seq gene sequencing • 5.9k views

ADD COMMENT • link updated 6.2 years ago by Ram 44k • written 6.2 years ago by maryak ▴ 20

score 2 · Answer 1 · 2018-09-21

2

Entering edit mode

6.2 years ago

Kevin Blighe 88k

Will it be ok to use FPKM normalized RNA seq data for differential gene expression analysis ????

It will not be okay to use FPKM values for differential expression analyses, even if you have seen it in some major publication. Statisticians / bio-statisticians are not accustom to peer review those publications, but they ought to.

Please read these:

1, What the FPKM? A review of RNA-Seq expression units

In bold, at the top (regarding RPKM, FPKM, TPM, and CPM):

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.

2, A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

In their key points:

The Total Count and RPKM normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

[Note - FPKM is essentially the same as RPKM]

--------------------------------------------------------

As sophiespo mentions, try to obtain the raw counts and then feed these into one of the more serious differential expression analysis tools.

Kevin

ADD COMMENT • link 5.8 years ago by Kevin Blighe 88k

1

Entering edit mode

I very much agree that having p-values calculated from counts can be useful (and it important to be familiar with programs like edgeR / DESeq2), but I strongly disagree with the conclusion that "The community needs to abandon RPKM and FPKM data."

For me, I think having the independently calculated FPKM values can be extremely important in terms of testing the differential expression methods (which should be done for each project). For example, you have a gene that is known to be affected, shows clear differential expression in the FPKM values, but may not be differentially expressed with at least one method for your particular dataset. You can also visually compare heatmaps with genes identified using different methods this way.

I admittedly don't exactly have a formal publication to exactly show this (I am currently trying to help get the separate studies to the point of publication), but I do have a benchmark from a little while ago:

Warden et al. 2013 (click here for PDF)

In that situation, the number of samples is more than you would probably expect in a typical experiment, but the results match what I would typically expect in that log2(FPKM+0.1) values to be relatively more conservative than edgeR/DESeq2, and that matches most of what you see in Figure 6 (although there were some situations where the DESeq1 2-variable model, adjusting for patient, produced the gene list with the highest proportion of genes identified using all 3 methods in that study).

ADD REPLY • link 6.2 years ago by Charles Warden 8.3k

1

Entering edit mode

Thanks for the input Charles! I can indeed see the utility of RPKM / FPKM values in certain situations. As there are critics of FPKM, there are equally critics of DESeq, EdgeR, and Limma. There are even critics of the BWA aligner. It is good to have diverse input / ideas on things.

ADD REPLY • link 5.8 years ago by Kevin Blighe 88k

1

Entering edit mode

Hi Kevin,

Actually, the study above used a fairly large amount of RNA-Seq samples: ERP001058

I know that some of the methods change the workflow a bit depending upon the sample size, and it would probably be interesting to see the effects of sample size on parameter estimation in different situations.

However, I would probably say most of my more recent experience has been benchmarks with smaller datasets (less than ~30 samples), which is likely what I think most people would be dealing with.

Also, I can definitely see how there could be some confusion, particularly since there was a paired comparison with a smaller number of samples paired from different technologies (from GSE37704, in Figure 7), as well as supplemental figures that are easy to miss (link to PDF here)

-Charles

ADD REPLY • link 6.2 years ago by Charles Warden 8.3k

0

Entering edit mode

Actually i want to identify cancer related genes using python as programming language so you mean to say i must give raw counts data to my feature reduction algorithm and then classify??

ADD REPLY • link 6.2 years ago by maryak ▴ 20

1

Entering edit mode

All the mentionned packages (DESeq2, edgeR) are R packages. You should definitivelly try R for such analysis. Maybe you could use DESeq2 vsd transformed read count matrix for subsequent analysis in python

ADD REPLY • link 6.2 years ago by Nicolas Rosewick 11k

0

Entering edit mode

I do not know what is your feature reduction algorithm, but I am highly doubtful that it performs any sort of normalisation. Trying to obtain statistical inferences from raw counts in any setting would be a bad move.

As per Nicholas. I would encourage you to make the most of the opportunity and to learn some R skills, too. You can use the transformed counts (variance stabilised or regularised log) from DESeq2 for most downstream applications.

Of course, if you need guidance with regard to R, et cetera, then let us know. We are all learning each day and it would be wrong of anyone to assume that they have already learned enough - even Full Professors near retirement are learning.

You may make an initial start from here: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#count-matrix-input

ADD REPLY • link 6.2 years ago by Kevin Blighe 88k

0

Entering edit mode

I am using particle swarm optimization for feature selection.... and the data that i am using is also available in HTseq-Counts form some of the information regarding data is below.. type of data: gene expression RNAseq unit: log2(count+1) There is also a note written on the site that "Data from the same sample but from different vials/portions/analytes/aliquotes is averaged; data from different samples is combined into genomicMatrix; all data is then log2(x+1) transformed"

ADD REPLY • link 6.2 years ago by maryak ▴ 20

0

Entering edit mode

I see, I think that I know which data that is, in that case. It must be FPKM-UQ. In realising that FPKM was not ideal, the consortium created the FPKM-UQ method, which is like putting a cheap bandage on a deep wound. That data is normalised, though, and you would be able to use it for your algorithm. Just be aware that FPKM counts are not the greatest for differential expression analysis. Put it this way: if you used that data and ever wanted to publish, your paper would not be rejected. Best of luck.

As a final thing, just check the distribution of the data via histogram. It should generally follow a binomial distribution, which, I assume, the particle swarm optimization algorithm expects.

ADD REPLY • link 5.8 years ago by Kevin Blighe 88k

0

Entering edit mode

Thanks Kevin you are right its FPKM-UQ data.Now can you please guide me about what does the values in this data indicate i.e does 0 zero means gene is not expressed and some higher value means gene is highly expressed???

ADD REPLY • link 6.2 years ago by maryak ▴ 20

0

Entering edit mode

Yes, you can interpret it that zero means no expression. It follows, then, that higher values equate to higher expression. However, in FPKM, values are not comparable across different samples. In some extreme cases, FPKM 10 may be equivalent to FPKM 100 in another sample. You should consider converting the data to the Z-scale with the zFPKM R package, first.

ADD REPLY • link 6.2 years ago by Kevin Blighe 88k

0

Entering edit mode

If i use htseq-counts data that is log2(x+1) transformed then normalize it using edgeR or Deseq 2 will i able to identify cancer related genes?

ADD REPLY • link 6.2 years ago by maryak ▴ 20

1

Entering edit mode

If you just take HTseq counts, then these are 'raw' counts and are not normalised. You may mean normalised counts that derived from HTseq.

If you can just point me to the exact data that you're using, then it may help. It is obviously TCGA data, and I do feel sorry for anyone new who is trying to get a grasp of this [TCGA] data. It is 'sprayed' all over the World Wide Web and is in various states of processing on various third party websites.

ADD REPLY • link 6.2 years ago by Kevin Blighe 88k

0

Entering edit mode

https://xenabrowser.net/datapages/ i am talking about TCGA datasets being mentioned in this link.. if we open a particular cancer type then after that under "gene expression RNAseq" there are three form available HTseq-counts,Htseq-FPKM,HTseq-FPKM-UQ..Plz guide me which is the best option to use in order to identify cancer related genes..

ADD REPLY • link 6.2 years ago by maryak ▴ 20

1

Entering edit mode

HTseq-counts is the raw read count using htseq-count : https://htseq.readthedocs.io/en/master/count.html. So if you want to use DESeq2 or edgeR, you should use these files.

ADD REPLY • link 6.2 years ago by Nicolas Rosewick 11k

0

Entering edit mode

Thanks Nicolas. Maryam, you must mean this:

HTSeq - Counts: these are the raw counts; however, Xena has incremented them and then logged (base 2). To get these back to their original form, you need to do:

(2 ^ (counts)) - 1

You would then normalise the values from this in EdgeR or DESeq2.

Conversely, if you take HTSeq - FPKM / FPKM-UQ, you can just convert these to Z-scores using zFPKM package, and then just use those.

ADD REPLY • link 5.8 years ago by Kevin Blighe 88k

0

Entering edit mode

yes you are right..After normalizing with DESeq2 or edgeR will i be able to apply particle swarm optimization on normalized data using python??can you please guide me how to use this normalized data with python? Secondly what effect will it make on normalization results if don,t reverse log and subtract 1

ADD REPLY • link 6.2 years ago by maryak ▴ 20

0

Entering edit mode

If you take just the HTseq - Counts data in its current form and just use that, then your conclusions from the data will be biased. Why? - because this data is raw and is normalised for neither within- nor cross-sample biases.

ADD REPLY • link 5.8 years ago by Kevin Blighe 88k

0

Entering edit mode

Kevin i have tried to normalize HTseq count data using edgeR .I used calcNormFactors function to obtain TMM normalized value but my output file shows colums like samples.lib.size,samples.norm.factors but its not giving me TMM values can you please guide me how to do this??

ADD REPLY • link 6.2 years ago by maryak ▴ 20

0

Entering edit mode

Good work, dude! You should use the cpm() function in order to extract the normalised counts. Please see here: https://support.bioconductor.org/p/77193/

ADD REPLY • link 6.2 years ago by Kevin Blighe 88k

0

Entering edit mode

Thanks for your kind words... CPM() will return me counts per million ?After that will i be able to perform cross sample analysis(differential expression analysis) on these values returned by CPM. i am unable to understand the logic behind it .

ADD REPLY • link 6.1 years ago by maryak ▴ 20

0

Entering edit mode

If you look at the vignette for EdgeR, there will be information on how you can conduct differential expression analysis. Take a look Here (page 19)

ADD REPLY • link 5.8 years ago by Kevin Blighe 88k

0

Entering edit mode

I just wanted to say thank you for your clarification. I've been trying to use the TCGA RNA-seq data in a different analysis pipeline, and I have found it such a pain to decipher exactly what they have done to their data.

ADD REPLY • link 5.9 years ago by Daniel Considine • 0

2

Entering edit mode

Did you look at this page? Things have been clearly explained as to how the data was processed.

ADD REPLY • link 5.9 years ago by GenoMax 147k

score 1 · Answer 2 · 2018-09-21

Hey I am adding these links so you can go through to understand the normalization of FPKM/RPKM and TPM- https://statquest.org/2015/07/09/rpkm-fpkm-and-tpm-clearly-explained/ https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/ FPKM of a gene (say ABB) is paired end reads normalized to total reads (all genes) sequenced and identified for a sample, per kilobase of the gene ABB. Also since you're going to work with RNA-Seq data, you might find this link useful for further downstream analysis. https://support.bioconductor.org/p/53986/

score 1 · Answer 3 · 2018-09-21

1

Entering edit mode

6.2 years ago

sophiespo ▴ 90

The most popular differential expression algorithms require that you supply raw counts, not FPKM.

DeSeq2

edgeR

Here's a good differential expression workflow using edgeR in R

ADD COMMENT • link 6.2 years ago by sophiespo ▴ 90