Question

Integrating HTSeq count data of different samples

1

10.4 years ago

kozhaki.seq ▴ 60

I am building a resource to estimate gene expression levels across many plant tissues using RNA-seq data. I have collected datasets from different experiments from GEO and other sources. Using HTSeq, I estimate the counts for each sample (i.e., samples from different experiments). Finally, I merge all the datasets into a single source, so that the expression level of a gene can be viewed across all samples (using a heatmap of the count data). However, I am concerned about the validity of my method. Could anyone comment on my strategy?

I have two specific questions:

Is it valid to merge the data, given that the different experiments may have a 'batch effect'?
If it is OK to merge the samples, should I use the HTSeq count data or FPKM for the heatmap?

Thanks

FPKM HTSeq RNASeq • 5.7k views

updated 3.2 years ago by Ram 45k • written 10.4 years ago by kozhaki.seq ▴ 60

0

What do you mean by "merge samples"?

Generally it should be OK to take different GEO datasets and compare them, provided they share a similar type of experimental design and differ in conditions/cell lines.

How much does the number of reads per sample vary across the different samples?

You need to normalise the data before you plot any heatmaps.

10.4 years ago by GouthamAtla 12k

Ram · Answer 1 · 2015-03-12

1

10.4 years ago

mark.ziemann ★ 2.0k

Is it valid to merge the data, given that the different experiments may have a 'batch effect'?

Yes, there will be a batch effect for many technical reasons, but unless you are going to perform the experiments again yourself, you don't have much choice. Still, I would recommend validating some of the major findings in your own plant tissues with a method like RT-qPCR to show that the RNA-seq trends are real.

If it is OK to merge the samples, should I use the HTSeq count data or FPKM for the heatmap?

As Geek_y states, you do need to normalise the data, because each dataset will have a different number of tags. FPKM is a widely accepted method for doing this.
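To make the FPKM idea concrete, here is a minimal sketch in plain Python (standard library only). The gene names, lengths, and counts are made-up toy values, and the library size is assumed to be the sum of the counted fragments; some pipelines use the total number of mapped reads instead.

```python
def fpkm(counts, lengths_bp):
    """FPKM for one sample: fragments per kilobase of gene per million fragments."""
    total = sum(counts.values())  # library size (assumption: sum of counted fragments)
    return {
        gene: c * 1e9 / (lengths_bp[gene] * total)
        for gene, c in counts.items()
    }

# Hypothetical toy data: the same tissue sequenced at two different depths
lengths_bp = {"geneA": 2000, "geneB": 1000}
sample1 = {"geneA": 200, "geneB": 300}     # 500 fragments in total
sample2 = {"geneA": 2000, "geneB": 3000}   # 5000 fragments in total

fpkm1 = fpkm(sample1, lengths_bp)
fpkm2 = fpkm(sample2, lengths_bp)
```

In this toy case the two samples give the same FPKM values even though the raw counts differ tenfold, which is the depth and gene-length correction FPKM is meant to provide. It does not correct for differences in library composition between samples.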

updated 3.2 years ago by Ram 45k • written 10.4 years ago by mark.ziemann ★ 2.0k

3

FPKM is not normalisation. Always normalise. See, for example, the DESeq2 or edgeR packages in R.

updated 3.2 years ago by Ram 45k • written 10.4 years ago by Danielk ▴ 640

2

Presumably you meant, "FPKMs are not widely accepted as normalized values", which would indeed be true. They are normalized values; it's just that the method is easily biased and the resulting values are less useful for statistics.

updated 10.4 years ago by Sean Davis 27k • written 10.4 years ago by Devon Ryan 105k

1

True, Devon & Danielk: FPKM is not a robust method for determining differential expression, but it would be OK for visualising genes of interest in a heatmap, as the OP requires.

10.4 years ago by mark.ziemann ★ 2.0k

Ram · Answer 2 · 2015-03-15

Danielk, Devon, and Mark are right. TMM (edgeR) and DESeq normalization are more robust than FPKM. Below is a good paper, along with its key points, for your reference.

Dillies et al. (2013). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief. Bioinform. 14, 671-683.

Key points

Normalization of RNA-seq data in the context of differential analysis is essential in order to account for the presence of systematic variation between samples as well as differences in library composition.
The Total Count and RPKM normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.
Only the DESeq and TMM normalization methods are robust to the presence of different library sizes and widely different library compositions, both of which are typical of real RNA-seq data.
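For readers wondering what a DESeq-style normalization does under the hood, here is a rough sketch of the median-of-ratios idea in plain Python. It is an illustration with invented sample names and counts, not the package's actual code; in practice you would let DESeq or edgeR compute the factors.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: {sample: {gene: count}} -> {sample: size factor}."""
    samples = list(counts)
    genes = list(counts[samples[0]])
    # Log geometric mean per gene, keeping only genes with a nonzero count in every sample
    log_geo_mean = {}
    for g in genes:
        vals = [counts[s][g] for s in samples]
        if all(v > 0 for v in vals):
            log_geo_mean[g] = sum(math.log(v) for v in vals) / len(vals)
    # Size factor: median ratio of a sample's counts to the gene-wise geometric means
    return {
        s: math.exp(median(math.log(counts[s][g]) - m for g, m in log_geo_mean.items()))
        for s in samples
    }

# Hypothetical toy data: sample B was sequenced twice as deeply as sample A
counts = {
    "A": {"g1": 10, "g2": 20, "g3": 40, "g4": 0},
    "B": {"g1": 20, "g2": 40, "g3": 80, "g4": 5},
}
sf = size_factors(counts)
# Dividing each sample by its own size factor puts the counts on a common scale
normalized = {s: {g: c / sf[s] for g, c in counts[s].items()} for s in counts}
```

Because the factor is a median over genes, a handful of very highly expressed genes in one tissue has little influence on it, which is the robustness to library composition that the key points above refer to.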

Ram · Answer 3 · 2015-03-16

I don't know which software you plan to use downstream, but most people use DESeq, edgeR, or limma-voom for normalization and DEG analysis. In each of these three packages, a size (or normalization) factor is calculated for every sample, and each sample is normalized by its own factor. If your data come from a single batch, you can simply export the count matrix after normalization. If not, then when you do DEG analysis, or any other kind of analysis, the influence of condition and batch should be considered at the same time. When you build the design matrix, it will look like d1 = model.matrix(~ -1 + condition + batch, data) and d0 = model.matrix(~ -1 + condition, data). Then use d1 and d0 to build model1 and model0 (how you build them depends on the software you use). Comparing model1 against model0 then lets you separate the batch influence from the condition effect.
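To show what the two design matrices d1 and d0 look like, here is a small sketch in plain Python that mimics the column layout of a no-intercept design with a condition term and a batch term. The helper `design_matrix`, the tissue names, and the batch labels are all hypothetical, chosen only for illustration.

```python
def design_matrix(samples, factors):
    """Indicator columns: every level of the first factor (no intercept), and
    all but the first (reference) level of each later factor."""
    columns = []  # (column name, factor, level)
    for i, factor in enumerate(factors):
        levels = sorted({s[factor] for s in samples})
        if i > 0:
            levels = levels[1:]  # drop the reference level to avoid collinearity
        columns += [(f"{factor}{lvl}", factor, lvl) for lvl in levels]
    names = [name for name, _, _ in columns]
    rows = [[1 if s[fac] == lvl else 0 for _, fac, lvl in columns] for s in samples]
    return names, rows

# Hypothetical samples: two tissues, each present in two experiments (batches)
samples = [
    {"condition": "leaf", "batch": "GSE_A"},
    {"condition": "root", "batch": "GSE_A"},
    {"condition": "leaf", "batch": "GSE_B"},
    {"condition": "root", "batch": "GSE_B"},
]
names1, d1 = design_matrix(samples, ["condition", "batch"])  # condition + batch
names0, d0 = design_matrix(samples, ["condition"])           # condition only
```

Note that this only works when tissues are shared across experiments. If every tissue comes from its own experiment, batch and condition are completely confounded and no model can tell them apart.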