Differential Expression of Targeted RNA-Seq
7.1 years ago

Hey there,

I am currently writing my master's thesis on targeted RNA-Seq. I was wondering whether the data analysis differs from that of whole-transcriptome RNA-Seq. My panel contains about 200 genes from the mTOR and TLR pathways and 12 housekeeping genes.

I used Ion Torrent Suite to get counts for these genes, ran a differential expression analysis with edgeR, and found differentially expressed genes. My question is: do the statistical assumptions made for whole-transcriptome data hold well enough to be applied to this small subset?

Best regards, Niklas

PS: This is my first question here, so if I did something wrong, please tell me :)

rna-seq R • 3.9k views

I guess I just don't see the added value. People do targeted DNA / custom capture because it's much less expensive and laborious than doing WGS. In most instances, 30M reads is overkill for human RNAseq and that's only going to cost about $250 for the prep and ~$1000-2000 for the 300M HiSeq lane. Not sure I see what the upside or application of this is.


Well, I am bound to the resources we have at our university, and a PGM is not capable of sequencing replicates of human transcriptomes with enough coverage. That's the reason.

7.1 years ago

My only concern with targeted experiments is that the library-size normalization is (A) decreasingly robust and (B) more likely to have its assumptions violated. (B) is the more important of the two. My fear is that people are going to increasingly run into cases like the one with Myc, where you have global changes due to a treatment. This sort of thing will result in the DE results being very wrong. I presume you're normalizing to the housekeeping genes, or at least checking to ensure they're not DE, which is a wise decision (if not, please do that!).

BTW, nice first post and welcome to the site!
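
To make the concern about library-size normalization concrete, here is a minimal Python sketch (not edgeR or DESeq2 code). The panel, the counts, and the helper names `total_count_factors`, `housekeeping_factors`, and `fold_change` are all made up for illustration. It assumes the housekeeping genes really are stable, that the treated sample was sequenced twice as deep, and that every target gene goes up 3-fold.

```python
import math

def _centre(values):
    """Rescale positive values so that their geometric mean is 1."""
    log_mean = sum(math.log(v) for v in values) / len(values)
    return [v / math.exp(log_mean) for v in values]

def total_count_factors(counts):
    """Per-sample factors from the total counts over all panel genes."""
    n_samples = len(counts[0])
    return _centre([sum(row[j] for row in counts) for j in range(n_samples)])

def housekeeping_factors(counts, hk_rows):
    """Per-sample factors from the summed counts of the housekeeping genes only."""
    n_samples = len(counts[0])
    return _centre([sum(counts[g][j] for g in hk_rows) for j in range(n_samples)])

def fold_change(counts, factors, gene):
    """Treated / control ratio of one gene after dividing by the factors."""
    control = counts[gene][0] / factors[0]
    treated = counts[gene][1] / factors[1]
    return treated / control

# Hypothetical panel, columns = (control, treated).
# Rows 0-1: housekeeping genes (unchanged, only 2x deeper sequencing).
# Rows 2-4: target genes (3x up-regulated on top of the 2x depth).
counts = [
    [100, 200],
    [80, 160],
    [50, 300],
    [40, 240],
    [60, 360],
]

hk = housekeeping_factors(counts, hk_rows=[0, 1])
total = total_count_factors(counts)

print("target gene, housekeeping normalization:", fold_change(counts, hk, 2))
print("target gene, total-count normalization: ", fold_change(counts, total, 2))
print("housekeeping gene, total-count normalization:", fold_change(counts, total, 0))
```

With these toy numbers, scaling by the housekeeping genes recovers the 3-fold change, while scaling by the panel total shrinks it to roughly 1.57-fold and makes the unchanged housekeeping genes look down-regulated.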


Hi Devon,

Would you mind expanding a bit on (A)? I'm curious what your thoughts are on this. Are you just talking about the size factor calculation, or something more generalizable to TPM/RPKM/FPKM? I can see the issue with size factors, as they should probably be avoided outright, but something like TPM should be fairly robust so long as you do between-sample normalization as well.

M


It's just the size factor calculation. That's more robust as you increase the number of genes/transcripts measured. The "between sample normalization" is exactly the size factor, so TPM/CPM are also affected.


I believe the size factor is described as follows (DESeq manual):

Given a matrix or data frame of count data, this function estimates the size factors as follows:
Each column is divided by the geometric means of the rows. The median (or, if requested, another
location estimator) of these ratios (skipping the genes with a geometric mean of zero) is used as the
size factor for this column.
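
A rough Python sketch of the procedure quoted above, for illustration only. It is not the DESeq implementation; the function name `size_factors` and the toy matrix are assumptions. It works in log space and skips any gene with a zero count in some sample, since such a gene has a geometric mean of zero.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of gene rows, each a list of per-sample counts.
    Raises statistics.StatisticsError if every gene contains a zero.
    """
    n_samples = len(counts[0])

    # Log geometric mean of each gene across samples (None if any count is 0).
    log_geo_means = []
    for row in counts:
        if all(c > 0 for c in row):
            log_geo_means.append(sum(math.log(c) for c in row) / len(row))
        else:
            log_geo_means.append(None)

    factors = []
    for j in range(n_samples):
        # Log ratio of this sample's count to the gene's geometric mean.
        log_ratios = [
            math.log(row[j]) - lgm
            for row, lgm in zip(counts, log_geo_means)
            if lgm is not None
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy matrix: sample 2 is sample 1 sequenced twice as deep.
# The last gene has a zero count and is skipped.
counts = [
    [100, 200],
    [50, 100],
    [30, 60],
    [0, 10],
]
sf = size_factors(counts)
print(sf)  # roughly [0.707, 1.414]
```

Because the factor is a median over genes, it is estimated from fewer ratios as the panel shrinks, which is one way to read the robustness point made earlier in the thread.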

That's different from TPM/RPKM, which use a reads-per-million scaling factor to normalize within a sample. TPM takes RPKM a step further and ensures that the per-sample TPM totals are equal across all samples. But that still shouldn't necessarily be seen as an optimal between-sample normalization. I was thinking of something different for between-sample normalization, e.g., quantile normalization, vst, rlog, something like that, but I'm unsure how those types of procedures are affected by a reduction in the number of quantified features. Might be something interesting to look into.
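
For reference, a small Python sketch of the within-sample scaling described here. The function names `cpm` and `tpm`, the gene lengths, and the counts are hypothetical. It also shows why a per-million total computed over a small panel is sensitive to a single strongly changing gene.

```python
def cpm(counts):
    """Counts per million: each sample is scaled by its own total count."""
    n_samples = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [[row[j] / totals[j] * 1e6 for j in range(n_samples)] for row in counts]

def tpm(counts, lengths_kb):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    n_samples = len(counts[0])
    rates = [[c / length for c in row] for row, length in zip(counts, lengths_kb)]
    totals = [sum(r[j] for r in rates) for j in range(n_samples)]
    return [[r[j] / totals[j] * 1e6 for j in range(n_samples)] for r in rates]

# Hypothetical 3-gene panel, 2 samples. Only the last gene changes (4x up).
counts = [
    [100, 100],
    [100, 100],
    [100, 400],
]
lengths_kb = [1.0, 2.0, 1.0]

t = tpm(counts, lengths_kb)
column_sums = [sum(row[j] for row in t) for j in range(2)]
print(column_sums)          # each sample sums to 1e6 (up to rounding)
print(t[0][0], t[0][1])     # gene 0 has identical counts but a lower TPM in sample 2
```

Gene 0 has the same count in both samples, yet its TPM drops in the second sample because the panel total is dominated by the one gene that changed.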


It's not uncommon (especially for RPKM/FPKM) to have the counts scaled between samples first by DESeq2/edgeR/etc. and then converted to RPKM/FPKM.

Regarding quantile normalization, you can certainly try that too. However, you still have all of the same caveats.


I tried the edgeR manual's suggestion of estimating the common dispersion from the housekeeping genes (0.2572). Surprisingly, it did not differ from the dispersion estimated from all genes (0.2565). So maybe the library sizes are big enough for this estimation. I'm not deep enough into statistics to draw a conclusion from this.


I was reading this post and I have some doubts regarding my case. I am using edgeR and have 36 samples: 12 normal, 12 benign tumor and 12 malignant tumor. I have 60 genes (~1200 exons), all related to cancer. I was doing the normalization with all the samples. Is this approach correct, or would it be better to use only the normal samples for normalization? I don't have housekeeping genes.

Best


It's impossible to use only normal samples for normalization, you have to use all of the samples. The only question becomes whether you include all groups or just a subset of them. In general you're better off using all of the groups that you'll end up making comparisons of.

You will very likely have difficulty with 60 genes. I presume that these were chosen because they're expected to change, in which case the likelihood of the normalization assumptions being violated is increased significantly.


I see. The point is that those 60 genes are associated with cancer. So, do you think it is possible to do a differential expression analysis with this data? I have done it so far, and I get most of them as differentially expressed. After what you've told me, I don't know whether this is a reliable result.

Sincerely thanks a lot!!

Best


If you expect them to be more or less evenly split between up/down regulated then you can proceed. But if you expect an asymmetric change then you should be hesitant to believe the results. As a rule, all standard tools can be expected to break (i.e., produce unreliable output) if you have a consistent asymmetric change in expression between groups that affects the overwhelming majority of measured genes.
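
A toy Python illustration of that rule, under an idealized assumption: for a two-sample comparison, a median-based normalization effectively re-centres the log fold changes so that the median gene appears unchanged. The function name `observed_log2fc` and both sets of fold changes are invented for the example.

```python
from statistics import median

def observed_log2fc(true_log2fc):
    """Log2 fold changes as seen after a normalization that assumes
    the median gene does not change between the two groups."""
    shift = median(true_log2fc)
    return [x - shift for x in true_log2fc]

# Roughly symmetric changes: the median gene is truly unchanged.
symmetric = [-1.0, -1.0, 0.0, 0.0, 0.0, 1.0, 1.0]

# Asymmetric changes: most of the panel is truly up 2-fold.
asymmetric = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]

print(observed_log2fc(symmetric))   # same as the input
print(observed_log2fc(asymmetric))  # [-1.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

In the asymmetric case the five up-regulated genes look unchanged and the two stable genes look down-regulated, so the DE calls are the reverse of the truth.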


Sorry Devon, I noticed that the results of differential expression with DESeq2 change depending on whether the housekeeping genes are included. I am using EdgeSeq for 2556 onco biomarkers, so I am wondering: should I keep or remove the housekeeping genes for differential expression? I heard that DESeq2 by default does not need any housekeeping genes, though.


You don't have any reason to remove them and keeping them will likely aid in normalization, so keep them. DESeq2 doesn't have any concept of a house-keeping gene, but since many of the biomarkers could be changing you'll want to increase the number of genes assayed that won't.


Thanks a lot Devon. I picked one of the genes that DESeq2 reported as highly differentially expressed and ran a Wilcoxon test to see whether it differs between treatment and control, and the result was totally non-significant :(

Now I am confused: is this gene really differentially expressed, or is the Wilcoxon test not the right test?


You mostly need enough samples. For the whole transcriptome, shoot for more like 50+ per group for a test like that.
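
One way to see why sample size matters so much for a rank test, as a hedged Python sketch: with no ties, the smallest two-sided p-value an exact Wilcoxon rank-sum test can produce for two groups of n samples is 2 / C(2n, n), reached only when the groups are completely separated. The helper names and the Bonferroni-style threshold are assumptions for illustration. This is a floor on what is attainable, not a power calculation, so realistic sample sizes will be larger.

```python
from math import comb

def min_two_sided_p(n_per_group):
    """Smallest exact two-sided p-value of a Wilcoxon rank-sum test
    with n_per_group samples in each group and no ties."""
    return 2 / comb(2 * n_per_group, n_per_group)

def smallest_n_that_can_pass(n_genes, alpha=0.05):
    """Smallest group size at which a perfectly separated gene could
    survive a Bonferroni correction over n_genes tests."""
    n = 1
    while min_two_sided_p(n) >= alpha / n_genes:
        n += 1
    return n

for n in (3, 4, 5, 6):
    print(n, min_two_sided_p(n))

print(smallest_n_that_can_pass(200))     # a 200-gene panel
print(smallest_n_that_can_pass(20000))   # roughly whole-transcriptome scale
```

With three samples per group the test cannot go below p = 0.1 for any gene, however large the effect, which is consistent with a gene being significant in DESeq2 but not in a Wilcoxon test.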


I also encountered a situation where the genes were significant in DESeq2 but not significant using the Wilcoxon test. I want to create a boxplot; can I use the adjusted p-values obtained from DESeq2 for labeling?
