Dear all,
I have 40 liver cancer samples. Among them I have two groups. GroupA (26 samples) and GroupB (14 samples) and I have a total of 10,000 lncRNA transcripts. I'm trying to do differential analysis between GroupA vs GroupB using edgeR
.
With edgeR analysis I see that there are around 300 upregulated lncRNA transcripts in GroupA vs GroupB. But I see that all these upregulated lncRNA transcripts are not expressed in all GroupA samples (highly expressed only in few samples), and moreover I see that these are not at all expressed in GroupB which means zero counts in GroupB.
I took a specific lncRNA transcript and made a box plot which looks like below:
There are many other upregulated lncRNA transcripts expressed only in few samples of GroupA and no expression in GroupB for differential analysis between GroupA vs GroupB.
What I should do now? Can I carry on with further analysis with those 300 upregulated lncRNA transcripts? Or Is there any other way to get true differentially expressed genes?
These are "true" differentially expressed genes. It is a group mean that is significant, not every single event. You can use
FilterByExpr
and try to tweak parameters there to try and eliminate samples with many low counts. Or you can manually require that at least x% of samples per group have at least y counts. Maybe 50% per group must have at least 10 raw counts or so.thank you...I actually used Filtering step from
edgeR
already likekeep <- rowSums(cpm(y) > 1) >= sample number
. With these among 10,000 I kept around 7000 lncRNA transcripts which are used for DEA.In there is other transcript in below box plot Is this still said as upregulated in GroupA?
thank you.
So are you performing differential analysis of just lncRNAs ? May be that’s wrong. You need to use all genes quantified.
I'm interested in only lncRNAs. I don't think it is wrong.
Are you performing
cpm
on total number of reads just mapped to lncRNAs ? What if you don’t have a million reads mapped in a sample ? Your library size is total reads or reads mapped only to lncRNAs ?These 10,000 lncRNA transcripts were novel transcripts in my analysis. So, yes I used
cpm
only.Be aware that you should perform analysis on all genes and then select the interesting ones from the results. I hope you did not run edgeR on only the novel transcripts. The power of edgeR comes from sharing information across genes so you benefit from a large number of genes and many of those being highly-expressed and not changing. I suggest you include the novel transcripts (or genes depending on what exactly they are) to the set of genes already present in the reference annotation and repeat analysis.
From the pipeline lncRNApipe I used, to detect novel lncRNAs....I found approx 10,000 novel lncRNA transcripts. And this is what I used with edgeR for GroupA vs GroupB comparison.
Don't know what else I should use.
In that case, at least you need to provide the library sizes (sequencing depth of each sample) as a vector to edgeR cpm function (
lib.size
) for the appropriate normalisation. Or merge your novel lncRNAs with known annotations (protein-coding genes and known lncRNAs) to create a final GTF and perform differential analysis all together.Yes, I added
lib.size
forcpm
. I'm afraid that we if I merge these novel transcripts with mRNA and Known lncRNAs, I won't be able to detect any novel differentially expressed transcripts. Because mRNAs are high expressed and lncRNAs are low expressed.I see that library sizes of the samples is very low. There are two samples with very high raw counts. I feel like doing downsampling to these two samples (GroupA_12 and GroupB_2). May I know how to this downsampling before DEA? Please let me know.
@geek_y and @ATpoint Could you please check my below comment and tell me your opinion.thanq
Hi, your lib sizes are too small because you are calculating only the lncRNA reads. You can get total number of reads in a bam file and use that number as library size for each sample. Otherwise I am not sure if CPM makes sense (Thats why logCPM values in your boxplots looks very high. If you have less than a million reads and you are scaling it to a million reads, which artificially inflates your signals).