I am using edgeR to analyse RNA-seq data, and I do not know whether I should keep the rows below for the analysis:
__no_feature
__ambiguous
__too_low_aQual
__not_aligned
__alignment_not_unique
I think I should keep them, because I think edgeR needs the number of reads counted in these categories to calculate CPM and logCPM, but I am not sure.
Can someone help me?
Thank you
Are you sure edgeR uses these last 5 lines with the "__" prefix? I don't think those are important. Do you have any source for your claim?
For your information, here is the edgeR code: https://rdrr.io/bioc/edgeR/src/R/readDGE.R
As you can see, a warning will be raised whenever a line starting with "__" is read, as those are not actual gene lines.
I understand your point. But in that case, how can edgeR know the total number of reads to calculate CPM, for example?
The total number of mapped reads includes those categories, doesn't it?
Thank you for helping.
CPM is calculated relative to the total number of reads assigned to a gene, not necessarily the total number of (mapped) reads.
I am analysing miRNAs. If I cut those lines, my CPM will be "number of reads / number of miRNA-aligned reads". But in most cases I prefer "number of reads / number of aligned reads". For example, if I want to compare miRNAs with piRNAs in the same sample and I cut those lines, it is not possible to know the difference between the total number of miRNA reads and the total number of piRNA reads, for example which of them is more expressed in my sample.
I think the best option is to keep "no feature" and "ambiguous" so that I have the total number of reads aligned by the aligner. Am I wrong?
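To make the two denominators discussed above concrete, here is a minimal sketch in plain Python (not edgeR itself). The feature names and counts are made up for illustration, and treating "__no_feature" plus "__ambiguous" as the aligned-but-unassigned reads follows the suggestion in this thread rather than any edgeR default.

```python
# Hypothetical htseq-count style counts for one sample (made-up numbers).
counts = {
    "miR-1": 600,
    "miR-2": 400,
    "__no_feature": 3000,
    "__ambiguous": 1000,
}

def cpm(count, library_size):
    """Counts per million for one feature, given a library size."""
    return count * 1_000_000 / library_size

# Denominator 1: only reads assigned to real features (rows without "__").
assigned_total = sum(n for name, n in counts.items() if not name.startswith("__"))

# Denominator 2: assigned reads plus the aligned-but-unassigned categories.
aligned_total = assigned_total + counts["__no_feature"] + counts["__ambiguous"]

cpm_assigned = cpm(counts["miR-1"], assigned_total)
cpm_aligned = cpm(counts["miR-1"], aligned_total)
print(cpm_assigned, cpm_aligned)
```

The same feature gets a very different CPM depending on the denominator, so the two definitions should not be mixed when comparing samples or RNA classes.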
Sum of mapped reads in each sample?
Sorry, I didn't understand. Did you keep those lines in the first case and delete them before using DGEList? If that is the case, I agree.
More specifically, my question is: if I exclude those lines, will edgeR normalize the data by the total number of mapped reads or by the number of reads mapped to genes (or to something else, like miRNAs in my case)?
Sorry for being unclear. I meant that the input to edgeR is the entire HTSeq output; it doesn't need to be edited prior to running the script.
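As a rough illustration of why the "__" rows matter for the library size, the sketch below separates gene rows from the summary rows in htseq-count style text and sums each set. It is plain Python with made-up counts; the helper `split_htseq_rows` is hypothetical and not part of edgeR or HTSeq. As far as I know, edgeR's DGEList uses the column sums of the count matrix as the default library size, so whichever rows remain in the matrix end up in the CPM denominator.

```python
import io

# Made-up htseq-count style output: tab-separated "feature<TAB>count".
htseq_text = """geneA\t120
geneB\t80
__no_feature\t500
__ambiguous\t40
__too_low_aQual\t10
__not_aligned\t200
__alignment_not_unique\t50
"""

def split_htseq_rows(handle):
    """Return (gene_counts, summary_counts) from htseq-count style lines."""
    genes, summary = {}, {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        name, value = line.split("\t")
        # Rows whose name starts with "__" are summary counters, not genes.
        target = summary if name.startswith("__") else genes
        target[name] = int(value)
    return genes, summary

genes, summary = split_htseq_rows(io.StringIO(htseq_text))

# Library size if only gene rows are kept vs. if every row is kept.
lib_size_genes_only = sum(genes.values())
lib_size_all_rows = lib_size_genes_only + sum(summary.values())
print(lib_size_genes_only, lib_size_all_rows)
```

Keeping the summary counts in a separate structure leaves them available for reporting, for example the fraction of reads assigned to features, without letting them enter the count matrix.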