Should I keep all HTSeq table informations for edgeR analysis?
3
0
Entering edit mode
5.5 years ago
silas008 ▴ 170

I am using edgeR to analyse a RNAseq data and I do not know if I should keep the information below for the analysis:

__no_feature
__ambiguous
__too_low_aQual
__not_aligned
__alignment_not_unique

I think I should keep them because I think edgeR need the information of the reads counted in these features to calculate CPM and logCPM but I am not sure.

Can someone help me?

Thank you

htseq edgeR RNAseq • 3.2k views
ADD COMMENT
2
Entering edit mode
5.5 years ago
AK ★ 2.2k

I noticed this when reading the codes at StatQuest: Filtering genes with low read counts. (line 97 - 103):

If you used htseq-count to count the reads per gene, then the last five rows of data are not for specific genes, but are summaries of how the counting went. We must remove them from our new data frame, or else edgeR will treat those last 5 rows as individual genes and throw off the statistics.

So you have to filter out the last five lines in the raw *.htseq, otherwise, your DGEList object will keep containing that summary information:

> head(raw.data)
              id  wt1  wt2  wt3  ko1  ko2  ko3
1 LOC001_0001.1    0    0    0    0    0    0
2 LOC001_0002.1   82   71   77   91   79   84
3 LOC001_0003.1 2651 2785 2759 2802 2142 3117
4 LOC001_0004.1  543  436  455  629  430  550
5 LOC001_0005.1  399  373  324  420  279  386
6 LOC001_0006.1  108   91   82  116   88  125
> y <- DGEList(counts = raw.data[, 2:ncol(raw.data)], genes = raw.data[, 1])
> tail(n = 5, y$genes)
                       genes
12858           __no_feature
12859            __ambiguous
12860        __too_low_aQual
12861          __not_aligned
12862 __alignment_not_unique
ADD COMMENT
1
Entering edit mode
5.5 years ago
ATpoint 85k

If you have the original fastq files, I suggest you use salmon to quantify them against a reference transcriptome of your choice. salmon can correct for GC bias and has an elaborate way to deal with multimappers. Following salmon read your quantifications into tximport to aggregate them to the gene level and correct for different transcript length, then feed it into edgeR. Please see the manuals of salmon and the vignettes of tximport at Bioconductor for the necessary steps.

ADD COMMENT
0
Entering edit mode
5.5 years ago
shawn.w.foley ★ 1.3k

Yes, edgeR takes the full HTSeq output in order to perform its normalizations.

ADD COMMENT
1
Entering edit mode

Are you sure edgeR uses these last 5 lines with __ prefix? I don't think those are important. Do you have any source for your claim?

For your information, here is the edgeR code: https://rdrr.io/bioc/edgeR/src/R/readDGE.R
As you can see a warning will be raised whenever a line starting with __ is read, as those are not actual gene lines.

ADD REPLY
0
Entering edit mode

I understand your point. But in this case how can edgeR know the total number of reads to calculate CPM, for example?

The total mapped reads number included those information. No?!

Thank you for helping.

ADD REPLY
1
Entering edit mode

CPM is calculated versus the total number of reads assigned to a gene, not necessarily the total number of (mapped) reads..

ADD REPLY
0
Entering edit mode

I am analysing miRNAs. If I cut those lines my CPM will be "number of reads/number of miRNA aligned reads". But in most of the cases I prefered "number of reads/number of aligned reads". For exemple, if I want to compare miRNAs with piRNAs in the same sample, if I cut those lines it is not possible to know the difference between the the total number of miRNAs e total number of piRNAs. For exemplo which of them is more expressed in my sample.

I think the best option is to keep "no feature" and "ambiguos" to have the number of total reads aligned by the aligner. Am I wrong?

ADD REPLY
0
Entering edit mode

Sum of mapped reads in each sample?

> head(raw.data)
              id  wt1  wt2  wt3  ko1  ko2  ko3
1 LOC_0001.1    0    0    0    0    0    0
2 LOC_0002.1   82   71   77   91   79   84
3 LOC_0003.1 2651 2785 2759 2802 2142 3117
4 LOC_0004.1  543  436  455  629  430  550
5 LOC_0005.1  399  373  324  420  279  386
6 LOC_0006.1  108   91   82  116   88  125

> raw.data %>% summarize_if(is.numeric, sum, na.rm = TRUE)
      wt1     wt2     wt3     ko1     ko2     ko3
1 4930639 4553016 4053316 5669966 3908059 5089968

> y <- DGEList(counts = raw.data[, 2:ncol(raw.data)], genes = raw.data[, 1])
> y$samples
    group lib.size norm.factors
wt1     1  4930639            1
wt2     1  4553016            1
wt3     1  4053316            1
ko1     1  5669966            1
ko2     1  3908059            1
ko3     1  5089968            1
ADD REPLY
0
Entering edit mode

Sorry, I didn't understand. Did you keep those lines in the first case and deleted them before use DGElist? If this is the case I agree with that.

More specifically, my question is: if I exclude those lines will edger normalize the data by the total number of mapped reads or by the number of reads mapped as genes (or something else, like miRNA, in my case?

ADD REPLY
0
Entering edit mode

Sorry for being unclear, I meant that the input to edgeR is the entire HTSeq output, it doesn't need to be edited prior to running the script.

ADD REPLY

Login before adding your answer.

Traffic: 2767 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6