A problem with low tag filtering with edgeR using rowSums(cpm(d). It is modifiyng the libraries
0
3
Entering edit mode
10.1 years ago
silvi_free88 ▴ 50

Hi all. I would really appreciate some comments in this problem that i am having.

I am trying to filter out low expressed tags with edgeR. This is my original data (eyes). It correspond to a count matrix from rsem.

> eyes = read.table("/home/user/path/to/file/eyes_counts.matrix", header=T, row.names=1, com='')
> dim(eyes)
[1] 171471     12
> head(eyes)
                    apo1_eyes apo2_eyes apo3_eyes apo4_eyes lux1_eyes lux2_eyes
comp230955_c0_seq10      0.00      0.00      0.00      0.00      0.00      0.00
comp135535_c0_seq1       1.00      1.00      0.00      0.00      0.00      0.00
comp222610_c5_seq4       1.37      4.61     18.68      3.88      1.89      5.72
comp227842_c8_seq3       0.00      0.00      0.00     12.27      0.00      0.00
comp230019_c3_seq1      71.04    141.78    639.86    426.29    356.35    497.67
comp215198_c1_seq3       0.00     52.18      5.78     21.78      0.00      0.00
                    lux3_eyes lux4_eyes WT1_eyes WT2_eyes WT3_eyes WT4_eyes
comp230955_c0_seq10      0.00         0        0     0.00        0     0.00
comp135535_c0_seq1       1.00         0        1     0.00        0     0.00
comp222610_c5_seq4       3.95         0        0     7.53        0     0.00
comp227842_c8_seq3       0.00         0        0     0.00        0     0.00
comp230019_c3_seq1     212.78         0        0  1136.26        0   213.10
comp215198_c1_seq3       0.00         0        0     0.00        0    36.03

I have 3 groups and 4 replicates per group. I seek tags that have 2 count per million for at least 4 libraries.

I run the command:

> keep_cpm <- rowSums(cpm(eyes) >2 ) >= 4
> eyes <- eyes[keep_cpm, ]

And I obtain:

> dim(eyes)
[1] 26840    12
> head(eyes)
                    apo1_eyes apo2_eyes apo3_eyes apo4_eyes lux1_eyes lux2_eyes
comp230019_c3_seq1      71.04    141.78    639.86    426.29    356.35    497.67
comp227814_c2_seq1      29.00     46.93    151.07    263.05    202.05     39.85
comp220002_c0_seq1      71.00    255.00    297.00    321.00    200.00    148.00
comp201532_c0_seq1      15.11     44.95     59.00    166.57     81.07     21.00
comp220971_c10_seq2     53.00    255.00    191.00    311.00    100.00    160.00
comp229841_c2_seq3      27.64    112.85     59.96    130.86     49.80     89.28
                    lux3_eyes lux4_eyes WT1_eyes WT2_eyes WT3_eyes WT4_eyes
comp230019_c3_seq1     212.78         0     0.00  1136.26        0   213.10
comp227814_c2_seq1     191.42         0    68.00    83.00        0   108.93
comp220002_c0_seq1     232.00         0   133.00   303.00        0   167.00
comp201532_c0_seq1      52.00         0    28.00    56.00        0    94.89
comp220971_c10_seq2    115.54         0    86.00   219.00        0   174.00
comp229841_c2_seq3      61.99         0    31.77    90.46        0    68.73

Regarding number it seems that it has work well. But the samples Lux4 and WT2 have mostly not counts. It is something that is not happening in the original data. Not only that, if I compare both files the counts are completly different!! I put here an example:

This is the same genes in both data files (original and cpm filtered):

Filtered data file:

id                        apo1_eyes     apo2_eyes     apo3_eyes     apo4_eyes     lux1_eyes     lux2_eyes     lux3_eyes     lux4_eyes     WT1_eyes     WT2_eyes     WT3_eyes     WT4_eyes
comp230019_c3_seq1        71.04         141.78        639.86        426.29        356.35        497.67        212.78        0             0            1136.26      0            213.1

Counts of the same gene in the original data file:

id                          apo1_eyes     apo2_eyes     apo3_eyes     apo4_eyes     lux1_eyes     lux2_eyes     lux3_eyes     lux4_eyes     WT1_eyes     WT2_eyes     WT3_eyes     WT4_eyes
comp230019_c3_seq1          0             0             639.89        284.34        356.35        497.65        70.84         142.3         0            1136.27      0            142.33

I do not understand what is happening. Some of the number counts are maintained (like apo3) but not the others.

Any idea of what could be happening? Am I doing something wrong?

Thank you!!

rna-seq R • 4.1k views
ADD COMMENT
0
Entering edit mode

In the code should be a 4 instead of 3. Like this: keep_cpm <- rowSums(cpm(eyes) >2 ) >= 4

Sorry about that.

Any comment would be appreciated thanks!

ADD REPLY
0
Entering edit mode

This change has been made. Please note that you can always edit your posts (questions/answers) as well as comments, and also use moderation options to delete or move them.

ADD REPLY
0
Entering edit mode

The comp230019_c3_seq1 example has the same values before/after filtering in the example posted at the top, but not the single-line examples at the bottom. That suggests that there's something happening before you perform filtering.

Can you post the original file somewhere? That'd allow us to have a look and see exactly what's going on. Also please post the exact lines of code you're using before filtering the count matrix.

BTW, DO NOT USE THIS DATA WITH edgeR! These are expected counts, not integer counts. You can, however, use these with limma (put them through voom() first).

ADD REPLY
0
Entering edit mode

hi Devon, Thank you for your reply.

I am not doing anything with the data befote filtering it. So there are not extra code lines to add.

The comp230019_c3_seq1 has NOT the same values before/after filtering. in the example posted at the top where you can see the values corresponding to comp230019_c3_seq1 (second list) is a part of the list of genes AFTER FILTERING. I look for the values as an example of the same component on the original data and that is what you can see on the bottom.

Do you know what could be happening? Could be due because of the writing table step? I will edit the code for that in the post.

Regarding the data. I know I have to use integer counts. I am rounding it like this:

> eyes = round(eyes)

I thought it was correct that way, It is?

Thanks!

ADD REPLY
0
Entering edit mode

Please don't round your data. I know a lot of people write that that can be done, but if you ask the edgeR or DESeq2 authors they'll give you some excellent reasons not to do that (search the bioconductor support site). Please use limma instead.

Regarding the filtering, the values should never change due to it. If you can post the data somewhere, then we can try to determine why this is happening.

ADD REPLY

Login before adding your answer.

Traffic: 1540 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6