Hi all. I would really appreciate some comments in this problem that i am having.
I am trying to filter out low expressed tags with edgeR. This is my original data (eyes). It correspond to a count matrix from rsem.
> eyes = read.table("/home/user/path/to/file/eyes_counts.matrix", header=T, row.names=1, com='')
> dim(eyes)
[1] 171471 12
> head(eyes)
apo1_eyes apo2_eyes apo3_eyes apo4_eyes lux1_eyes lux2_eyes
comp230955_c0_seq10 0.00 0.00 0.00 0.00 0.00 0.00
comp135535_c0_seq1 1.00 1.00 0.00 0.00 0.00 0.00
comp222610_c5_seq4 1.37 4.61 18.68 3.88 1.89 5.72
comp227842_c8_seq3 0.00 0.00 0.00 12.27 0.00 0.00
comp230019_c3_seq1 71.04 141.78 639.86 426.29 356.35 497.67
comp215198_c1_seq3 0.00 52.18 5.78 21.78 0.00 0.00
lux3_eyes lux4_eyes WT1_eyes WT2_eyes WT3_eyes WT4_eyes
comp230955_c0_seq10 0.00 0 0 0.00 0 0.00
comp135535_c0_seq1 1.00 0 1 0.00 0 0.00
comp222610_c5_seq4 3.95 0 0 7.53 0 0.00
comp227842_c8_seq3 0.00 0 0 0.00 0 0.00
comp230019_c3_seq1 212.78 0 0 1136.26 0 213.10
comp215198_c1_seq3 0.00 0 0 0.00 0 36.03
I have 3 groups and 4 replicates per group. I seek tags that have 2 count per million for at least 4 libraries.
I run the command:
> keep_cpm <- rowSums(cpm(eyes) >2 ) >= 4
> eyes <- eyes[keep_cpm, ]
And I obtain:
> dim(eyes)
[1] 26840 12
> head(eyes)
apo1_eyes apo2_eyes apo3_eyes apo4_eyes lux1_eyes lux2_eyes
comp230019_c3_seq1 71.04 141.78 639.86 426.29 356.35 497.67
comp227814_c2_seq1 29.00 46.93 151.07 263.05 202.05 39.85
comp220002_c0_seq1 71.00 255.00 297.00 321.00 200.00 148.00
comp201532_c0_seq1 15.11 44.95 59.00 166.57 81.07 21.00
comp220971_c10_seq2 53.00 255.00 191.00 311.00 100.00 160.00
comp229841_c2_seq3 27.64 112.85 59.96 130.86 49.80 89.28
lux3_eyes lux4_eyes WT1_eyes WT2_eyes WT3_eyes WT4_eyes
comp230019_c3_seq1 212.78 0 0.00 1136.26 0 213.10
comp227814_c2_seq1 191.42 0 68.00 83.00 0 108.93
comp220002_c0_seq1 232.00 0 133.00 303.00 0 167.00
comp201532_c0_seq1 52.00 0 28.00 56.00 0 94.89
comp220971_c10_seq2 115.54 0 86.00 219.00 0 174.00
comp229841_c2_seq3 61.99 0 31.77 90.46 0 68.73
Regarding number it seems that it has work well. But the samples Lux4 and WT2 have mostly not counts. It is something that is not happening in the original data. Not only that, if I compare both files the counts are completly different!! I put here an example:
This is the same genes in both data files (original and cpm filtered):
Filtered data file:
id apo1_eyes apo2_eyes apo3_eyes apo4_eyes lux1_eyes lux2_eyes lux3_eyes lux4_eyes WT1_eyes WT2_eyes WT3_eyes WT4_eyes
comp230019_c3_seq1 71.04 141.78 639.86 426.29 356.35 497.67 212.78 0 0 1136.26 0 213.1
Counts of the same gene in the original data file:
id apo1_eyes apo2_eyes apo3_eyes apo4_eyes lux1_eyes lux2_eyes lux3_eyes lux4_eyes WT1_eyes WT2_eyes WT3_eyes WT4_eyes
comp230019_c3_seq1 0 0 639.89 284.34 356.35 497.65 70.84 142.3 0 1136.27 0 142.33
I do not understand what is happening. Some of the number counts are maintained (like apo3
) but not the others.
Any idea of what could be happening? Am I doing something wrong?
Thank you!!
In the code should be a 4 instead of 3. Like this:
keep_cpm <- rowSums(cpm(eyes) >2 ) >= 4
Sorry about that.
Any comment would be appreciated thanks!
This change has been made. Please note that you can always edit your posts (questions/answers) as well as comments, and also use moderation options to delete or move them.
The
comp230019_c3_seq1
example has the same values before/after filtering in the example posted at the top, but not the single-line examples at the bottom. That suggests that there's something happening before you perform filtering.Can you post the original file somewhere? That'd allow us to have a look and see exactly what's going on. Also please post the exact lines of code you're using before filtering the count matrix.
BTW, DO NOT USE THIS DATA WITH edgeR! These are expected counts, not integer counts. You can, however, use these with limma (put them through
voom()
first).hi Devon, Thank you for your reply.
I am not doing anything with the data befote filtering it. So there are not extra code lines to add.
The
comp230019_c3_seq1
has NOT the same values before/after filtering. in the example posted at the top where you can see the values corresponding tocomp230019_c3_seq1
(second list) is a part of the list of genes AFTER FILTERING. I look for the values as an example of the same component on the original data and that is what you can see on the bottom.Do you know what could be happening? Could be due because of the writing table step? I will edit the code for that in the post.
Regarding the data. I know I have to use integer counts. I am rounding it like this:
I thought it was correct that way, It is?
Thanks!
Please don't round your data. I know a lot of people write that that can be done, but if you ask the edgeR or DESeq2 authors they'll give you some excellent reasons not to do that (search the bioconductor support site). Please use limma instead.
Regarding the filtering, the values should never change due to it. If you can post the data somewhere, then we can try to determine why this is happening.