edgeR DGElist Error: Negative counts not allowed.
2
0
Entering edit mode
5.4 years ago
ma23 ▴ 40

Hi !

I have a table (.tsv) with data, here are several rows from the top:

GeneID  SRR1177686      SRR1026955      SRR1027004      SRR1026928      SRR1177692      SRR1026905      SRR1026942      SRR1177684      SRR1026984      SRR1026912

ENSG00000223972 0       1       0       1       0       0       0       0       1       0       0       0       1       1       0       0       0       0       1

ENSG00000227232 32      38      34      81      38      15      7       47      93      68      4       17      24      31      46      19      0       69      41

ENSG00000278267 2       6       7       11      7       1       4       10      5       2       1       0       1       1       4       2       0       27      2

I read the file into R using read.delim:

countDF <- read.delim("/storage/DATA/DEE2/combined_DGE_table.tsv", row.names=1, check.names=FALSE)

And then I want to create a DGEList object:

dge <- DGEList(count=countDF)

But it throws an error "Negative counts not allowed".

What should I do to avoid that ? Check if there are any NA or negatives in my data and remove them ?

edgeR • 6.0k views
ADD COMMENT
2
Entering edit mode

Try to see how your countDF looks like, and see if there are indeed negatives or something else that is weird.

summary(countDF) # Look by min. of each sample to see if they are 0 or negative
ADD REPLY
2
Entering edit mode

If you correct for batch effect via ComBat, then you will likely have negative counts. This is why nobody should be using ComBat for RNA-seq counts.

ADD REPLY
1
Entering edit mode

I'm confused by your DGEList() call.

What is the object y?

You should be using a command like: dge <- DGEList(countDF)

ADD REPLY
0
Entering edit mode

May be you could suggest some pipline or something for estimating negative binomial (NB) distribution parameters ? I want to find out whether the data satisfy the law of NB.

ADD REPLY
2
Entering edit mode

Did you find any negative counts in your df? Or other oddities? If yes, your pipeline to produce a count table is not correct. It is no use to continue with such a set, so don't bother going into testing NB distribution or not. It would be helpful to show what you did to generate the count table, and how it looks like in your R environment (like I commented before, try to see how your countDF looks like, inspect it for errors, etc.).

ADD REPLY
0
Entering edit mode

There are no negatives in my data, I have checked..

ADD REPLY
0
Entering edit mode

If you run str(MyData[,1:20]), what is the output?

ADD REPLY
0
Entering edit mode
    'data.frame':   58302 obs. of  20 variables:
     $ SRR1177686: int  0 32 2 1 0 0 0 0 0 152 ...
     $ SRR1026955: int  1 38 6 0 0 0 0 0 0 19 ...
     $ SRR1027004: int  0 34 7 0 0 0 0 0 0 32 ...
     $ SRR1026928: int  1 81 11 0 0 0 0 0 0 55 ...
     $ SRR1177692: int  0 38 7 0 0 0 0 2 0 11 ...
     $ SRR1026905: int  0 15 1 0 0 0 0 0 0 12 ...
     $ SRR1026942: int  0 7 4 0 0 0 0 0 0 14 ...
     $ SRR1177684: int  0 47 10 0 0 0 0 0 0 38 ...
     $ SRR1026984: int  1 93 5 0 0 0 0 0 0 33 ...
     $ SRR1026912: int  0 68 2 0 0 0 0 0 0 227 ...
     $ SRR1026987: int  0 4 1 0 0 0 0 0 0 11 ...
     $ SRR1026974: int  0 17 0 0 0 0 0 0 0 8 ...
     $ SRR1026960: int  1 24 1 0 0 0 0 0 0 2 ...
     $ SRR1177678: int  1 31 1 0 0 0 0 0 0 67 ...
     $ SRR1026939: int  0 46 4 0 0 0 0 0 0 43 ...
     $ SRR1177734: int  0 19 2 0 0 1 0 0 0 18 ...
     $ SRR1026963: int  0 0 0 0 0 0 0 0 0 9 ...
     $ SRR1177721: int  0 69 27 0 0 0 0 0 0 57 ...
     $ SRR1026886: int  1 41 2 0 0 0 0 0 0 50 ...
     $ SRR1026897: int  0 0 0 0 0 0 0 0 0 0 ...
ADD REPLY
0
Entering edit mode

...and what happens when you coerce it to a data matrix?

DGEList(count=data.matrix(countDF))

...or, possibly:

DGEList(count=as.matrix(countDF))

And also the output of:

table(countDF>=0)
ADD REPLY
0
Entering edit mode

The output of DGEList(count=data.matrix(countDF)):

Repeated column names found in count matrix
An object of class "DGEList"
$counts
                SRR1177686 SRR1026955 SRR1027004 ......
                ........
                ........
                ........
                SRR5981512 SRR5980983 SRR5980651 SRR5981590 SRR5980436
 [ reached getOption("max.print") -- omitted 5 rows ]
58297 more rows ...

$samples
  group lib.size norm.factors
1     1  9212384            1
2     1 11409546            1
3     1 12393257            1
4     1 16799268            1
5     1  9694151            1
4131 more rows ...

Warning message:
In DGEList(count = data.matrix(countDF)) : library size of zero detected

The output for DGEList(count=as.matrix(countDF)):

Error: Negative counts not allowed

And the output for table(countDF>=0):

TRUE 
240612354 
Warning messages:
1: In Ops.factor(left, right) : '>=' not meaningful for factors
2: In Ops.factor(left, right) : '>=' not meaningful for factors
3: In Ops.factor(left, right) : '>=' not meaningful for factors
4: In Ops.factor(left, right) : '>=' not meaningful for factors
5: In Ops.factor(left, right) : '>=' not meaningful for factors
6: In Ops.factor(left, right) : '>=' not meaningful for factors
7: In Ops.factor(left, right) : '>=' not meaningful for factors
8: In Ops.factor(left, right) : '>=' not meaningful for factors
9: In Ops.factor(left, right) : '>=' not meaningful for factors
ADD REPLY
0
Entering edit mode

There seems to be something wrong with your data frame (but we can't see what from your output), the last command suggests factors instead of integers. Can you also check the whole data frame (instead of only the first 20 columns)?

sapply(countDF, class)
ADD REPLY
0
Entering edit mode

I can't put here the hole output becouse it is too large. Here is a part of it:

SRR1177686 SRR1026955 SRR1027004 SRR1026928 SRR1177692 SRR1026905 SRR1026942 SRR1177684 
"integer" "integer" "integer" "integer" "integer" "integer" "integer" "integer"
...............................................................................................................
SRR2049767  GeneID SRR3235827 SRR3235971 SRR3235963 SRR3235891 SRR3236033 SRR3235882 
"integer"   "factor"  "integer"  "integer"  "integer"  "integer"  "integer"  "integer" 
...............................................................................................................
SRR3235892 SRR3235793 SRR3235839 SRR3235932 SRR3235878 SRR3236073 SRR3235854 SRR3235818 
 "integer"  "integer"  "integer"  "integer"  "integer"  "integer"  "integer"  "integer" 
 [ reached getOption("max.print") -- omitted 3136 entries ]
ADD REPLY
0
Entering edit mode

Yeah, my point was, that you could look for any "factor" columns instead of "integer", the warning seems to find 9 "factors". Trouble shoot your own data is part of bioinformatics, try to figure out how your data frame looks like. I suspect some columns with factors at the end maybe? Maybe try to see the last 9 (how many columns does your data frame has?).

sapply(countDF[, 3000:3009], class) # See the last 9 columns assuming you have 3009 columns
ADD REPLY
0
Entering edit mode

Here they are:

SRR5981836 SRR5982070 SRR5980672 SRR5981018 SRR5981994 SRR5981512 SRR5980983 SRR5980651 
 "integer"  "integer"  "integer"  "integer"  "integer"  "integer"  "integer"  "integer" 
SRR5981590 SRR5980436 
 "integer"  "integer"
ADD REPLY
0
Entering edit mode

Okay I hope you can trouble shoot it (yourself), check if the dimensions are the same as expected, if the headers and rownames are correct, etc. Good luck!

ADD REPLY
1
Entering edit mode

BTW, there is GeneID as a column in your dataset, see your two posts up. This is a factor, like expected your data frame is not correct.

ADD REPLY
0
Entering edit mode

Yes, I see. But I don't understand why is it there .. It should only be the first column.

ADD REPLY
1
Entering edit mode

I don't understand either, I haven't seen your code. But at least now you know what the problem is so you can fix it. Good luck with it!

ADD REPLY
1
Entering edit mode

Ok, thank you for the help !

ADD REPLY
0
Entering edit mode

Both DESeq2 and EdgeR (I believe) model your data as a negative binomial. There is in-built QC in these packages for you to check how good was the model fit

ADD REPLY
2
Entering edit mode
5.4 years ago

What should I do to avoid that ? Check if there are any NA or negatives in my data

Yes.

and remove them ?

No. You should find out why these values are in your count table to begin with and re-generate the count table in a way that will not have values that clearly cannot represent read counts.

ADD COMMENT
2
Entering edit mode
5.4 years ago
Gordon Smyth ★ 7.7k

From the information shown in your comments above, it appears that your file combined_DGE_table.tsv actually has 10 columns of GeneIDs instead of just one. It looks like someone has merged together 10 different data.frames of counts and written out the combined data.frame to the tsv file without removing the GeneID columns. The first column of each of the individual data.frames would have contained GeneIDs, so there are 10 columns of GeneIDs in the combined file.

When edgeR tries to interpret the data.frame as a matrix, the GeneID columns will force the matrix to be of character mode instead of numeric, and any GeneID that starts with punctuation (like _ENS) will be interpreted by R as being negative. Hence the error message.

You can remove the GeneID columns by

cl <- sapply(countDF, class)
j <- which(cl == "factor")
countDF <- countDF[, -j]
dge <- DGEList(countDF)

If you upgrade to the latest Bioconductor release, which is Bioconductor 3.9 for R 3.6.0, the DGEList() function now issues a much more informative error message in this type of situation.

ADD COMMENT
0
Entering edit mode

Ok, I understand it now, thank you!

ADD REPLY

Login before adding your answer.

Traffic: 1889 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6