Hi!
I have a table (.tsv) with data, here are several rows from the top:
GeneID SRR1177686 SRR1026955 SRR1027004 SRR1026928 SRR1177692 SRR1026905 SRR1026942 SRR1177684 SRR1026984 SRR1026912
ENSG00000223972 0 1 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1
ENSG00000227232 32 38 34 81 38 15 7 47 93 68 4 17 24 31 46 19 0 69 41
ENSG00000278267 2 6 7 11 7 1 4 10 5 2 1 0 1 1 4 2 0 27 2
I read the file into R using read.delim:
countDF <- read.delim("/storage/DATA/DEE2/combined_DGE_table.tsv", row.names=1, check.names=FALSE)
And then I want to create a DGEList object:
dge <- DGEList(count=countDF)
But it throws the error "Negative counts not allowed".
What should I do to avoid that? Should I check whether there are any NAs or negatives in my data and remove them?
Take a look at what your countDF looks like, and see whether there are indeed negatives or something else that is weird.
If you correct for batch effects via ComBat, then you will likely have negative counts. This is why nobody should be using ComBat for RNA-seq counts.
I'm confused by your DGEList() call.
What is the object y?
You should be using a command like:
dge <- DGEList(countDF)
Maybe you could suggest some pipeline or method for estimating the parameters of the negative binomial (NB) distribution? I want to find out whether the data follow an NB distribution.
Did you find any negative counts in your data frame? Or other oddities? If yes, your pipeline for producing the count table is not correct. There is no use continuing with such a set, so don't bother testing whether it is NB distributed or not. It would be helpful to show what you did to generate the count table and what it looks like in your R environment (as I commented before, look at your countDF, inspect it for errors, etc.).
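A minimal sketch of what "inspect it for errors" could look like in base R. The small countDF built here is a hypothetical stand-in (made-up sample names, one stray text column), not the poster's real table; on real data you would skip the first statement and run the checks on your own countDF.

```r
# Hypothetical stand-in for the real countDF: two count columns plus one
# stray non-numeric column, as can happen when a table is combined wrongly.
countDF <- data.frame(
  sampleA = c(0L, 32L, 2L),
  sampleB = c(1L, 38L, 6L),
  GeneID  = c("ENSG00000223972", "ENSG00000227232", "ENSG00000278267"),
  row.names = c("ENSG00000223972", "ENSG00000227232", "ENSG00000278267"),
  stringsAsFactors = TRUE
)

# 1. Are the dimensions what you expect (genes x samples)?
dim(countDF)

# 2. Which class does each column have? Anything that is not
#    integer/numeric does not belong in a count table.
col_classes <- vapply(countDF, function(x) class(x)[1], character(1))
table(col_classes)
bad_cols <- names(col_classes)[!col_classes %in% c("integer", "numeric")]
bad_cols

# 3. Only on the numeric columns: count NAs and negative values.
numeric_part <- countDF[, setdiff(names(countDF), bad_cols), drop = FALSE]
n_na  <- sum(is.na(numeric_part))
n_neg <- sum(numeric_part < 0, na.rm = TRUE)
c(n_na = n_na, n_neg = n_neg)
```

Checking column classes first matters because a comparison such as countDF >= 0 is not meaningful for factor or character columns, so a "no negatives" check can look clean while the table is still malformed.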
There are no negatives in my data, I have checked.
If you run str(MyData[,1:20]), what is the output? ...and what happens when you coerce it to a data matrix?
...or, possibly:
And also the output of:
The output of DGEList(count=data.matrix(countDF)):
The output for DGEList(count=as.matrix(countDF)):
And the output for table(countDF>=0):
There seems to be something wrong with your data frame (but we can't see what from your output), the last command suggests factors instead of integers. Can you also check the whole data frame (instead of only the first 20 columns)?
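To illustrate why a single factor column matters when coercing, here is a small sketch in base R. The data frame df is a made-up toy example, not the poster's data; it only shows how as.matrix() and data.matrix() treat a mixed data frame differently.

```r
# Toy data frame (hypothetical): one integer column and one factor column.
df <- data.frame(
  sampleA = c(0L, 32L, 2L),
  GeneID  = factor(c("g1", "g2", "g3"))
)

# as.matrix() falls back to a character matrix as soon as one column is
# not numeric, so the counts are no longer numbers.
m_as <- as.matrix(df)
typeof(m_as)

# data.matrix() keeps everything numeric, but it does so by replacing the
# factor with its internal integer codes, which are not counts either.
m_data <- data.matrix(df)
typeof(m_data)
m_data[, "GeneID"]
```

Either way, the resulting matrix is not a valid count matrix, so the stray column has to be found and removed (or the table regenerated) before calling DGEList().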
I can't post the whole output here because it is too large. Here is a part of it:
Yeah, my point was that you could look for any "factor" columns instead of "integer"; the warning seems to find 9 "factors". Troubleshooting your own data is part of bioinformatics, so try to figure out what your data frame looks like. I suspect some factor columns at the end, maybe? Maybe try to look at the last 9 (how many columns does your data frame have?).
Here they are:
Okay, I hope you can troubleshoot it yourself: check whether the dimensions are as expected, whether the headers and row names are correct, etc. Good luck!
BTW, there is a GeneID column in your dataset, see your post two up. This is a factor; as expected, your data frame is not correct.
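As a stopgap, the non-numeric columns can be located and dropped before building the count matrix. This is only a sketch on a hypothetical toy countDF; the thread does not show why the extra GeneID column is there, so the proper fix is in whatever step produced the combined table.

```r
# Hypothetical toy table with a stray GeneID factor column.
countDF <- data.frame(
  sampleA = c(0L, 32L, 2L),
  sampleB = c(1L, 38L, 6L),
  GeneID  = factor(c("ENSG00000223972", "ENSG00000227232", "ENSG00000278267")),
  row.names = c("ENSG00000223972", "ENSG00000227232", "ENSG00000278267")
)

# Flag columns that are real numbers; factors and characters give FALSE.
is_count <- vapply(countDF, is.numeric, logical(1))
names(countDF)[!is_count]          # columns that will be dropped

# Keep only the numeric columns and convert to a numeric matrix.
counts <- as.matrix(countDF[, is_count, drop = FALSE])

# Sanity checks before handing the matrix to edgeR.
stopifnot(is.numeric(counts), !anyNA(counts), all(counts >= 0))

# With edgeR installed, the matrix can then be passed on, e.g.:
# dge <- edgeR::DGEList(counts = counts)
```

If many columns are dropped, or the remaining sample count does not match what you expect, that is a sign the header and data columns are misaligned and the table should be rebuilt rather than patched.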
Yes, I see. But I don't understand why it is there. It should only be the first column.
I don't understand either, I haven't seen your code. But at least now you know what the problem is so you can fix it. Good luck with it!
OK, thank you for the help!
Both DESeq2 and edgeR (I believe) model your data as negative binomial. There is built-in QC in these packages that lets you check how good the model fit was.
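For intuition about what those packages estimate, here is a rough base-R sketch using a method-of-moments estimate of the NB dispersion, based on the NB relation var = mu + phi * mu^2. It runs on simulated counts (all parameters below are made up for illustration) and ignores library-size differences, so it is not a substitute for the packages' own diagnostics such as edgeR's estimateDisp() with plotBCV() or DESeq2's plotDispEsts().

```r
set.seed(1)

# Simulated NB counts (assumed values, for illustration only).
n_genes   <- 200
n_samples <- 50
mu_true   <- runif(n_genes, 20, 500)   # per-gene mean
phi_true  <- 0.1                       # common dispersion

# rnbinom() uses size = 1 / dispersion; the matrix is filled column by
# column, so rep(mu_true, n_samples) gives each row (gene) its own mean.
counts <- matrix(
  rnbinom(n_genes * n_samples, mu = rep(mu_true, n_samples), size = 1 / phi_true),
  nrow = n_genes
)

# Per-gene mean and variance across samples.
gene_mean <- rowMeans(counts)
gene_var  <- apply(counts, 1, var)

# Under a Poisson model var == mean; NB data show var > mean.
frac_overdispersed <- mean(gene_var > gene_mean)

# Method-of-moments dispersion: phi = (var - mean) / mean^2.
phi_hat <- (gene_var - gene_mean) / gene_mean^2
median(phi_hat)

# A log-log mean-variance plot is a quick visual check:
# plot(gene_mean, gene_var, log = "xy"); abline(0, 1, col = "red")
```

On real data, points lying well above the var = mean line in the mean-variance plot indicate overdispersion relative to Poisson, which is the situation the NB model is meant to capture.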