Entering edit mode
10.5 years ago
Parham
★
1.6k
I am trying to optimize GAGE workflow (1) for my samples. I only have one control and one replicate. I changed parameter "each" from 4 to 1 and I got an error. How should I change it so that it fit?
> grp.idx <- rep(c("experiment", "control"), each=1)
> coldat=DataFrame(grp=factor(grp.idx))
> dds <- DESeqDataSetFromMatrix(cnts, colData=coldat, design = ~ grp)
Error in validObject(.Object) :
invalid class "SummarizedExperiment" object: 'colData' nrow differs from 'assays' ncol
Cheers!
What's the output of
ncol(cnts)
?Right, so 8 samples with only 2 having a group assignment. The number of rows of
coldat
needs to equal the number of samples (i.e., the number of columns incnts
). Otherwise, you end up with samples with no group assignment and DESeq2 has no clue what you really want to do (thus, the error).I see! Then I think I am doing it totally wrong may be! I thought this workflow is something general with basic commands that can be applied to my data so I can get start with! But it turned out more complicated than it seemed! I managed to map my genes with Trapnel et al 2012 protocol for tophat and cufflinks and was quiet successful and I was looking for something like that to learn with, but this one is very confusing to me.
I haven't changed anything in the workflow and tried it as it was, and surprisingly there are 8 samples?! Shouldn't it be two based on one "experiment" and one "control"? How do I change rows of
coldat
?!Thanks Devon, you helped a lot so far.
It is a general workflow, but nothing can be so general that you give incorrect sample-group assignments (or wrong file names, or other things of that sort).
I would have to know how you made
cnts
to know why it has more columns than the two you expected. There should be one column for each sample, so either you actually do have more samples or you did something wrong when creating that matrix (that'd be my guess). You're makingcoldat
fromgrp.idx
, so just change how you make that. Having said that, don't just randomly change things until the error goes away. The values need to accurately describe the samples, or else you end up with some control samples being in the experiment group (and vice versa) and then the results are meaningless.You might want to get a book on using R, since knowing the basic concepts will make your life vastly easier (most example workflows assume you know at least the basics).
All right, I see your point. Now I got a book on R and I am trying to get a better understanding of it. Since I am a biology grad student and with very little knowledge on statistics and bioinfs, its very useful to get tips from people like you to show the "how to" path. However, now I think I get a better output from the scripts, at least until some points. I am including all the scripts I wrote for my experiment. I have two conditions and one replicate each! Named them pfh1 and pfh1-de!
from
gnCnt
it seems that it sorts my data correctly, having 7017 gene (expected) in 2 columns (conditions). Am I right?From this step I don't understand the concept of loading hnrnp data from gageData! Is that something I have to skip it for my analysis? That was an amateur mistake! Because the output of
dim(pfh1.cnts)
is[1] 7017 2
!I generally don't understand these following scripts! If you can kindly explain them please:
Thanks for reading all this!
Cheers!
Yeah, it looks like
gnCnt
has the correct form now. The last three lines just load an example dataset so that you can actually run things in R without having your own data. There's no reason for you to bother with those.Thanks, what is happening here?
The first line can be broken into two parts:
rowSums(cnts) != 0
, which compares the sum of each row to zero and returnsTrue
if the sum > 0; and thensel.rn =
, which just saves the vector ofTrue
/False
values. The purpose of that is to just see which rows aren't all zeros. Thencnts=cnts[sel.rn,]
subsetscnts
to contain only those rows.I see, but then it seems I get another problem, when I run it on my data! I lose all genes! All of them have zero values?!
How can I avoid this?
Check that the chromosome names of the BAM files match those of txdb. If they do, look at things in IGV or another genome browser.
How do I check the chromosome names?
From the command line:
samtools view -H foo.bam
(changefoo.bam
to the name of a BAM file). For that genome, the chromosome should beI
,II
,III
,MT
andAB325691
.Yes it looks it has the correct headers! What should I seek for in IGV?
I also checked IGV. However I didn't know what you exactly mean but I randomly checked some genes and their mapped reads, which looked fine! Is there any thing specific I have to look on?
There's nothing specific. The goal it just to eye-ball a couple genes and see if they actually have reads mapping to them. If they do and the counts you're getting with
summarizeOverlaps()
to even remotely match, then something is going wrong at that step.Do you have any idea what is going wrong?!
Not without seeing the data. You can always post the BAM file somewhere and I or someone else can just have a look.
All right! I can do that but I don't know where do people upload! If you could provide some links please!
Dropbox, google drive, just google for "large file send"
Hi Devon,
I have been having troubles with getting gage working and was following this thread to sort the issue. My chromosomes have the wrong headers. Is there an easy way to change these?
cheers in advance
Hi Justin,
Please post things like this as new questions, they're easier to track and will receive quicker responses that way.
SummarizedExperiment objects inherit the
seqlevels()
accessor. So you can just change all of the names in one relatively simple command, likeseqlevels(se) <- sprintf("chr%s", as.character(seqlevels(se)))
. Something along those lines should work for you.You have single-end data, so you need this
instead of this
Also, this
should instead be this
Thanks a million Devon! It is fine now!
Devon may I ask how you looked up for single-end/pair-end in the BAM file please?
I don't recall exactly what I did 4 months ago. If I had the BAM file, then I just looked at the flag fields. If not, I see that the header in one of your previous comments has the tophat command used, and that's only using single-end reads, so perhaps I went off of that (though I vaguely remember just looking at the dataset).
Yes, mine was a stupid question! I shared the BAM file here and you just checked something and said that I have single-end data! So do I have to convert BAM to SAM and check for flags?! If so is there any place explaining the flags?
Check here or the SAM specification if you really want more details.
BTW,
rep(c("experiment", "control"), each=1)
is the same asc("experiment", "control")
.