Entering edit mode
9.3 years ago
jambler24
▴
30
I've been doing differential expression analysis between 24 different samples in a pairwise fashion (1 vs 12, 2, vs 13 etc...) and for some reason two of the cuffDiff runs seem to be missing one of the samples reads:
For the genes.count_tracking
file:
tracking_id S11_count S11_count_variance S11_count_uncertainty_var S11_count_dispersion_var S11_status S23_count S23_count_variance S23_count_uncertainty_var S23_count_dispersion_var S23_status
XLOC_000001 33437.6 7.76171e+07 0 7.10768e+07 OK 0 8.85715 0 0 OK
XLOC_000002 9653.58 2.06459e+07 0 2.05202e+07 OK 0 3.79735 0 0 OK
XLOC_000003 0 1 0 0 OK 0.986189 2 0 1.59757 OK
XLOC_000004 78.2659 166367 0 166366 OK 3.94476 7 0 6.39026 OK
XLOC_000005 2.55775 5437 0 5436.89 OK 2.95857 5 0 4.7927 OK
XLOC_000006 0 1 0 0 OK 0 1 0 0 OK
XLOC_000007 5.06907 10776 0 10775.1 OK 0 1 0 0 OK
XLOC_000008 637.689 9475.35 0 9486.91 OK 0 0 0 0 OK
XLOC_000009 124.699 525 0 524.256 OK 0 0 0 0 OK
XLOC_000010 694.462 11220 0 11220 OK 0 0 0 0 OK
For the cds.read_group_tracking
file:
tracking_id condition replicate raw_frags internal_scaled_frags external_scaled_frags FPKM effective_length status
P1 S11 0 22813.6 23128.8 23128.8 396.39 - OK
P1 S23 0 0 0 0 0 - OK
P10 S11 0 0 0 0 0 - OK
P10 S23 0 0 0 0 0 - OK
P100 S11 0 2222 2252.69 2252.69 372.465 - OK
P100 S23 0 1 0.986189 0.986189 -nan - OK
P1000 S11 0 2250 2281.08 2281.08 529.99 - OK
P1000 S23 0 0 0 0 0 - OK
P1001 S11 0 2208 2238.5 2238.5 106.648 - OK
P1001 S23 0 1 0.986189 0.986189 -nan - OK
P1002 S11 0 19030 19292.9 19292.9 1234.82 - OK
P1002 S23 0 0 0 0 0 - OK
P1003 S11 0 4986 5054.88 5054.88 611.858 - OK
P1003 S23 0 0 0 0 0 - OK
P1004 S11 0 1308 1326.07 1326.07 179.374 - OK
P1004 S23 0 0 0 0 0 - OK
P1005 S11 0 3439 3486.51 3486.51 188.68 - OK
P1005 S23 0 0 0 0 0 - OK
P1006 S11 0 1144 1159.8 1159.8 388.503 - OK
P1006 S23 0 0 0 0 0 - OK
P1007 S11 0 3555 3604.11 3604.11 145.245 - OK
P1007 S23 0 1 0.986189 0.986189 -nan - OK
P1008 S11 0 5202 5273.86 5273.86 1181.96 - OK
P1008 S23 0 0 0 0 0 - OK
etc...
The other runs worked perfectly, its just two that have this problem. Around 95% of the reads aligned to the reference genome for all the samples, and there are no errors being thrown by any of the tools in the pipeline.
Any ideas what else could be causing this?
Do I understand your question correctly and cuffdiff skips the first sam/bam/cxb file? If you use the sample-sheet option to pass these files to cuffdiff, you may have forgotten to write a header line in the sample-sheet file.
The command I used is:
And is the same for the other comparisons (e.g. sample 4 vs sample 16) which worked. That's the strange part.
Maybe it helps using the sample-sheet option. This should be less error prone anyway...
You can find an explanation of this here:
http://cole-trapnell-lab.github.io/cufflinks/file_formats/index.html#sample-sheets-for-cuffdiff-and-cuffnorm
Have a look at the BAM files in IGV and see if the numbers make any sense.
BTW, I hope you realize that pairwise comparisons between samples aren't going to produce very meaningful results.
Had a look, but nothing was immediately apparent... Still looking into it while I wait for the run using a sample sheet. And yes, don't worry, this was just for the ''make sure it runs" phase :)
Why so Devon? Even when I run n samples, the output is for all C(n,2) combinations. What is the difference between running each pair individually and letting cuffdiff do the pairing for me?
Any 1 vs. 1 comparison will be largely meaningless. You're fitting a model with no degrees of freedom and can't use the results to predict anything aside from differences between the two samples being compared (i.e., you found that two people/animals/whatever are different...so what?). The only time I can think of this being useful is in the clinic where each patient is an N of 1, but even there you'd really want a background population of people to compare them against in order for the results to be very interpretable.
Pairwise comparisons between groups, on the other hand, often makes total sense.
Yes, that's what we are doing - pairwise comparisons among cohorts. Ah I see the "between samples" now.