Here is the issue I have:
#Making a DBA object from samplesheet
db <- dba(sampleSheet = samplesheet)
#Count reads
db.cnt <- dba.count(db, minOverlap = 2, score = DBA_SCORE_RPKM, bParallel = FALSE)
Now at this stage, db.cnt
has a peaks (db.cnt$peaks
) list which contains the merged peaks in each sample with their read counts, and it also has a merged
(db.cnt$merged
) dataframe which has all the peaks. However, I noticed that the db.cnt$peaks
has "chr" in chromosome names, but db.cnt$merged
only has {1..24}, no "chr" prefix. However, if you check the distribution of chromosomes, there's a huge problem I found:
>table(db.cnt$peaks[[1]]$Chr)[1:12]
chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr2
12734 8012 5744 6464 3905 4452 4418 3419 5626 1866 4360 15566
> table(db.cnt$merged[,1])[1:12]
1 2 3 4 5 6 7 8 9 10 11 12
12734 8012 5744 6464 3905 4452 4418 3419 5626 1866 4360 15566
Basically it converts a character vector to numeric, and chr10 becomes 2 in db.cnt$merged
data frame and so on.
This seems like a too naive a mistake to be made by such a big package, but I failed to see what's going on here?
Can anyone else check and confirm?
By default when converting a character to a factor in R, the factor levels will be in alphabetical order.
Factors are actually fancy integers under the hood.
So those numbers you are seeing are just the integer representation of the factor order of the chromosomes.
There are numerous ways this could happen (usually coercion of vector type), but I'm too lazy at the moment to check the source code.
You may want to consider opening your issue over at the Bioconductor Support Site, where the dev will be more likely to see your issue.