Question

Major issue with DiffBind 3?

0

Entering edit mode

4.6 years ago

halffedelf ▴ 40

Here is the issue I have:

#Making a DBA object from samplesheet
db <- dba(sampleSheet = samplesheet)
#Count reads
db.cnt <- dba.count(db, minOverlap = 2, score = DBA_SCORE_RPKM, bParallel = FALSE)

Now at this stage, db.cnt has a peaks (db.cnt$peaks) list which contains the merged peaks in each sample with their read counts, and it also has a merged (db.cnt$merged) dataframe which has all the peaks. However, I noticed that the db.cnt$peaks has "chr" in chromosome names, but db.cnt$merged only has {1..24}, no "chr" prefix. However, if you check the distribution of chromosomes, there's a huge problem I found:

>table(db.cnt$peaks[[1]]$Chr)[1:12]

 chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19  chr2
12734  8012  5744  6464  3905  4452  4418  3419  5626  1866  4360 15566

> table(db.cnt$merged[,1])[1:12]

    1     2     3     4     5     6     7     8     9    10    11    12
12734  8012  5744  6464  3905  4452  4418  3419  5626  1866  4360 15566

Basically it converts a character vector to numeric, and chr10 becomes 2 in db.cnt$merged data frame and so on.

This seems like a too naive a mistake to be made by such a big package, but I failed to see what's going on here?

Can anyone else check and confirm?

atac-seq diffbind diffbind3 ChIP-Seq bioconductor • 1.3k views

ADD COMMENT • link updated 4.5 years ago by Rory Stark ★ 2.1k • written 4.6 years ago by halffedelf ▴ 40

1

Entering edit mode

By default when converting a character to a factor in R, the factor levels will be in alphabetical order.

chr <- sprintf("chr%s", seq(1, 13, 2))
> chr
[1] "chr1"  "chr3"  "chr5"  "chr7"  "chr9"  "chr11" "chr13"

> as.factor(chr)
[1] chr1  chr3  chr5  chr7  chr9  chr11 chr13
Levels: chr1 chr11 chr13 chr3 chr5 chr7 chr9

Factors are actually fancy integers under the hood.

> typeof(as.factor(chr))
[1] "integer"

So those numbers you are seeing are just the integer representation of the factor order of the chromosomes.

There are numerous ways this could happen (usually coercion of vector type), but I'm too lazy at the moment to check the source code.

ADD REPLY • link 4.6 years ago by rpolicastro 13k

0

Entering edit mode

You may want to consider opening your issue over at the Bioconductor Support Site, where the dev will be more likely to see your issue.

ADD REPLY • link 4.6 years ago by jared.andrews07 ★ 19k

score 2 · Accepted Answer · 2021-02-20

DiffBind package dev here, just seeing this now for some reason. The $merged representation is undocumented and not intended to be manipulated directly by users. It uses indices into a vector of chromosome names ($chrmap), based on factor values. These are all converted back to the correct chromosome names when using the documented interface functions. These types of data should be retrieved using the dba.peakset() function with bRetrieve=TRUE; more specific peaksets can be obtained via dba.report(), dba.overlap(), etc.