Question

Super enhancer calling using ROSE for mm39 genome: list index out of range error

0

Entering edit mode

2 days ago

Gayatri • 0

Hello,

I am trying to call super-enhancers using ROSE algorithm with mm39 genome for ChIP-seq data and encountering the following error:

MAKING TSS COLLECTION
Traceback (most recent call last):
  File "/path/to/file/ROSE/bin/ROSE_geneMapper.py", line 299, in <module>
    main()
  File "path/to/file/ROSE/bin/ROSE_geneMapper.py", line 282, in main
    enhancerToGeneTable,geneToEnhancerTable,withGenesTable = mapEnhancerToGene(annotFile,enhancerFile, uniqueGenes=True, byRefseq=options.refseq, subtractInput=options.control, transcribedFile=transcribedFile)
                                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/file/ROSE/bin/ROSE_geneMapper.py", line 140, in mapEnhancerToGene
    closestGene = allEnhancerGenes[distList.index(min(distList))]
                  ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: list index out of range

Command used:

ROSE_main.py -g MM39 -i /path/to/file/treated1.bed -r /path/to/file/treated1.sorted.bam -c /path/to/file/treated1_input.sorted.bam -o . -s 12500 -t 2500

I have tried to make a refseq file for mm39 genome as specified in the instructions here. Hence any help in why the error might be occuring is highly appreciated.

ChIP-Seq super-enhancers ROSE mm39 • 384 views

ADD COMMENT • link updated 8 hours ago by jared.andrews07 ★ 18k • written 2 days ago by Gayatri • 0

0

Entering edit mode

Did you update the source code in all the required spots as well?

ADD REPLY • link 2 days ago by jared.andrews07 ★ 18k

0

Entering edit mode

I did update the source code wherever required, changing it to MM39, adding mm39 refseq file similar to mm10 and others. The main problem is ROSE algorithm worked for control samples but is failing on the treated samples.

ADD REPLY • link 1 day ago by Gayatri • 0

0

Entering edit mode

Are you sure they were aligned to mm39 as well?

The original ROSE implementation has a fair number of problems. I'm frankly surprised it runs at all.

While incomplete, I have been building an R implementation of ROSE here. It does not match the results from the original implementation exactly due to minor differences in how coverage and calculated and a presumed bug in the original implementation that impacts the unstitching process of regions that overlap 3 or more TSSes of unique genes.

It is not well-tested, but it has a number of advantages over the original implementation, including much easier ways to change annotations used and exposing a number of hidden parameters. It also allows some monotonic transformations of signal that make the SE calls much more robust from sample to sample and not driven nearly so strongly by outliers.

It's also much, much faster (a few minutes per sample). Feel free to test it (and open issues if you run into them) if you'd like. I'd recommend installing from the dev branch - devtools::install_github("stjude-biohackathon/ECLIPSE@dev") if you do.

ADD REPLY • link 1 day ago by jared.andrews07 ★ 18k

0

Entering edit mode

I think they aligned to mm39 as each time the message 'USING MM39 AS THE GENOME' is displayed while running each part of the algorithm. But I am not sure how to check whether proper alignment is occurring. And thank you Dr. Jared, I will definitely try ECLIPSE and get back to you as soon as I get some output over here.

ADD REPLY • link 18 hours ago by Gayatri • 0

0

Entering edit mode

I tried ECLIPSE twice, once without specifying the org.db, then with specifying org.db and got the following error:

Code:

> putative_enhancers <- run_rose(treatment = sample_bam, peaks = peaks,
+                                control = control_bam, org.db = "org.Mm.eg.db")

Output:

Treatment BAM file: treated1.sorted.bam
Control BAM file: treated1_input.sorted.bam
Reading peaks
Rows: 2 Columns: 9                                                               
Column specification 
Delimiter: "\t"
chr (3): X1, X4, X6
dbl (6): X2, X3, X5, X7, X8, X9

Use `spec()` to retrieve the full column specification for this data.
Specify the column types or set `show_col_types = FALSE` to quiet this message.
Stitching peaks with stitch distance of 12500                                    
Dropped 46 peaks on chrY
Calculating normalized signal for 12895 stitched regions
Ranking regions
Classifying enhancers
Using 14933.3727976734 as cutoff for SE classification
12 super enhancers called
Annotating regions
Error in annotate_enhancers(regions, peaks, tx.db = txdb, org.db = org.db,  : 
  TxDb and OrgDb objects must be provided to perform annotation.
In addition: Warning messages:
1: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 38079 out-of-bound ranges located on sequences
  chr1_GL456210v1_random, chr1_GL456211v1_random, chr1_GL456212v1_random,
  chr1_GL456221v1_random, chr1_MU069434v1_random, chr1_GL456239v1_random,
  chr5_JH584296v1_random, chr5_JH584297v1_random, chr5_JH584298v1_random,
  chr5_GL456354v1_random, chr5_JH584299v1_random, chr7_GL456219v1_random,
  chrM, chrUn_GL456359v1, chrUn_GL456360v1, chrUn_GL456366v1,
  chrUn_GL456367v1, chrUn_GL456368v1, chrUn_GL456370v1, chrUn_GL456372v1,
  chrUn_GL456378v1, chrUn_GL456379v1, chrUn_GL456382v1, chrUn_GL456383v1,
  chrUn_GL456385v1, chrUn_GL456387v1, chrUn_GL456389v1, chrUn_GL456390v1,
  chrUn_GL456392v1, chrUn_GL456394v1, chrUn_JH584304v1, chrUn_MU069435v1,
  chrX_GL456233v2_random, chrY_JH584300v1_random, chrY_JH584301v1_random,
  and chrY_JH584303v1_random. Note that ranges located on a sequence
  whose length is unknown (NA) or on a circular sequence are not
  considered out-of-bound (use seqlengths() and isCircular() to get [... truncated]
2: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 71885 out-of-bound ranges located on sequences
  chr1_GL456210v1_random, chr1_GL456211v1_random, chr1_GL456212v1_random,
  chr1_GL456221v1_random, chr1_MU069434v1_random, chr1_GL456239v1_random,
  chr5_JH584296v1_random, chr5_JH584297v1_random, chr5_JH584298v1_random,
  chr5_GL456354v1_random, chr5_JH584299v1_random, chr7_GL456219v1_random,
  chrM, chrUn_GL456359v1, chrUn_GL456360v1, chrUn_GL456366v1,
  chrUn_GL456367v1, chrUn_GL456368v1, chrUn_GL456370v1, chrUn_GL456372v1,
  chrUn_GL456378v1, chrUn_GL456379v1, chrUn_GL456381v1, chrUn_GL456382v1,
  chrUn_GL456383v1, chrUn_GL456385v1, chrUn_GL456387v1, chrUn_GL456389v1,
  chrUn_GL456390v1, chrUn_GL456392v1, chrUn_GL456394v1, chrUn_GL456396v1,
  chrUn_JH584304v1, chrUn_MU069435v1, chrX_GL456233v2_random,
  chrY_JH584300v1_random, chrY_JH584301v1_random, chrY_JH584302v1_random,
  and chrY_JH584303v1_random. Note that ranges located on a sequence
  whose length is unknown (NA) or on a circular sequence are not
  con [... truncated]

I then tried the code for the sample for which the ROSE algorithm worked, and got a similar error for the same:

Treatment BAM file: control.sorted.bam
Control BAM file: control_input.sorted.bam
Reading peaks
Rows: 2 Columns: 9                                                               
Column specification 
Delimiter: "\t"
chr (3): X1, X4, X6
dbl (6): X2, X3, X5, X7, X8, X9

Use `spec()` to retrieve the full column specification for this data.
Specify the column types or set `show_col_types = FALSE` to quiet this message.
Stitching peaks with stitch distance of 12500                                    
Dropped 4 peaks on chrY
Calculating normalized signal for 8768 stitched regions
Ranking regions
Classifying enhancers
Using 4825.28495706356 as cutoff for SE classification
114 super enhancers called
Annotating regions
Error in annotate_enhancers(regions, peaks, tx.db = txdb, org.db = org.db,  : 
  TxDb and OrgDb objects must be provided to perform annotation.
In addition: Warning messages:
1: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 9580 out-of-bound ranges located on sequences
  chr1_GL456210v1_random, chr1_GL456211v1_random, chr1_GL456212v1_random,
  chr1_GL456221v1_random, chr1_MU069434v1_random, chr1_GL456239v1_random,
  chr4_JH584295v1_random, chr5_JH584296v1_random, chr5_JH584297v1_random,
  chr5_JH584298v1_random, chr5_GL456354v1_random, chr7_GL456219v1_random,
  chrM, chrUn_GL456368v1, chrUn_GL456370v1, chrUn_GL456372v1,
  chrUn_GL456378v1, chrUn_GL456379v1, chrUn_GL456382v1, chrUn_GL456383v1,
  chrUn_GL456385v1, chrUn_GL456390v1, chrUn_GL456392v1, chrUn_GL456394v1,
  chrUn_GL456396v1, chrUn_JH584304v1, chrUn_MU069435v1,
  chrX_GL456233v2_random, chrY_JH584300v1_random, chrY_JH584301v1_random,
  chrY_JH584302v1_random, and chrY_JH584303v1_random. Note that ranges
  located on a sequence whose length is unknown (NA) or on a circular
  sequence are not considered out-of-bound (use seqlengths() and
  isCircular() to get the lengths and circularity flags of the underlying
  sequences).  [... truncated]
2: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 30758 out-of-bound ranges located on sequences
  chr1_GL456210v1_random, chr1_GL456211v1_random, chr1_GL456212v1_random,
  chr1_GL456221v1_random, chr1_MU069434v1_random, chr1_GL456239v1_random,
  chr4_JH584295v1_random, chr5_JH584296v1_random, chr5_JH584297v1_random,
  chr5_JH584298v1_random, chr5_GL456354v1_random, chr5_JH584299v1_random,
  chr7_GL456219v1_random, chrM, chrUn_GL456359v1, chrUn_GL456360v1,
  chrUn_GL456366v1, chrUn_GL456368v1, chrUn_GL456370v1, chrUn_GL456372v1,
  chrUn_GL456378v1, chrUn_GL456381v1, chrUn_GL456382v1, chrUn_GL456383v1,
  chrUn_GL456385v1, chrUn_GL456387v1, chrUn_GL456389v1, chrUn_GL456390v1,
  chrUn_GL456392v1, chrUn_GL456396v1, chrUn_JH584304v1, chrUn_MU069435v1,
  chrX_GL456233v2_random, chrY_JH584300v1_random, chrY_JH584301v1_random,
  and chrY_JH584303v1_random. Note that ranges located on a sequence
  whose length is unknown (NA) or on a circular sequence are not
  considered out-of-bound (use seqlengths() and isCircular()  [... truncated]

I am not sure how to input TxDb and OrgDb objects, for OrgDb, I specified OrgDb of Mouse. Should I remove the random, unnamed chromosomes? Or is it something else that might be raising the error?

ADD REPLY • link 15 hours ago by Gayatri • 0

0

Entering edit mode

Oh, thanks for giving it a shot. You need to provide both TxDb and OrgDb objects, e.g.

library(TxDb.Mmusculus.UCSC.mm39.knownGene)
library(org.Mm.eg.db)

txdb <- TxDb.Mmusculus.UCSC.mm39.knownGene
org.db <- org.Mm.eg.db

putative_enhancers <- run_rose(treatment = sample_bam, peaks = peaks,
                               control = control_bam, org.db = org.db, txdb = txdb, 
                               tss.exclusion.distance = 2500,
                               max.unique.gene.tss.overlap = 2,
                               stitch.distance = 12500)

I've also shown the addition of a few parameters here that match what you were trying to do with the original ROSE calls. I have updated the vignette and README to better show how to provide the annotation data.

ADD REPLY • link 8 hours ago by jared.andrews07 ★ 18k