Question

Super enhancer calling using ROSE for mm39 genome: list index out of range error

0

Entering edit mode

3 days ago

Gayatri • 0

Hello,

I am trying to call super-enhancers using ROSE algorithm with mm39 genome for ChIP-seq data and encountering the following error:

MAKING TSS COLLECTION
Traceback (most recent call last):
  File "/path/to/file/ROSE/bin/ROSE_geneMapper.py", line 299, in <module>
    main()
  File "path/to/file/ROSE/bin/ROSE_geneMapper.py", line 282, in main
    enhancerToGeneTable,geneToEnhancerTable,withGenesTable = mapEnhancerToGene(annotFile,enhancerFile, uniqueGenes=True, byRefseq=options.refseq, subtractInput=options.control, transcribedFile=transcribedFile)
                                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/file/ROSE/bin/ROSE_geneMapper.py", line 140, in mapEnhancerToGene
    closestGene = allEnhancerGenes[distList.index(min(distList))]
                  ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: list index out of range

Command used:

ROSE_main.py -g MM39 -i /path/to/file/treated1.bed -r /path/to/file/treated1.sorted.bam -c /path/to/file/treated1_input.sorted.bam -o . -s 12500 -t 2500

I have tried to make a refseq file for mm39 genome as specified in the instructions here. Hence any help in why the error might be occuring is highly appreciated.

ChIP-Seq super-enhancers ROSE mm39 • 417 views

ADD COMMENT • link 1 hour ago by Gayatri • 0

0

Entering edit mode

Did you update the source code in all the required spots as well?

ADD REPLY • link 2 days ago by jared.andrews07 ★ 18k

0

Entering edit mode

I did update the source code wherever required, changing it to MM39, adding mm39 refseq file similar to mm10 and others. The main problem is ROSE algorithm worked for control samples but is failing on the treated samples.

ADD REPLY • link 2 days ago by Gayatri • 0

0

Entering edit mode

Are you sure they were aligned to mm39 as well?

The original ROSE implementation has a fair number of problems. I'm frankly surprised it runs at all.

While incomplete, I have been building an R implementation of ROSE here. It does not match the results from the original implementation exactly due to minor differences in how coverage and calculated and a presumed bug in the original implementation that impacts the unstitching process of regions that overlap 3 or more TSSes of unique genes.

It is not well-tested, but it has a number of advantages over the original implementation, including much easier ways to change annotations used and exposing a number of hidden parameters. It also allows some monotonic transformations of signal that make the SE calls much more robust from sample to sample and not driven nearly so strongly by outliers.

It's also much, much faster (a few minutes per sample). Feel free to test it (and open issues if you run into them) if you'd like. I'd recommend installing from the dev branch - devtools::install_github("stjude-biohackathon/ECLIPSE@dev") if you do.

ADD REPLY • link 1 day ago by jared.andrews07 ★ 18k

0

Entering edit mode

I think they aligned to mm39 as each time the message 'USING MM39 AS THE GENOME' is displayed while running each part of the algorithm. But I am not sure how to check whether proper alignment is occurring. And thank you Dr. Jared, I will definitely try ECLIPSE and get back to you as soon as I get some output over here.

ADD REPLY • link 1 day ago by Gayatri • 0

0

Entering edit mode

I tried ECLIPSE twice, once without specifying the org.db, then with specifying org.db and got the following error:

Code:

> putative_enhancers <- run_rose(treatment = sample_bam, peaks = peaks,
+                                control = control_bam, org.db = "org.Mm.eg.db")

Output:

Treatment BAM file: treated1.sorted.bam
Control BAM file: treated1_input.sorted.bam
Reading peaks
Rows: 2 Columns: 9                                                               
Column specification 
Delimiter: "\t"
chr (3): X1, X4, X6
dbl (6): X2, X3, X5, X7, X8, X9

Use `spec()` to retrieve the full column specification for this data.
Specify the column types or set `show_col_types = FALSE` to quiet this message.
Stitching peaks with stitch distance of 12500                                    
Dropped 46 peaks on chrY
Calculating normalized signal for 12895 stitched regions
Ranking regions
Classifying enhancers
Using 14933.3727976734 as cutoff for SE classification
12 super enhancers called
Annotating regions
Error in annotate_enhancers(regions, peaks, tx.db = txdb, org.db = org.db,  : 
  TxDb and OrgDb objects must be provided to perform annotation.
In addition: Warning messages:
1: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 38079 out-of-bound ranges located on sequences
  chr1_GL456210v1_random, chr1_GL456211v1_random, chr1_GL456212v1_random,
  chr1_GL456221v1_random, chr1_MU069434v1_random, chr1_GL456239v1_random,
  chr5_JH584296v1_random, chr5_JH584297v1_random, chr5_JH584298v1_random,
  chr5_GL456354v1_random, chr5_JH584299v1_random, chr7_GL456219v1_random,
  chrM, chrUn_GL456359v1, chrUn_GL456360v1, chrUn_GL456366v1,
  chrUn_GL456367v1, chrUn_GL456368v1, chrUn_GL456370v1, chrUn_GL456372v1,
  chrUn_GL456378v1, chrUn_GL456379v1, chrUn_GL456382v1, chrUn_GL456383v1,
  chrUn_GL456385v1, chrUn_GL456387v1, chrUn_GL456389v1, chrUn_GL456390v1,
  chrUn_GL456392v1, chrUn_GL456394v1, chrUn_JH584304v1, chrUn_MU069435v1,
  chrX_GL456233v2_random, chrY_JH584300v1_random, chrY_JH584301v1_random,
  and chrY_JH584303v1_random. Note that ranges located on a sequence
  whose length is unknown (NA) or on a circular sequence are not
  considered out-of-bound (use seqlengths() and isCircular() to get [... truncated]
2: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 71885 out-of-bound ranges located on sequences
  chr1_GL456210v1_random, chr1_GL456211v1_random, chr1_GL456212v1_random,
  chr1_GL456221v1_random, chr1_MU069434v1_random, chr1_GL456239v1_random,
  chr5_JH584296v1_random, chr5_JH584297v1_random, chr5_JH584298v1_random,
  chr5_GL456354v1_random, chr5_JH584299v1_random, chr7_GL456219v1_random,
  chrM, chrUn_GL456359v1, chrUn_GL456360v1, chrUn_GL456366v1,
  chrUn_GL456367v1, chrUn_GL456368v1, chrUn_GL456370v1, chrUn_GL456372v1,
  chrUn_GL456378v1, chrUn_GL456379v1, chrUn_GL456381v1, chrUn_GL456382v1,
  chrUn_GL456383v1, chrUn_GL456385v1, chrUn_GL456387v1, chrUn_GL456389v1,
  chrUn_GL456390v1, chrUn_GL456392v1, chrUn_GL456394v1, chrUn_GL456396v1,
  chrUn_JH584304v1, chrUn_MU069435v1, chrX_GL456233v2_random,
  chrY_JH584300v1_random, chrY_JH584301v1_random, chrY_JH584302v1_random,
  and chrY_JH584303v1_random. Note that ranges located on a sequence
  whose length is unknown (NA) or on a circular sequence are not
  con [... truncated]

I then tried the code for the sample for which the ROSE algorithm worked, and got a similar error for the same:

Treatment BAM file: control.sorted.bam
Control BAM file: control_input.sorted.bam
Reading peaks
Rows: 2 Columns: 9                                                               
Column specification 
Delimiter: "\t"
chr (3): X1, X4, X6
dbl (6): X2, X3, X5, X7, X8, X9

Use `spec()` to retrieve the full column specification for this data.
Specify the column types or set `show_col_types = FALSE` to quiet this message.
Stitching peaks with stitch distance of 12500                                    
Dropped 4 peaks on chrY
Calculating normalized signal for 8768 stitched regions
Ranking regions
Classifying enhancers
Using 4825.28495706356 as cutoff for SE classification
114 super enhancers called
Annotating regions
Error in annotate_enhancers(regions, peaks, tx.db = txdb, org.db = org.db,  : 
  TxDb and OrgDb objects must be provided to perform annotation.
In addition: Warning messages:
1: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 9580 out-of-bound ranges located on sequences
  chr1_GL456210v1_random, chr1_GL456211v1_random, chr1_GL456212v1_random,
  chr1_GL456221v1_random, chr1_MU069434v1_random, chr1_GL456239v1_random,
  chr4_JH584295v1_random, chr5_JH584296v1_random, chr5_JH584297v1_random,
  chr5_JH584298v1_random, chr5_GL456354v1_random, chr7_GL456219v1_random,
  chrM, chrUn_GL456368v1, chrUn_GL456370v1, chrUn_GL456372v1,
  chrUn_GL456378v1, chrUn_GL456379v1, chrUn_GL456382v1, chrUn_GL456383v1,
  chrUn_GL456385v1, chrUn_GL456390v1, chrUn_GL456392v1, chrUn_GL456394v1,
  chrUn_GL456396v1, chrUn_JH584304v1, chrUn_MU069435v1,
  chrX_GL456233v2_random, chrY_JH584300v1_random, chrY_JH584301v1_random,
  chrY_JH584302v1_random, and chrY_JH584303v1_random. Note that ranges
  located on a sequence whose length is unknown (NA) or on a circular
  sequence are not considered out-of-bound (use seqlengths() and
  isCircular() to get the lengths and circularity flags of the underlying
  sequences).  [... truncated]
2: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 30758 out-of-bound ranges located on sequences
  chr1_GL456210v1_random, chr1_GL456211v1_random, chr1_GL456212v1_random,
  chr1_GL456221v1_random, chr1_MU069434v1_random, chr1_GL456239v1_random,
  chr4_JH584295v1_random, chr5_JH584296v1_random, chr5_JH584297v1_random,
  chr5_JH584298v1_random, chr5_GL456354v1_random, chr5_JH584299v1_random,
  chr7_GL456219v1_random, chrM, chrUn_GL456359v1, chrUn_GL456360v1,
  chrUn_GL456366v1, chrUn_GL456368v1, chrUn_GL456370v1, chrUn_GL456372v1,
  chrUn_GL456378v1, chrUn_GL456381v1, chrUn_GL456382v1, chrUn_GL456383v1,
  chrUn_GL456385v1, chrUn_GL456387v1, chrUn_GL456389v1, chrUn_GL456390v1,
  chrUn_GL456392v1, chrUn_GL456396v1, chrUn_JH584304v1, chrUn_MU069435v1,
  chrX_GL456233v2_random, chrY_JH584300v1_random, chrY_JH584301v1_random,
  and chrY_JH584303v1_random. Note that ranges located on a sequence
  whose length is unknown (NA) or on a circular sequence are not
  considered out-of-bound (use seqlengths() and isCircular()  [... truncated]

I am not sure how to input TxDb and OrgDb objects, for OrgDb, I specified OrgDb of Mouse. Should I remove the random, unnamed chromosomes? Or is it something else that might be raising the error?

ADD REPLY • link 23 hours ago by Gayatri • 0

0

Entering edit mode

Oh, thanks for giving it a shot. You need to provide both TxDb and OrgDb objects, e.g.

library(TxDb.Mmusculus.UCSC.mm39.knownGene)
library(org.Mm.eg.db)

txdb <- TxDb.Mmusculus.UCSC.mm39.knownGene
org.db <- org.Mm.eg.db

putative_enhancers <- run_rose(treatment = sample_bam, peaks = peaks,
                               control = control_bam, org.db = org.db, txdb = txdb, 
                               tss.exclusion.distance = 2500,
                               max.unique.gene.tss.overlap = 2,
                               stitch.distance = 12500)

I've also shown the addition of a few parameters here that match what you were trying to do with the original ROSE calls. I have updated the vignette and README to better show how to provide the annotation data.

ADD REPLY • link 17 hours ago by jared.andrews07 ★ 18k

0

Entering edit mode

Thank you Dr. Jared. I used the code and it is working on both samples, but the warning is persistent in both outputs, as shown below in the output for treated sample run. Also, I noticed that only 3 SEs were called when I included TxDb with a cut-off of 12665.775570056, but when I provided annotate = FALSE, 12 SEs were called for the treated with a cut-off of 14933.3727976734, while for control sample, more than 30 SEs were called with a cut-off of 2589.83706832244.

Treatment BAM file: treated1.sorted.bam
Control BAM file: treated1_input.sorted.bam
Reading peaks
Rows: 2 Columns: 9                                                  
Column specification 
Delimiter: "\t"
chr (3): X1, X4, X6
dbl (6): X2, X3, X5, X7, X8, X9

Use `spec()` to retrieve the full column specification for this data.
Specify the column types or set `show_col_types = FALSE` to quiet this message.
Excluding peaks within TSS exclusion distance of 2500               
55343 peaks fully contained within TSS exclusion window and will be excluded from stitching
Stitching peaks with stitch distance of 12500
Dropped 43 peaks on chrY
Unstitching regions overlapping TSS from more than 2 unique genes
Unstitched 4 regions
Calculating normalized signal for 4506 stitched regions
Ranking regions
Classifying enhancers
Using 12665.775570056 as cutoff for SE classification
3 super enhancers called
Annotating regions
Filtering annotations
Adding annotations to metadata
Warning messages:
1: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 3 out-of-bound ranges located on
  sequences chr1_GL456239v1_random, chr4_JH584295v1_random,
  and chr5_JH584296v1_random. Note that ranges located on a
  sequence whose length is unknown (NA) or on a circular
  sequence are not considered out-of-bound (use seqlengths()
  and isCircular() to get the lengths and circularity flags of
  the underlying sequences). You can use trim() to trim these
  ranges. See ?`trim,GenomicRanges-method` for more
  information.
2: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 38079 out-of-bound ranges located on sequences
  chr1_GL456210v1_random, chr1_GL456211v1_random, chr1_GL456212v1_random,
  chr1_GL456221v1_random, chr1_MU069434v1_random, chr1_GL456239v1_random,
  chr5_JH584296v1_random, chr5_JH584297v1_random, chr5_JH584298v1_random,
  chr5_GL456354v1_random, chr5_JH584299v1_random, chr7_GL456219v1_random, chrM,
  chrUn_GL456359v1, chrUn_GL456360v1, chrUn_GL456366v1, chrUn_GL456367v1,
  chrUn_GL456368v1, chrUn_GL456370v1, chrUn_GL456372v1, chrUn_GL456378v1,
  chrUn_GL456379v1, chrUn_GL456382v1, chrUn_GL456383v1, chrUn_GL456385v1,
  chrUn_GL456387v1, chrUn_GL456389v1, chrUn_GL456390v1, chrUn_GL456392v1,
  chrUn_GL456394v1, chrUn_JH584304v1, chrUn_MU069435v1, chrX_GL456233v2_random,
  chrY_JH584300v1_random, chrY_JH584301v1_random, and chrY_JH584303v1_random. Note that
  ranges located on a sequence whose length is unknown (NA) or on a circular sequence are
  not considered out-of-bound (use seqlengths() and isCircular() to get t [... truncated]
3: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 71885 out-of-bound ranges located on sequences
  chr1_GL456210v1_random, chr1_GL456211v1_random, chr1_GL456212v1_random,
  chr1_GL456221v1_random, chr1_MU069434v1_random, chr1_GL456239v1_random,
  chr5_JH584296v1_random, chr5_JH584297v1_random, chr5_JH584298v1_random,
  chr5_GL456354v1_random, chr5_JH584299v1_random, chr7_GL456219v1_random, chrM,
  chrUn_GL456359v1, chrUn_GL456360v1, chrUn_GL456366v1, chrUn_GL456367v1,
  chrUn_GL456368v1, chrUn_GL456370v1, chrUn_GL456372v1, chrUn_GL456378v1,
  chrUn_GL456379v1, chrUn_GL456381v1, chrUn_GL456382v1, chrUn_GL456383v1,
  chrUn_GL456385v1, chrUn_GL456387v1, chrUn_GL456389v1, chrUn_GL456390v1,
  chrUn_GL456392v1, chrUn_GL456394v1, chrUn_GL456396v1, chrUn_JH584304v1,
  chrUn_MU069435v1, chrX_GL456233v2_random, chrY_JH584300v1_random,
  chrY_JH584301v1_random, chrY_JH584302v1_random, and chrY_JH584303v1_random. Note that
  ranges located on a sequence whose length is unknown (NA) or on a circular sequence are
  not consi [... truncated]

In addition, is there a way I can export the results for future use? And is it possible to find any instructions/manual sort of thing with detailed usage for doing various trial analyses with the parameters since this is relatively new for me? I am grateful for the ECLIPSE package Dr. Jared, thank you for developing it as well as suggesting its usage.

ADD REPLY • link 1 hour ago by Gayatri • 0