Super enhancer calling using ROSE for mm39 genome: list index out of range error
0
0
Entering edit mode
17 days ago
Gayatri • 0

Hello,

I am trying to call super-enhancers using ROSE algorithm with mm39 genome for ChIP-seq data and encountering the following error:

MAKING TSS COLLECTION
Traceback (most recent call last):
  File "/path/to/file/ROSE/bin/ROSE_geneMapper.py", line 299, in <module>
    main()
  File "path/to/file/ROSE/bin/ROSE_geneMapper.py", line 282, in main
    enhancerToGeneTable,geneToEnhancerTable,withGenesTable = mapEnhancerToGene(annotFile,enhancerFile, uniqueGenes=True, byRefseq=options.refseq, subtractInput=options.control, transcribedFile=transcribedFile)
                                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/file/ROSE/bin/ROSE_geneMapper.py", line 140, in mapEnhancerToGene
    closestGene = allEnhancerGenes[distList.index(min(distList))]
                  ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: list index out of range

Command used:

ROSE_main.py -g MM39 -i /path/to/file/treated1.bed -r /path/to/file/treated1.sorted.bam -c /path/to/file/treated1_input.sorted.bam -o . -s 12500 -t 2500

I have tried to make a refseq file for mm39 genome as specified in the instructions here. Hence any help in why the error might be occuring is highly appreciated.

ChIP-Seq super-enhancers ROSE mm39 • 773 views
ADD COMMENT
0
Entering edit mode

Did you update the source code in all the required spots as well?

ADD REPLY
0
Entering edit mode

I did update the source code wherever required, changing it to MM39, adding mm39 refseq file similar to mm10 and others. The main problem is ROSE algorithm worked for control samples but is failing on the treated samples.

ADD REPLY
1
Entering edit mode

Are you sure they were aligned to mm39 as well?

The original ROSE implementation has a fair number of problems. I'm frankly surprised it runs at all.

While incomplete, I have been building an R implementation of ROSE here. It does not match the results from the original implementation exactly due to minor differences in how coverage and calculated and a presumed bug in the original implementation that impacts the unstitching process of regions that overlap 3 or more TSSes of unique genes.

It is not well-tested, but it has a number of advantages over the original implementation, including much easier ways to change annotations used and exposing a number of hidden parameters. It also allows some monotonic transformations of signal that make the SE calls much more robust from sample to sample and not driven nearly so strongly by outliers.

It's also much, much faster (a few minutes per sample). Feel free to test it (and open issues if you run into them) if you'd like. I'd recommend installing from the dev branch - devtools::install_github("j-andrews7/ECLIPSE@dev") if you do.

ADD REPLY
0
Entering edit mode

I think they aligned to mm39 as each time the message 'USING MM39 AS THE GENOME' is displayed while running each part of the algorithm. But I am not sure how to check whether proper alignment is occurring. And thank you Dr. Jared, I will definitely try ECLIPSE and get back to you as soon as I get some output over here.

ADD REPLY
0
Entering edit mode

I tried ECLIPSE twice, once without specifying the org.db, then with specifying org.db and got the following error:

Code:

> putative_enhancers <- run_rose(treatment = sample_bam, peaks = peaks,
+                                control = control_bam, org.db = "org.Mm.eg.db")

Output:

Treatment BAM file: treated1.sorted.bam
Control BAM file: treated1_input.sorted.bam
Reading peaks
Rows: 2 Columns: 9                                                               
Column specification 
Delimiter: "\t"
chr (3): X1, X4, X6
dbl (6): X2, X3, X5, X7, X8, X9

Use `spec()` to retrieve the full column specification for this data.
Specify the column types or set `show_col_types = FALSE` to quiet this message.
Stitching peaks with stitch distance of 12500                                    
Dropped 46 peaks on chrY
Calculating normalized signal for 12895 stitched regions
Ranking regions
Classifying enhancers
Using 14933.3727976734 as cutoff for SE classification
12 super enhancers called
Annotating regions
Error in annotate_enhancers(regions, peaks, tx.db = txdb, org.db = org.db,  : 
  TxDb and OrgDb objects must be provided to perform annotation.
In addition: Warning messages:
1: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 38079 out-of-bound ranges located on sequences
  chr1_GL456210v1_random, chr1_GL456211v1_random, chr1_GL456212v1_random,
  chr1_GL456221v1_random, chr1_MU069434v1_random, chr1_GL456239v1_random,
  chr5_JH584296v1_random, chr5_JH584297v1_random, chr5_JH584298v1_random,
  chr5_GL456354v1_random, chr5_JH584299v1_random, chr7_GL456219v1_random,
  chrM, chrUn_GL456359v1, chrUn_GL456360v1, chrUn_GL456366v1,
  chrUn_GL456367v1, chrUn_GL456368v1, chrUn_GL456370v1, chrUn_GL456372v1,
  chrUn_GL456378v1, chrUn_GL456379v1, chrUn_GL456382v1, chrUn_GL456383v1,
  chrUn_GL456385v1, chrUn_GL456387v1, chrUn_GL456389v1, chrUn_GL456390v1,
  chrUn_GL456392v1, chrUn_GL456394v1, chrUn_JH584304v1, chrUn_MU069435v1,
  chrX_GL456233v2_random, chrY_JH584300v1_random, chrY_JH584301v1_random,
  and chrY_JH584303v1_random. Note that ranges located on a sequence
  whose length is unknown (NA) or on a circular sequence are not
  considered out-of-bound (use seqlengths() and isCircular() to get [... truncated]
2: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 71885 out-of-bound ranges located on sequences
  chr1_GL456210v1_random, chr1_GL456211v1_random, chr1_GL456212v1_random,
  chr1_GL456221v1_random, chr1_MU069434v1_random, chr1_GL456239v1_random,
  chr5_JH584296v1_random, chr5_JH584297v1_random, chr5_JH584298v1_random,
  chr5_GL456354v1_random, chr5_JH584299v1_random, chr7_GL456219v1_random,
  chrM, chrUn_GL456359v1, chrUn_GL456360v1, chrUn_GL456366v1,
  chrUn_GL456367v1, chrUn_GL456368v1, chrUn_GL456370v1, chrUn_GL456372v1,
  chrUn_GL456378v1, chrUn_GL456379v1, chrUn_GL456381v1, chrUn_GL456382v1,
  chrUn_GL456383v1, chrUn_GL456385v1, chrUn_GL456387v1, chrUn_GL456389v1,
  chrUn_GL456390v1, chrUn_GL456392v1, chrUn_GL456394v1, chrUn_GL456396v1,
  chrUn_JH584304v1, chrUn_MU069435v1, chrX_GL456233v2_random,
  chrY_JH584300v1_random, chrY_JH584301v1_random, chrY_JH584302v1_random,
  and chrY_JH584303v1_random. Note that ranges located on a sequence
  whose length is unknown (NA) or on a circular sequence are not
  con [... truncated]

I then tried the code for the sample for which the ROSE algorithm worked, and got a similar error for the same:

Treatment BAM file: control.sorted.bam
Control BAM file: control_input.sorted.bam
Reading peaks
Rows: 2 Columns: 9                                                               
Column specification 
Delimiter: "\t"
chr (3): X1, X4, X6
dbl (6): X2, X3, X5, X7, X8, X9

Use `spec()` to retrieve the full column specification for this data.
Specify the column types or set `show_col_types = FALSE` to quiet this message.
Stitching peaks with stitch distance of 12500                                    
Dropped 4 peaks on chrY
Calculating normalized signal for 8768 stitched regions
Ranking regions
Classifying enhancers
Using 4825.28495706356 as cutoff for SE classification
114 super enhancers called
Annotating regions
Error in annotate_enhancers(regions, peaks, tx.db = txdb, org.db = org.db,  : 
  TxDb and OrgDb objects must be provided to perform annotation.
In addition: Warning messages:
1: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 9580 out-of-bound ranges located on sequences
  chr1_GL456210v1_random, chr1_GL456211v1_random, chr1_GL456212v1_random,
  chr1_GL456221v1_random, chr1_MU069434v1_random, chr1_GL456239v1_random,
  chr4_JH584295v1_random, chr5_JH584296v1_random, chr5_JH584297v1_random,
  chr5_JH584298v1_random, chr5_GL456354v1_random, chr7_GL456219v1_random,
  chrM, chrUn_GL456368v1, chrUn_GL456370v1, chrUn_GL456372v1,
  chrUn_GL456378v1, chrUn_GL456379v1, chrUn_GL456382v1, chrUn_GL456383v1,
  chrUn_GL456385v1, chrUn_GL456390v1, chrUn_GL456392v1, chrUn_GL456394v1,
  chrUn_GL456396v1, chrUn_JH584304v1, chrUn_MU069435v1,
  chrX_GL456233v2_random, chrY_JH584300v1_random, chrY_JH584301v1_random,
  chrY_JH584302v1_random, and chrY_JH584303v1_random. Note that ranges
  located on a sequence whose length is unknown (NA) or on a circular
  sequence are not considered out-of-bound (use seqlengths() and
  isCircular() to get the lengths and circularity flags of the underlying
  sequences).  [... truncated]
2: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 30758 out-of-bound ranges located on sequences
  chr1_GL456210v1_random, chr1_GL456211v1_random, chr1_GL456212v1_random,
  chr1_GL456221v1_random, chr1_MU069434v1_random, chr1_GL456239v1_random,
  chr4_JH584295v1_random, chr5_JH584296v1_random, chr5_JH584297v1_random,
  chr5_JH584298v1_random, chr5_GL456354v1_random, chr5_JH584299v1_random,
  chr7_GL456219v1_random, chrM, chrUn_GL456359v1, chrUn_GL456360v1,
  chrUn_GL456366v1, chrUn_GL456368v1, chrUn_GL456370v1, chrUn_GL456372v1,
  chrUn_GL456378v1, chrUn_GL456381v1, chrUn_GL456382v1, chrUn_GL456383v1,
  chrUn_GL456385v1, chrUn_GL456387v1, chrUn_GL456389v1, chrUn_GL456390v1,
  chrUn_GL456392v1, chrUn_GL456396v1, chrUn_JH584304v1, chrUn_MU069435v1,
  chrX_GL456233v2_random, chrY_JH584300v1_random, chrY_JH584301v1_random,
  and chrY_JH584303v1_random. Note that ranges located on a sequence
  whose length is unknown (NA) or on a circular sequence are not
  considered out-of-bound (use seqlengths() and isCircular()  [... truncated]

I am not sure how to input TxDb and OrgDb objects, for OrgDb, I specified OrgDb of Mouse. Should I remove the random, unnamed chromosomes? Or is it something else that might be raising the error?

ADD REPLY
1
Entering edit mode

Oh, thanks for giving it a shot. You need to provide both TxDb and OrgDb objects, e.g.

library(TxDb.Mmusculus.UCSC.mm39.knownGene)
library(org.Mm.eg.db)

txdb <- TxDb.Mmusculus.UCSC.mm39.knownGene
org.db <- org.Mm.eg.db

putative_enhancers <- run_rose(treatment = sample_bam, peaks = peaks,
                               control = control_bam, org.db = org.db, txdb = txdb, 
                               tss.exclusion.distance = 2500,
                               max.unique.gene.tss.overlap = 2,
                               stitch.distance = 12500)

I've also shown the addition of a few parameters here that match what you were trying to do with the original ROSE calls. I have updated the vignette and README to better show how to provide the annotation data.

ADD REPLY
0
Entering edit mode

Thank you Dr. Jared. I used the code and it is working on both samples, but the warning is persistent in both outputs, as shown below in the output for treated sample run. Also, I noticed that only 3 SEs were called when I included TxDb with a cut-off of 12665.775570056, but when I provided annotate = FALSE, 12 SEs were called for the treated with a cut-off of 14933.3727976734, while for control sample, more than 30 SEs were called with a cut-off of 2589.83706832244.

Treatment BAM file: treated1.sorted.bam
Control BAM file: treated1_input.sorted.bam
Reading peaks
Rows: 2 Columns: 9                                                  
Column specification 
Delimiter: "\t"
chr (3): X1, X4, X6
dbl (6): X2, X3, X5, X7, X8, X9

Use `spec()` to retrieve the full column specification for this data.
Specify the column types or set `show_col_types = FALSE` to quiet this message.
Excluding peaks within TSS exclusion distance of 2500               
55343 peaks fully contained within TSS exclusion window and will be excluded from stitching
Stitching peaks with stitch distance of 12500
Dropped 43 peaks on chrY
Unstitching regions overlapping TSS from more than 2 unique genes
Unstitched 4 regions
Calculating normalized signal for 4506 stitched regions
Ranking regions
Classifying enhancers
Using 12665.775570056 as cutoff for SE classification
3 super enhancers called
Annotating regions
Filtering annotations
Adding annotations to metadata
Warning messages:
1: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 3 out-of-bound ranges located on
  sequences chr1_GL456239v1_random, chr4_JH584295v1_random,
  and chr5_JH584296v1_random. Note that ranges located on a
  sequence whose length is unknown (NA) or on a circular
  sequence are not considered out-of-bound (use seqlengths()
  and isCircular() to get the lengths and circularity flags of
  the underlying sequences). You can use trim() to trim these
  ranges. See ?`trim,GenomicRanges-method` for more
  information.
2: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 38079 out-of-bound ranges located on sequences
  chr1_GL456210v1_random, chr1_GL456211v1_random, chr1_GL456212v1_random,
  chr1_GL456221v1_random, chr1_MU069434v1_random, chr1_GL456239v1_random,
  chr5_JH584296v1_random, chr5_JH584297v1_random, chr5_JH584298v1_random,
  chr5_GL456354v1_random, chr5_JH584299v1_random, chr7_GL456219v1_random, chrM,
  chrUn_GL456359v1, chrUn_GL456360v1, chrUn_GL456366v1, chrUn_GL456367v1,
  chrUn_GL456368v1, chrUn_GL456370v1, chrUn_GL456372v1, chrUn_GL456378v1,
  chrUn_GL456379v1, chrUn_GL456382v1, chrUn_GL456383v1, chrUn_GL456385v1,
  chrUn_GL456387v1, chrUn_GL456389v1, chrUn_GL456390v1, chrUn_GL456392v1,
  chrUn_GL456394v1, chrUn_JH584304v1, chrUn_MU069435v1, chrX_GL456233v2_random,
  chrY_JH584300v1_random, chrY_JH584301v1_random, and chrY_JH584303v1_random. Note that
  ranges located on a sequence whose length is unknown (NA) or on a circular sequence are
  not considered out-of-bound (use seqlengths() and isCircular() to get t [... truncated]
3: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 71885 out-of-bound ranges located on sequences
  chr1_GL456210v1_random, chr1_GL456211v1_random, chr1_GL456212v1_random,
  chr1_GL456221v1_random, chr1_MU069434v1_random, chr1_GL456239v1_random,
  chr5_JH584296v1_random, chr5_JH584297v1_random, chr5_JH584298v1_random,
  chr5_GL456354v1_random, chr5_JH584299v1_random, chr7_GL456219v1_random, chrM,
  chrUn_GL456359v1, chrUn_GL456360v1, chrUn_GL456366v1, chrUn_GL456367v1,
  chrUn_GL456368v1, chrUn_GL456370v1, chrUn_GL456372v1, chrUn_GL456378v1,
  chrUn_GL456379v1, chrUn_GL456381v1, chrUn_GL456382v1, chrUn_GL456383v1,
  chrUn_GL456385v1, chrUn_GL456387v1, chrUn_GL456389v1, chrUn_GL456390v1,
  chrUn_GL456392v1, chrUn_GL456394v1, chrUn_GL456396v1, chrUn_JH584304v1,
  chrUn_MU069435v1, chrX_GL456233v2_random, chrY_JH584300v1_random,
  chrY_JH584301v1_random, chrY_JH584302v1_random, and chrY_JH584303v1_random. Note that
  ranges located on a sequence whose length is unknown (NA) or on a circular sequence are
  not consi [... truncated]

In addition, is there a way I can export the results for future use? And is it possible to find any instructions/manual sort of thing with detailed usage for doing various trial analyses with the parameters since this is relatively new for me? I am grateful for the ECLIPSE package Dr. Jared, thank you for developing it as well as suggesting its usage.

ADD REPLY
2
Entering edit mode

The warnings can be safely ignored, they are due to the TxDb containing many seqlevels not found in your peaks object. It is on the list to address, as it is a lot of extra noise.

Also, I noticed that only 3 SEs were called when I included TxDb with a cut-off of 12665.775570056, but when I provided annotate = FALSE, 12 SEs were called for the treated with a cut-off of 14933.3727976734, while for control sample, more than 30 SEs were called with a cut-off of 2589.83706832244.

The annotation does not impact the number of regions returned or the cutoff for SE classification. I would, however, expect differences between your initial run and the command I provided even with annotate = FALSE, as ECLIPSE won't perform the unstitching process that ROSE does by default (though on second thought, perhaps it should, even though the logic behind that process seems dubious). ROSE uses in-built (very old) annotations - ECLIPSE forces the user to provide the annotations via the TxDb, which while slightly less convenient, is much more flexible.

And is it possible to find any instructions/manual sort of thing with detailed usage for doing various trial analyses with the parameters since this is relatively new for me?

The vignette is the most comprehensive workflow we have so far. It includes some rough test code for differential analyses and viz that are not yet included in the package. While it doesn't have exhaustive trials of the various parameters (as we're still adding them), it does mention a few of them and how they impact the results. We will perform more exhaustive comparisons when we get a chance. Again, this is still in development and in an alpha state.

In addition, is there a way I can export the results for future use?

You have a few options. Can save the GRanges output as an RDS via saveRDS or convert to a dataframe and save as a text file, e.g.:

df <- as(putative_enhancers , "data.frame")
write.table(df, "ECLIPSE_results.txt", sep = "\t", row.names = TRUE, col.names = TRUE, quote = FALSE)

# Which can also be read back into a GRanges object.
df2 <- read.table("ECLIPSE_results.txt", header = TRUE, sep = "\t", dec = ".")
gr2 <- makeGRangesFromDataFrame(df2, keep.extra.columns=TRUE)
ADD REPLY
0
Entering edit mode

Thank you so much for your support and suggests Dr. Jared. I will further explore ECLIPSE and open issues on github if I run into any errors. Once again thank you for your time.

Regards, Gayatri.

ADD REPLY

Login before adding your answer.

Traffic: 2421 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6