Hey Guys,
Using simulated data, I have compared two recent methods for identifying circRNAs: CIRI and CIRCexplorer (TopHat-Fusion version). Both behave in ways I did not expect, and I would like to understand why.
First, I generated simulated data from previously identified structures in circbase.org at depths of 2, 10 and 25 (100 bp reads). I then ran both methods on these datasets to assess their performance. The number of circRNAs identified by each method is shown below:
Data          CIRI    CIRCexplorer
depth - 2     3195    725
depth - 10    4397    121
depth - 25    4400    109
As you can see, CIRCexplorer appears less sensitive but very specific. However, the behavior that concerns me is that fewer circRNAs were identified at higher coverage. Is there a reason for this? The parameters I used are those shown at https://github.com/YangLab/CIRCexplorer. Please advise on suitable parameters that would rectify this behavior.
CIRI, however, is not perfect either. It currently appears impossible to run CIRI on real data with over 300 million reads. It halts every time (after about 10 hrs), apparently due to high memory consumption. Does anyone know a way round this? I must add that it is also less specific, identifying circRNAs that are not expected in the simulated dataset.
Please find simulated data here: https://www.dropbox.com/s/r5ms1zymngk0oyy/simulated_data.tar.gz?dl=0
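For anyone who wants to build a comparable toy dataset rather than download the archive, here is a minimal Python sketch of how reads could be sampled from a circular transcript at a chosen depth. It is not necessarily how the linked data was produced. The function names, the uniform start positions, and the error-free single-end reads are all assumptions made for illustration.

```python
import random


def simulate_circ_reads(circ_seq, depth, read_len=100, seed=0):
    """Sample error-free single-end reads from a circular sequence.

    Returns a list of (start, read) tuples. Assumes uniform start
    positions and no sequencing errors (illustrative only).
    """
    rng = random.Random(seed)
    n = len(circ_seq)
    # Number of reads so that total read bases ~= depth * circle length.
    n_reads = max(1, round(depth * n / read_len))
    # Repeat the circle so a read can run past the back-splice junction.
    unrolled = circ_seq * (read_len // n + 2)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(n)
        reads.append((start, unrolled[start:start + read_len]))
    return reads


def spans_junction(start, circ_len, read_len=100):
    """True if a read starting at `start` crosses the back-splice junction."""
    return start + read_len > circ_len


toy_circle = "ACGTTGCA" * 40  # hypothetical 320 nt circle
reads = simulate_circ_reads(toy_circle, depth=10)
n_junction = sum(spans_junction(s, len(toy_circle)) for s, _ in reads)
print(len(reads), "reads,", n_junction, "spanning the junction")
```

Only the junction-spanning reads carry direct evidence of circularity, so at low depth a short circle may end up with very few or none of them.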
Note: Other methods were also assessed (including methods using the STAR aligner). My question is specific to these two methods because of the weird behaviors mentioned above.
I also tested CIRI, CIRCexplorer (STAR mapping), and find_circ about half a year ago with real data (Jeck et al. 2013). I would have to go back to my notes to provide more insight, but I can confirm the memory problem with CIRI:
I tried giving it ~150 GB of RAM, and it still crashed after running for a couple of days; I do not have a way round it. According to the authors, whom I contacted, they developed CIRI using RNA-seq datasets not enriched for circRNAs, hence the reasonable memory requirements mentioned in their paper. It seems that with circRNA-enriched (RNase R-treated) samples, the memory consumption shoots up.
I agree with you. The CIRI authors mention memory consumption of ~20% of the size of the SAM file. I find that this is not the case: for a SAM file of about 135 GB, my run reached up to 70 GB of RAM and was halted overnight on a node with 100 GB of RAM.
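To put numbers on that gap, here is a small Python sketch of the arithmetic. The 20% rule of thumb, the 135 GB SAM size and the 70 GB peak are the figures quoted above. The helper name is hypothetical.

```python
def expected_ram_gb(sam_size_gb, fraction=0.20):
    """RAM predicted by the ~20%-of-SAM-size rule of thumb."""
    return sam_size_gb * fraction


sam_gb = 135       # SAM file size reported above
observed_gb = 70   # peak RAM reported above

expected_gb = expected_ram_gb(sam_gb)   # about 27 GB
observed_fraction = observed_gb / sam_gb  # about 0.52 of the SAM size

print(f"expected ~{expected_gb:.0f} GB, observed {observed_gb} GB "
      f"({observed_fraction:.0%} of the SAM size)")
```

So the observed peak is roughly 2.6 times what the rule of thumb predicts for this file.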
It is a good tool, but the memory consumption makes it impossible to use. Quick question: did you ever use CIRCexplorer (TopHat-Fusion version)? I am curious to know why sensitivity drops with increased coverage. Thank you.
No. I tried it but kept running into mapping errors. In the meantime, the developers of CIRCexplorer added an option for STAR and I went with that. Also, I did not test sensitivity in that manner. I was more interested in whether the tools could find the experimentally validated circRNAs (they did). That said, CIRCexplorer found a reasonably large number of circRNAs, comparable to find_circ.
Did you simulate the data yourself, or did you take simulated data from a publication?
Hello Geek_y, I generated the simulated data myself from 4500 published circRNAs. CIRI found most of the structures, with up to 125 false positives (structures not expected). I have linked some of the FASTQ data for anyone to re-analyse, and I can send more privately.
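As a rough way to read the counts in the table at the top of the thread, the Python sketch below turns them into an upper bound on sensitivity. It assumes that all 4500 circRNAs were simulated at every depth, and it treats every call as a true positive, so the real sensitivity can only be lower. The names used are illustrative.

```python
N_SIMULATED = 4500  # assumed to be the same set at every depth

# circRNAs reported by each tool at each depth (from the table above)
calls = {
    "CIRI": {2: 3195, 10: 4397, 25: 4400},
    "CIRCexplorer": {2: 725, 10: 121, 25: 109},
}


def max_sensitivity(n_called, n_expected):
    """Upper bound on sensitivity, treating every call as a true positive."""
    return min(n_called, n_expected) / n_expected


for tool, by_depth in calls.items():
    for depth, n_called in by_depth.items():
        bound = max_sensitivity(n_called, N_SIMULATED)
        print(f"{tool:<13} depth {depth:>2}: at most {bound:.1%}")
```

A proper comparison would match each reported back-splice junction against the simulated coordinates, which also gives the false positive count per depth.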
In my own analysis of simulated data, depth had no big influence on the final CIRCexplorer results. However, if the depth is too low, the bias can be large. In addition, I think you should use experimentally validated circRNAs for the simulation (for example, circRNAs detected in RNase R-treated RNA-seq). CIRCexplorer is being developed to steadily improve sensitivity while maintaining high specificity.