Question

Coverage-search vs. no coverage-search in running Tophat

8.4 years ago

tunl ▴ 90

I have two questions regarding running Tophat:

(1) At the step “Searching for junctions via segment mapping”, it takes a really long time, and I got the following message:

“Coverage-search algorithm is turned on, making this step very slow
    Please try running TopHat again with the option (--no-coverage-search) if this step takes too much time or memory.”

I’d like to know exactly what the differences between “coverage-search” and “no coverage-search” are. If I use the “--no-coverage-search” option, what impact might it have on the Tophat results and accuracy?

(2) I use the -G option to provide a gene model annotation GTF file (genes.gtf). I notice that each Tophat run builds a Bowtie index for genes.gtf on the fly:

“Building Bowtie index from genes.fa”

This step takes two hours (I use the main annotation gtf file for human from GENCODE).

I have 3 conditions and each condition has 3-4 pairs of fastq reads, so I have 10 Tophat runs in my script. This “Building Bowtie index from genes.fa” step was executed 10 times, even though all runs use the same genes.gtf file.

So I am wondering if there is a way to let Tophat re-use the bowtie index for genes.gtf produced from the first run in the subsequent runs?

I’d greatly appreciate any ideas and suggestions.

Thank you very much!

RNA-Seq Tophat • 5.0k views

updated 8.4 years ago by Devon Ryan 104k • written 8.4 years ago by tunl ▴ 90

score 1 · Answer 1 · 2016-07-17

The coverage search option is described thusly in the tophat2 manual:

TopHat generates its database of possible splice junctions from two sources of evidence. The first and strongest source of evidence for a splice junction is when two segments from the same read (for reads of at least 45bp) are mapped at a certain distance on the same genomic sequence or when an internal segment fails to map - again suggesting that such reads are spanning multiple exons. With this approach, "GT-AG", "GC-AG" and "AT-AC" introns will be found ab initio. The second source is pairings of "coverage islands", which are distinct regions of piled up reads in the initial mapping. Neighboring islands are often spliced together in the transcriptome, so TopHat looks for ways to join these with an intron. We only suggest users use this second option (--coverage-search) for short reads (< 45bp) and with a small number of reads (<= 10 million). This latter option will only report alignments across "GT-AG" introns.

If you have short single-end reads and really need to find novel splice sites then you probably need the --coverage-search option. Otherwise skip it.
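As a rough illustration of that rule of thumb, here is a minimal Python sketch that picks the flag from read length and read count. The function name `coverage_search_flag` is hypothetical, and the thresholds (< 45bp, <= 10 million reads) are taken only from the manual excerpt quoted above, so treat it as a starting point rather than a definitive rule.

```python
def coverage_search_flag(read_length: int, n_reads: int) -> str:
    """Suggest a TopHat coverage-search flag (hypothetical helper).

    Thresholds follow the quoted manual text: coverage search is only
    suggested for short reads (< 45bp) and small datasets (<= 10 million reads).
    """
    if read_length < 45 and n_reads <= 10_000_000:
        return "--coverage-search"
    return "--no-coverage-search"


# Example: 100bp reads, 30 million reads per sample
print(coverage_search_flag(100, 30_000_000))
```

Whether novel junctions matter for your analysis is a separate judgment call that this helper does not capture.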

Regarding the reindexing of the transcriptome on every run, you can instead pass --transcriptome-index some_directory so that the transcriptome index is built once and reused by later runs. An example is at the bottom of the tophat manual webpage.
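A sketch of how the 10 runs could be scripted so the index is built only once is shown here in Python. It only assembles and prints the command lines; it does not run them. The sample names, file paths, the `hg19` genome index name, and the `transcriptome_data/known` prefix are all placeholders, and the helper `build_tophat_commands` is hypothetical. The assumption, based on my reading of the manual, is that the first run is given both -G and --transcriptome-index and later runs only need --transcriptome-index; please check this against the manual for your TopHat version.

```python
import shlex


def build_tophat_commands(samples, gtf, index_prefix, genome_index):
    """Assemble one tophat command per sample (hypothetical helper).

    The first command passes -G so the transcriptome index is created at
    index_prefix; the remaining commands reuse that index and omit -G.
    """
    commands = []
    for i, (name, (reads_1, reads_2)) in enumerate(samples.items()):
        cmd = ["tophat", "-o", f"out_{name}", "--no-coverage-search"]
        if i == 0:
            # Only the first run needs the GTF to build the index.
            cmd += ["-G", gtf]
        cmd += [f"--transcriptome-index={index_prefix}", genome_index, reads_1, reads_2]
        commands.append(cmd)
    return commands


# Placeholder sample sheet: sample name -> (mate 1 fastq, mate 2 fastq)
samples = {
    "cond1_rep1": ("cond1_rep1_1.fq.gz", "cond1_rep1_2.fq.gz"),
    "cond1_rep2": ("cond1_rep2_1.fq.gz", "cond1_rep2_2.fq.gz"),
}

for cmd in build_tophat_commands(samples, "genes.gtf", "transcriptome_data/known", "hg19"):
    print(shlex.join(cmd))
```

The first command has to finish before the others start, since they depend on the index files it writes.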