Question

Tophat - Understated Number Of Reads In The "Align_Summary.Txt" File

10.8 years ago

wstfljs ▴ 100

Hi all. I'm working with paired-end RNA-seq data to assemble the transcriptome of my species of interest. I've just realized that TopHat reports far fewer reads than I actually have and supplied in the input files for the tophat command. Here is a fragment of TopHat's progress report:

[2014-01-22 19:29:06] Beginning TopHat run (v2.0.10)
-----------------------------------------------
[2014-01-22 19:29:06] Checking for Bowtie
          Bowtie version:     2.1.0.0
[2014-01-22 19:29:06] Checking for Samtools
        Samtools version:     0.1.19.0
[2014-01-22 19:29:06] Checking for Bowtie index files (genome)..
[2014-01-22 19:29:06] Checking for reference FASTA file
    Warning: Could not find FASTA file bowtie/tritrypdb_tcongolense.fa
[2014-01-22 19:29:06] Reconstituting reference FASTA file from Bowtie index
  Executing: /usr/local/bin/bowtie2-inspect bowtie/tritrypdb_tcongolense > tophat/tmp/tritrypdb_tcongolense.fa
[2014-01-22 19:29:08] Generating SAM header for bowtie/tritrypdb_tcongolense
[2014-01-22 19:29:09] Reading known junctions from GTF file
[2014-01-22 19:29:10] Preparing reads
     left reads: min. length=100, max. length=100, 56927836 kept reads (17504 discarded)
    right reads: min. length=100, max. length=100, 56919726 kept reads (25614 discarded)

And here is the content of "align_summary.txt" file:

Left reads:
          Input     :   3877069
           Mapped   :   3102050 (80.0% of input)
            of these:    528309 (17.0%) have multiple alignments (2142 have >20)
Right reads:
          Input     :   3877068
           Mapped   :   2972012 (76.7% of input)
            of these:    495699 (16.7%) have multiple alignments (2114 have >20)
78.3% overall read mapping rate.

Aligned pairs:   2823914
     of these:    470594 (16.7%) have multiple alignments
                   43915 ( 1.6%) are discordant alignments
71.7% concordant pair alignment rate.

As you can see, for the left reads, for example, TopHat kept 56,927,836 reads for mapping, but the "align_summary.txt" file says that the input was only 3,877,069 reads! Any clue where this difference comes from?

Martin

tophat2 mapping paired-end rna-seq • 6.5k views

ADD COMMENT • link updated 8.3 years ago by jgranek ▴ 10 • written 10.8 years ago by wstfljs ▴ 100

Just FYI: I can't help but notice the warning "Could not find FASTA file bowtie/tritrypdb_tcongolense.fa" in your post. I have had the same warning before. Try giving your genome file and annotation file the same base name as your Bowtie index (e.g. Apple.gtf, Apple.fa, and Apple.1.bt2, Apple.2.bt2, ...) and putting all of them in the same folder. That should let TopHat skip the "Reconstituting reference FASTA file from Bowtie index" step, though it might not have anything to do with your problem.
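A quick way to see which files are missing is to list the names expected next to the index prefix. This is only a sketch: the ".fa" name is taken from the warning in the log above, the ".bt2" names are the standard small-index files written by bowtie2-build (large indexes use ".bt2l" instead), and expected_reference_files is a made-up helper name.

```python
import os


def expected_reference_files(index_prefix):
    """List the files expected beside a Bowtie 2 index prefix.

    Assumption: TopHat looks for '<prefix>.fa' (as in the warning above)
    and the index is a small Bowtie 2 index with '.bt2' files.
    """
    names = [index_prefix + ".fa"]
    # Forward index files written by bowtie2-build
    names += ["%s.%d.bt2" % (index_prefix, i) for i in (1, 2, 3, 4)]
    # Reverse index files written by bowtie2-build
    names += ["%s.rev.%d.bt2" % (index_prefix, i) for i in (1, 2)]
    return names


if __name__ == "__main__":
    # Prefix copied from the log in the question
    for name in expected_reference_files("bowtie/tritrypdb_tcongolense"):
        status = "ok" if os.path.exists(name) else "MISSING"
        print(status, name)
```

If only the ".fa" line shows up as MISSING, copying or symlinking the genome FASTA to that name should be enough to avoid the reconstitution step.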

ADD REPLY • link 8.6 years ago by CandiceChuDVM ★ 2.5k

You might want to rename this question. Your wording makes it sound as if align_summary.txt is incorrectly counting the reads in the BAMs, while jgranek reports the much more serious problem that multi-threading is simply making a bunch of reads disappear.

ADD REPLY • link 8.3 years ago by swbarnes2 14k

score 1 · Answer 1 · 2014-02-13

Sorry for posting a self-reply, but I might have stumbled upon something that could explain the understated number of input reads in the "align_summary.txt" file. It might have something to do with the "-p" option, which sets the number of threads. If I don't specify this parameter (the default is 1), I get the correct number of reads in the "align_summary.txt" file. Here is an example for a different dataset from the one above.

[2014-01-21 22:46:31] Beginning TopHat run (v2.0.10)
-----------------------------------------------
[2014-01-21 22:46:31] Checking for Bowtie
          Bowtie version:     2.1.0.0
[2014-01-21 22:46:31] Checking for Samtools
        Samtools version:     0.1.19.0
[2014-01-21 22:46:31] Checking for Bowtie index files (genome)..
[2014-01-21 22:46:31] Checking for reference FASTA file
    Warning: Could not find FASTA file bowtie/plasmodb_pberghei.fa
[2014-01-21 22:46:31] Reconstituting reference FASTA file from Bowtie index
  Executing: /usr/local/bin/bowtie2-inspect bowtie/plasmodb_pberghei > tophat/tmp/plasmodb_pberghei.fa
[2014-01-21 22:46:32] Generating SAM header for bowtie/plasmodb_pberghei
[2014-01-21 22:46:33] Reading known junctions from GTF file
[2014-01-21 22:46:34] Preparing reads
     left reads: min. length=100, max. length=100, 71177666 kept reads (2201 discarded)
    right reads: min. length=100, max. length=100, 71154133 kept reads (25734 discarded)

align_summary.txt

Left reads:
          Input     :  71179867
           Mapped   :  65140290 (91.5% of input)
            of these:   2827816 ( 4.3%) have multiple alignments (3354 have >20)
Right reads:
          Input     :  71179867
           Mapped   :  64591420 (90.7% of input)
            of these:   2852565 ( 4.4%) have multiple alignments (3363 have >20)
91.1% overall read mapping rate.

Aligned pairs:  63204040
     of these:   2740001 ( 4.3%) have multiple alignments
                  190295 ( 0.3%) are discordant alignments
88.5% concordant pair alignment rate.

Now, the question is WHY running the tophat command with multiple threads causes the software to "lose" the majority of the input reads.
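In the single-threaded run above, the "Input" count equals kept plus discarded reads from the "Preparing reads" lines (71177666 + 2201 = 71179867), so that sum can serve as a sanity check. Below is a rough sketch of such a check, assuming the log and summary formats shown in this thread for paired-end runs; the function names are my own, not part of TopHat.

```python
import re


def prepared_totals(log_text):
    """Return kept + discarded per side from TopHat's 'Preparing reads' lines."""
    pattern = re.compile(
        r"(left|right) reads:.*?(\d+) kept reads \((\d+) discarded\)")
    return {m.group(1): int(m.group(2)) + int(m.group(3))
            for m in pattern.finditer(log_text)}


def summary_inputs(summary_text):
    """Return the 'Input' count per side from a paired-end align_summary.txt."""
    inputs = {}
    side = None
    for line in summary_text.splitlines():
        header = re.match(r"\s*(Left|Right) reads:", line)
        if header:
            side = header.group(1).lower()
            continue
        count = re.match(r"\s*Input\s*:\s*(\d+)", line)
        if count and side:
            inputs[side] = int(count.group(1))
    return inputs


def missing_reads(log_text, summary_text):
    """Return, per side, how many prepared reads the summary does not count."""
    prepared = prepared_totals(log_text)
    reported = summary_inputs(summary_text)
    return {side: prepared[side] - reported[side]
            for side in prepared if side in reported}
```

For the first dataset in the question this gives 56927836 + 17504 - 3877069 = 53068271 left reads unaccounted for, while the single-threaded run gives 0 for both sides.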

score 1 · Answer 2 · 2016-08-11

This problem still persists in TopHat v2.1.0 (installed from jessie-backports)!!

With "--num-threads 1" align_summary.txt says:

Reads:
          Input     :   1000000
           Mapped   :    972457 (97.2% of input)
97.2% overall read mapping rate.

With "--num-threads 4" align_summary.txt says:

Reads:
          Input     :    250728
           Mapped   :    243870 (97.3% of input)
97.3% overall read mapping rate.

I have checked "samtools idxstats" for accepted_hits.bam and unmapped.bam to confirm that reads are actually disappearing, not just being overlooked when align_summary.txt is generated.
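For anyone who wants to repeat that check, here is a minimal sketch that totals the output of "samtools idxstats". It assumes the usual four tab-separated columns (reference name, length, mapped count, unmapped count); count_idxstats is just an illustrative name. Note that idxstats counts alignment records, so multi-mapped reads can push the mapped total in accepted_hits.bam above the number of distinct reads.

```python
def count_idxstats(idxstats_text):
    """Sum the mapped and unmapped columns of `samtools idxstats` output."""
    mapped = 0
    unmapped = 0
    for line in idxstats_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        mapped += int(fields[2])
        unmapped += int(fields[3])
    return mapped, unmapped


if __name__ == "__main__":
    import subprocess
    import sys

    # Usage: python count_idxstats.py accepted_hits.bam unmapped.bam
    for bam in sys.argv[1:]:
        out = subprocess.run(["samtools", "idxstats", bam],
                             check=True, capture_output=True, text=True).stdout
        print(bam, count_idxstats(out))
```

With the numbers quoted above, 250728 / 1000000 is about 0.25, so roughly one input read in four survives when four threads are used.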

score 0 · Answer 3 · 2016-04-08

8.6 years ago

FlorentC • 0

Hi all,

I have the same problem with TopHat v2.0.13, while some colleagues do not have this problem with the same version.

Does anyone have an explanation for that?

ADD COMMENT • link 8.6 years ago by FlorentC • 0