TopHat - Understated Number Of Reads In The "align_summary.txt" File
10.8 years ago
wstfljs ▴ 100

Hi all. I'm working with paired-end RNA-seq data to assemble the transcriptome of my species of interest. I've just realized that TopHat is understating the number of reads that I actually supplied in the input files for the tophat command. Here is a fragment of TopHat's progress report:

[2014-01-22 19:29:06] Beginning TopHat run (v2.0.10)
-----------------------------------------------
[2014-01-22 19:29:06] Checking for Bowtie
          Bowtie version:     2.1.0.0
[2014-01-22 19:29:06] Checking for Samtools
        Samtools version:     0.1.19.0
[2014-01-22 19:29:06] Checking for Bowtie index files (genome)..
[2014-01-22 19:29:06] Checking for reference FASTA file
    Warning: Could not find FASTA file bowtie/tritrypdb_tcongolense.fa
[2014-01-22 19:29:06] Reconstituting reference FASTA file from Bowtie index
  Executing: /usr/local/bin/bowtie2-inspect bowtie/tritrypdb_tcongolense > tophat/tmp/tritrypdb_tcongolense.fa
[2014-01-22 19:29:08] Generating SAM header for bowtie/tritrypdb_tcongolense
[2014-01-22 19:29:09] Reading known junctions from GTF file
[2014-01-22 19:29:10] Preparing reads
     left reads: min. length=100, max. length=100, 56927836 kept reads (17504 discarded)
    right reads: min. length=100, max. length=100, 56919726 kept reads (25614 discarded)

And here is the content of "align_summary.txt" file:

Left reads:
          Input     :   3877069
           Mapped   :   3102050 (80.0% of input)
            of these:    528309 (17.0%) have multiple alignments (2142 have >20)
Right reads:
          Input     :   3877068
           Mapped   :   2972012 (76.7% of input)
            of these:    495699 (16.7%) have multiple alignments (2114 have >20)
78.3% overall read mapping rate.

Aligned pairs:   2823914
     of these:    470594 (16.7%) have multiple alignments
                   43915 ( 1.6%) are discordant alignments
71.7% concordant pair alignment rate.

As you can see, taking the left reads as an example, TopHat kept 56,927,836 reads for mapping to the transcriptome, but the "align_summary.txt" file says that there were only 3,877,069 input reads! Any clue where this difference comes from?
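The comparison described above can be scripted. This is a minimal sketch, assuming the log and summary keep the layout shown in the excerpts; the function names `prep_counts` and `summary_inputs` are made up for illustration and are not part of TopHat.

```python
import re

# Patterns modelled on the two excerpts quoted above (assumed stable layout).
PREP_RE = re.compile(r"(left|right) reads:.*?(\d+) kept reads \((\d+) discarded\)")
INPUT_RE = re.compile(r"(Left|Right) reads:\s*Input\s*:\s*(\d+)")

def prep_counts(log_text):
    """Return {'left': (kept, discarded), 'right': (...)} from the progress report."""
    return {side: (int(kept), int(disc))
            for side, kept, disc in PREP_RE.findall(log_text)}

def summary_inputs(summary_text):
    """Return {'left': input_count, 'right': input_count} from align_summary.txt."""
    return {side.lower(): int(n) for side, n in INPUT_RE.findall(summary_text)}

log = """\
     left reads: min. length=100, max. length=100, 56927836 kept reads (17504 discarded)
    right reads: min. length=100, max. length=100, 56919726 kept reads (25614 discarded)
"""
summary = """\
Left reads:
          Input     :   3877069
           Mapped   :   3102050 (80.0% of input)
Right reads:
          Input     :   3877068
           Mapped   :   2972012 (76.7% of input)
"""

prepared = prep_counts(log)
reported = summary_inputs(summary)
for side, (kept, discarded) in prepared.items():
    share = reported[side] / kept
    print(f"{side}: kept={kept} reported_input={reported[side]} ({share:.1%} of kept)")
```

For the numbers in this post, the reported input is only a small fraction (roughly 7%) of the reads that were kept.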

Martin

tophat2 mapping paired-end rna-seq • 6.5k views

Just FYI: I can't help but notice the "Warning: Could not find FASTA file bowtie/tritrypdb_tcongolense.fa" in your post. I have had the same warning before. Try giving your genome file and annotation file the same base name as your Bowtie index (e.g. Apple.gtf, Apple.fa, and Apple.1.bt2, Apple.2.bt2, ...) and put all of them into the same folder. That should let TopHat skip the "Reconstituting reference FASTA file from Bowtie index" step, though it might not have anything to do with your problem.
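A small check along these lines can tell you in advance whether the warning will appear. It is only a sketch: it assumes, based on the warning text in the log, that TopHat looks for the index base name plus ".fa". The helper name `fasta_next_to_index` is hypothetical.

```python
from pathlib import Path

def fasta_next_to_index(index_basename):
    """Return the FASTA path named in TopHat's warning and whether it exists.

    Assumption: the expected file is simply <index_basename>.fa, as the
    warning in the log above suggests.
    """
    fasta = Path(str(index_basename) + ".fa")
    return fasta, fasta.is_file()

# Index base name taken from the log in the question.
fasta, found = fasta_next_to_index("bowtie/tritrypdb_tcongolense")
status = "found" if found else "missing (TopHat would rebuild it with bowtie2-inspect)"
print(f"{fasta}: {status}")
```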


You might want to rename this question: your wording makes it sound as if align_summary.txt is incorrectly counting the reads in the BAMs, while jgranek reports the much more serious problem that multi-threading simply makes a bunch of reads disappear.

10.8 years ago
wstfljs ▴ 100

Sorry for posting a self-reply, but I might have stumbled upon something that could explain the understated number of input reads in the "align_summary.txt" file. It might have something to do with the "-p" option, which sets the number of threads. If I don't specify this parameter (the default is 1), I get the correct number of reads in the "align_summary.txt" file. Here is an example from a different dataset than the one above:

[2014-01-21 22:46:31] Beginning TopHat run (v2.0.10)
-----------------------------------------------
[2014-01-21 22:46:31] Checking for Bowtie
          Bowtie version:     2.1.0.0
[2014-01-21 22:46:31] Checking for Samtools
        Samtools version:     0.1.19.0
[2014-01-21 22:46:31] Checking for Bowtie index files (genome)..
[2014-01-21 22:46:31] Checking for reference FASTA file
    Warning: Could not find FASTA file bowtie/plasmodb_pberghei.fa
[2014-01-21 22:46:31] Reconstituting reference FASTA file from Bowtie index
  Executing: /usr/local/bin/bowtie2-inspect bowtie/plasmodb_pberghei > tophat/tmp/plasmodb_pberghei.fa
[2014-01-21 22:46:32] Generating SAM header for bowtie/plasmodb_pberghei
[2014-01-21 22:46:33] Reading known junctions from GTF file
[2014-01-21 22:46:34] Preparing reads
     left reads: min. length=100, max. length=100, 71177666 kept reads (2201 discarded)
    right reads: min. length=100, max. length=100, 71154133 kept reads (25734 discarded)

align_summary.txt:

Left reads:
          Input     :  71179867
           Mapped   :  65140290 (91.5% of input)
            of these:   2827816 ( 4.3%) have multiple alignments (3354 have >20)
Right reads:
          Input     :  71179867
           Mapped   :  64591420 (90.7% of input)
            of these:   2852565 ( 4.4%) have multiple alignments (3363 have >20)
91.1% overall read mapping rate.

Aligned pairs:  63204040
     of these:   2740001 ( 4.3%) have multiple alignments
                  190295 ( 0.3%) are discordant alignments
88.5% concordant pair alignment rate.

Now, the question is WHY using multiple threads while running tophat command causes the software to "lose" the majority of the input reads?
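In the single-threaded run above, the "Input" value equals kept plus discarded reads (71177666 + 2201 = 71179867). The sketch below uses that relationship as a sanity check. It is inferred from the numbers in this thread, not taken from the TopHat documentation, and `missing_reads` is a hypothetical helper name. The first run is presumed to have used "-p" greater than 1, since its command line is not shown.

```python
def missing_reads(kept, discarded, reported_input):
    """Reads seen during 'Preparing reads' that are absent from the summary's Input.

    Assumption: Input should equal kept + discarded, as observed in the
    single-threaded run quoted in this thread.
    """
    return (kept + discarded) - reported_input

# Single-threaded run (left reads): nothing is missing.
single = missing_reads(71177666, 2201, 71179867)

# Run from the original question (left reads), presumably multi-threaded.
multi = missing_reads(56927836, 17504, 3877069)

print(single, multi)  # 0 53068271
```

A non-zero result for a run flags that reads were lost somewhere between read preparation and the final summary.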


Hi, I'm running into the same problem with tophat v2.0.13. Did you find a solution for this other than running everything with -p 1?


I have this problem too, and the weirdest part is that not all datasets are affected by the thread count when running tophat2: some still get good alignment rates even with 8 or 16 threads, while others do not. I also checked whether it depends on the alignment mode, but both paired-end and single-end data show the same problem. So I wonder whether this is a random event?

8.3 years ago
jgranek ▴ 10

This problem still persists in TopHat v2.1.0 (installed from jessie-backports)!!

With "--num-threads 1" align_summary.txt says:

Reads:
          Input     :   1000000
           Mapped   :    972457 (97.2% of input)
97.2% overall read mapping rate.

With "--num-threads 4" align_summary.txt says:

Reads:
          Input     :    250728
           Mapped   :    243870 (97.3% of input)
97.3% overall read mapping rate.
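As a quick arithmetic check on the two summaries above, the fraction of reads that survives with 4 threads can be compared with 1/threads. This is a single data point and not a confirmed explanation, and other posters in this thread report that the effect is inconsistent. `retained_fraction` is a name invented for this sketch.

```python
def retained_fraction(input_single, input_multi):
    """Share of the single-threaded Input that the multi-threaded run reports."""
    return input_multi / input_single

threads = 4
frac = retained_fraction(1_000_000, 250_728)
print(f"retained {frac:.4f} of the reads; 1/threads = {1 / threads:.4f}")
# retained 0.2507 of the reads; 1/threads = 0.2500
```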

I have checked "samtools idxstats" for accepted_hits.bam and unmapped.bam to confirm that reads are really disappearing, not just being overlooked when align_summary.txt is generated.
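One way to total the "samtools idxstats" output is sketched below. It relies on the documented tab-separated columns (reference name, sequence length, mapped count, unmapped count). The example input is hypothetical and only illustrates the format. Note that idxstats may count a read more than once if it has several alignments, so the total for accepted_hits.bam can exceed the number of distinct reads.

```python
def total_records(idxstats_text):
    """Sum the mapped and unmapped columns over all lines of idxstats output."""
    mapped = unmapped = 0
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        _name, _length, n_mapped, n_unmapped = line.split("\t")
        mapped += int(n_mapped)
        unmapped += int(n_unmapped)
    return mapped, unmapped

# Hypothetical idxstats output, for illustration only.
example = "chr1\t1000\t120\t3\nchr2\t500\t30\t0\n*\t0\t0\t7\n"
print(total_records(example))  # (150, 10)
```

The text would typically come from running "samtools idxstats accepted_hits.bam" (and the same for unmapped.bam) and capturing the output; comparing the totals between a "-p 1" run and a multi-threaded run shows whether records are actually missing from the BAM files.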

8.6 years ago
FlorentC • 0

Hi all,

I have the same problem with TopHat v2.0.13, while some colleagues do not see it with the same version.

Does anyone have an explanation for that?
