Hi all. I'm working with paired-end RNA-seq data to assemble the transcriptome of my species of interest. I've just realized that TopHat is under-reporting the number of reads that I actually have and supplied in the input files for the tophat command. Here is a fragment of TopHat's progress report:
[2014-01-22 19:29:06] Beginning TopHat run (v2.0.10)
-----------------------------------------------
[2014-01-22 19:29:06] Checking for Bowtie
Bowtie version: 2.1.0.0
[2014-01-22 19:29:06] Checking for Samtools
Samtools version: 0.1.19.0
[2014-01-22 19:29:06] Checking for Bowtie index files (genome)..
[2014-01-22 19:29:06] Checking for reference FASTA file
Warning: Could not find FASTA file bowtie/tritrypdb_tcongolense.fa
[2014-01-22 19:29:06] Reconstituting reference FASTA file from Bowtie index
Executing: /usr/local/bin/bowtie2-inspect bowtie/tritrypdb_tcongolense > tophat/tmp/tritrypdb_tcongolense.fa
[2014-01-22 19:29:08] Generating SAM header for bowtie/tritrypdb_tcongolense
[2014-01-22 19:29:09] Reading known junctions from GTF file
[2014-01-22 19:29:10] Preparing reads
left reads: min. length=100, max. length=100, 56927836 kept reads (17504 discarded)
right reads: min. length=100, max. length=100, 56919726 kept reads (25614 discarded)
And here are the contents of the "align_summary.txt" file:
Left reads:
Input : 3877069
Mapped : 3102050 (80.0% of input)
of these: 528309 (17.0%) have multiple alignments (2142 have >20)
Right reads:
Input : 3877068
Mapped : 2972012 (76.7% of input)
of these: 495699 (16.7%) have multiple alignments (2114 have >20)
78.3% overall read mapping rate.
Aligned pairs: 2823914
of these: 470594 (16.7%) have multiple alignments
43915 ( 1.6%) are discordant alignments
71.7% concordant pair alignment rate.
As you can see, taking the left reads as an example, TopHat kept 56,927,836 reads for mapping to the transcriptome, but the "align_summary.txt" file says there were only 3,877,069 input reads! Any clue where this difference comes from?
Martin
Just FYI. I can't help but notice the
Warning: Could not find FASTA file bowtie/tritrypdb_tcongolense.fa
in your post. I have had the same warning before. Try renaming your genome file and annotation file so they share the base name of your Bowtie index (e.g. Apple.gtf, Apple.fa, and Apple.1.bt, Apple.2.bt....) and put all of them in the same folder. That could skip the step of
Reconstituting reference FASTA file from Bowtie index
but might not have anything to do with your problem.

You might want to rename this question: your wording makes it sound as if align_summary.txt is incorrectly counting the reads in the BAMs, while jgranke reports the much more serious problem that multi-threading is simply making a bunch of reads disappear.
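
To put a number on the gap, here is a minimal Python sketch that pulls the "kept reads" counts from the progress report and the "Input" counts from align_summary.txt, then reports how many reads went missing on each side. It assumes both outputs keep the layout pasted in the question; the function names (`kept_reads`, `summary_inputs`) are made up for illustration and are not part of TopHat.

```python
import re

# Excerpts copied from the question; in practice read these from the
# TopHat log output and align_summary.txt (file locations are up to you).
LOG = """\
left reads: min. length=100, max. length=100, 56927836 kept reads (17504 discarded)
right reads: min. length=100, max. length=100, 56919726 kept reads (25614 discarded)
"""

SUMMARY = """\
Left reads:
Input : 3877069
Mapped : 3102050 (80.0% of input)
Right reads:
Input : 3877068
Mapped : 2972012 (76.7% of input)
"""


def kept_reads(log_text):
    """Parse 'N kept reads' for the left and right mates from the log text."""
    pattern = re.compile(
        r"^\s*(left|right) reads:.*?(\d+) kept reads", re.MULTILINE
    )
    return {side: int(n) for side, n in pattern.findall(log_text)}


def summary_inputs(summary_text):
    """Parse the 'Input' count under each 'Left reads:'/'Right reads:' header."""
    counts = {}
    side = None
    for line in summary_text.splitlines():
        header = re.match(r"\s*(Left|Right) reads:", line)
        if header:
            side = header.group(1).lower()
            continue
        found = re.match(r"\s*Input\s*:\s*(\d+)", line)
        if found and side:
            counts[side] = int(found.group(1))
    return counts


kept = kept_reads(LOG)
inputs = summary_inputs(SUMMARY)
# Reads that were kept after preprocessing but never show up as alignment input.
missing = {side: kept[side] - inputs[side] for side in kept}

for side in ("left", "right"):
    share = inputs[side] / kept[side]
    print(f"{side}: kept={kept[side]} input={inputs[side]} "
          f"missing={missing[side]} ({share:.1%} of kept reads reached alignment)")
```

If the two counts agree on a single-threaded run but diverge with multiple threads, that would point towards the multi-threading problem mentioned above rather than a counting error in align_summary.txt.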