I used bam2fastx to convert the unmapped.bam output from TopHat into new fastq files. Then I re-ran Tophat with those fastq files against a bacterial genome and got the following output:
Left reads:
Input: 5476828
Mapped: 656 ( 0.0% of input)
of these: 5 ( 0.8%) have multiple alignments (0 have >20)
Right reads:
Input: 5477475
Mapped: 1303 ( 0.0% of input)
of these: 1714 (131.5%) have multiple alignments (0 have >20)
0.0% overall read alignment rate.
Aligned pairs: 62
of these: 0 ( 0.0%) have multiple alignments
and: 0 ( 0.0%) are discordant alignments
Two things seem off to me 1) a major discrepancy between the number of mapped left versus right reads and 2) having more reads with multiple alignments than the number of mapped reads in the first place. Is that expected? I think it's a sign that something somewhere went wrong. I've read elsewhere that bam2fastx doesn't work well with paired end data because unmapped.bam is missing some samtools flags. Could that be the problem? If so, is there a workaround?
Here were the commands I used to generate these files:
samtools sort -n unmapped.bam unmapped_sort
bam2fastx -q -Q -A -o new.fastq unmapped_sort.bam
cd /Directory/OrganismAnnotation
tophat -p 8 -I 500 -r 200 --mate-std-dev 80 --no-mixed -o /NewTophat/Directory OrganismAnnotation /Directory/With/Files/new.1.fastq Directory/With/Files/new.2.fastq
Versions: I used TopHat 2.0.9 and samtools 0.1.19.
Another possibility is that unmapped.bam contains a lot of reads with poor quality. When I try to omit those from the new fastq files (by running bam2fastx -q -A
instead of bam2fastx -q -Q -A
) and then rerun tophat, then I get an error, "could not find mate pair for read...." which suggests that a lot of the reads in unmapped.bam had one read of good quality and another mate of poor quality. So if I remove those low quality reads, tophat throws an error because it's expected paired end data but getting a mix of paired reads and reads whose mates were thrown out for being low quality. If that's the case, then surely there is some way to pull only unmapped mates of good quality from unmapped.bam?
When I run samtools flagstat unmapped_sort.bam
, this is the output:
8903498 + 2048970 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
0 + 0 mapped (0.00%:0.00%)
8903498 + 2048970 paired in sequencing
4380074 + 1096160 read1
4523424 + 952810 read2
0 + 0 properly paired (0.00%:0.00%)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:0.00%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
BUT if there really are 0 properly paired reads, then how in the TopHat2 output are there 62 concordant aligned pairs? I think unmapped.bam must be missing samtools flags. Any ideas for a workaround?
I sort the unmapped.bam file, then I use Bowtie2 to align the unmapped reads.
How do I check if a read in unmapped_sort.bam has a corresponding mate? I ran flagstat on unmapped_sort.bam and it seems to be missing that key information.
As for introns, I've read conflicting things about whether to use an intron size of zero for bacterial genomes or not. I'll try it and see if it helps.