This is a very long question. But hopefully some of you will find it interesting.
BACKGROUND
I was given several hundred BAM files representing samples from cancer patients. Judging by the read names and quality scores, these were HiSeq PE-50 reads. The BAMs were created using MapSplice and RSEM.
I was asked to take these BAM files and realign the reads with parameters such that there would be no discordant alignment and no mismatches. The investigator also wanted to omit any reads which mapped to more than one location under these criteria. The read names on the original BAMs all ended with one or more of /1
or /2
, indicating which one of the pair the read belonged to. All of the BAM to FASTQ conversion scripts I tried didn't like this convention, so I wrote a quick script to trim the /1
and /2
off of each read name. I sorted the BAMs by read name and was then able to create FASTQ.
PROCEDURE
After some trial and error I settled on the following parameters for TopHat2 2.0.9:
--no-mixed --max-intron-length 800000 --no-discordant --output-dir [output] --max-multihits 2 -p 8 --b2-very-sensitive --library-type fr-unstranded [genome] [fastq1] [fastq2]
For the --max-multihits 2
, my intention was to assess the proportion of reads that mapped to more than one location, examine them in IGV and perhaps calculate specific metrics based on what I found, and then filter them out in accordance with initial requirements afterwards.
While the original TCGA BAMs had some messy alignments (inconsistent read-pair orientation, lots of singleton and discordant alignments, highly-inconsistent insert size), the alignments I got with the Tophat2 parameters above resulted in some nice-looking alignments with an alignment rate of 94%.
PROBLEM
The investigator and his team were happy with what they saw, until one person noticed one particular splice junction that wasn't being replicated in the TopHat alignments. It was located at chr8:128,427,270 - 128,427,687. Here's a screenshot from IGV which shows what I'm talking about. The top half shows my alignments. The bottom half shows the original alignments.
TROUBLESHOOTING
I examined the CIGAR strings for the reads and saw that they were mapping perfectly. Each had a CIGAR string akin to [X]M[Y]N[Z]M
, indicating X
perfectly matched bases, an insert of Y
bases, and Z
perfectly matched bases. The original program didn't create an alignment score, but the mapping quality scores were typically 66.
I created a text file with the read names, and used this to search the TopHat2 accepted_hits
and unmapped.bam
files. To my surprise, all of the reads were turning up in unmapped.bam
.
Given that the unmapped.bam file was very small in relation to the accepted_hits.bam file, (150 MB vs 8 GB) I decided to take a brute-force approach with these mysterious reads. I converted the unmapped.bam read pairs into two FASTQ files. I ran these FASTQ files through TopHat with increasing levels of leniency. I also tried aligning the reads as single reads instead of paired-end. Some of the settings I used include:
Being more lenient by allowing discordant & mixed alignments and using --b2-very-fast
. I also tried leaving off the --library-type
parameter.
--max-intron-length 800000 --output-dir [output] --max-multihits 2 -p 8 --b2-very-fast --library-type fr-unstranded [genome] [fastq1] [fastq2]
This resulted in at 38% overall read alignment rate
Mapping the FASTQ files separately as single-read (I also used --b2-very-fast):
--max-intron-length 800000 --output-dir [output] --max-multihits 2 -p 8 --b2-very-sensitive [genome] [fastq1]
When I do this with lenient settings, 61.2% of the forward reads and 39.2% of the reverse reads from the unmapped reads align.
Supplying a GTF file:
--no-mixed --max-intron-length 800000 --no-discordant --output-dir [output] --max-multihits 2 -p 8 --b2-very-sensitive --library-type fr-unstranded -G [UCSC genes.gtf] [genome] [fastq1] [fastq2]
This resulted in a 39.6% overall read alignment rate.
I looked at each of the accepted_hits files in IGV, but in every case the read which originally mapped across the splice junction in the original BAM would not replicate that splice junction. For the single-read files, I saw that the read which was the mate of the read aligning across the splice junction of interest mapped to the genome, usually in the same location as the original alignments. Rarely, a read which originally mapped to the splice junction of interest would map elsewhere in the genome, but typically the read would end up in unmapped.bam
.
I'm totally baffled. I verified that each base of these reads which TopHat throws out is a perfect match to the reference. While I have found a few other splice junctions in the original TCGA BAM files which aren't replicated in the TopHat alignments, the vast majority of splice junctions are, and these matches have a wide range of size.
Does anyone have any idea what could be going on here?
UPDATE 11 October
I excised this 2000-base region of interest containing the mystery splice junction from the reference genome and aligned the original set of reads (not just the unmapped) against this tiny genome snippet in single-read mode using the following command:
--segment-length 15 --output-dir [output] --b2-very-sensitive ./chr8roi/chr8roi [fastq1/2]
I found that, when I combined the results of both alignments, most of the reads with the original splice junction of interest mapped to this ROI and included the splice! But these are the same reads that get thrown away by Tophat when I align against the full reference! I even double-checked by running the same command against the full reference:
-p 8 --segment-length 15 --output-dir [output] --b2-very-sensitive [genome] [fastq1/2]
Still confused. Surely multi-threading isn't the problem? I'm going to do a single-threaded run over the weekend and see if that makes any difference.
This region appears to be highly repetitive (looking at RepeatMasker track in UCSC). My guess is that the reads are mapping to more than 2 locations, so the "--max-multihits 2" command is causing the alignment to not appear. Have you tried to Blat one of the reads to see how many results come up? Also check the IH tag in the MapSplice bam to see the number of locations that the read is mapped to.
Thanks for the suggestion Chris. I will definitely Blat these. However, all of these reads are showing up in unmapped.bam, indicating that TopHat did not align them anywhere. Your observation about the repetitiveness is helpful too.
Interesting observation. In the mapslice results, were all of the reads indicating splicing there split reads (i.e., they had N in the CIGAR string) or were there also some where reads in a pair mapped on opposite sides of the junction (i.e., they gave evidence for the junction but lacked an N CIGAR operation)?
Good question @dpryan79. In the mapsplice results, all of the reads have CIGAR strings matching the
[X]M [Y]N [Z]M
pattern. Thanks!Interesting, without any pairs that span that junction, I wonder if it's actually real. Have you tried decreasing --segment-length (to maybe 20) or, better yet, done a run of the unmapped reads with --keep-tmp? Tophat should be splitting those reads into segments (default size 25bp) and then trying to map the segments. With --keep-tmp, you might be able to track how some of those individual segments are mapping.
I have not but I will definitely do that right now. Thanks for the tip! This is exactly what I was looking for--ideas!
From your update, I wonder if these reads are otherwise multi-mapping. I've never looked to see how tophat2 deals with segments (as opposed to whole reads) that multi-map.
Dan: Thanks for sharing your experience with us. After all you have examined, I was wondering which of these tools you recommend for splice junction study( MapSplice or Tophat2).