I am trying to get a better understanding of how tophat works. To do this I have created some toy examples. I am looking at a splice junction between two exons in chr9 of hg38. Exon 1 and exon 2 are at base positions 133261900 - 133262400 and 133274600 - 133276150 respectively. Below is my synthetic data
Read 1 (R1.fastq):
@exon 2, line 04...
ATTATTAGGAAAAGGATCATAGGTCGAAGTGCGTGGCATTTTGGTTTTCC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@near exon 1...
GCCTGTTGCCCAGGTGAGGCTAAAATAAGGGAGCAAGTAAGCACACCCTC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
Read 2 (R2.fastq):
@exon 1, line 05-06... # starts here
GGCTCCGCGTCTGGTCTCGGCCTCCGCCTTCCGCCCGGCGGCCCTGGGGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@near exon 1
CGCCAGAGCTGGACGCAGGCAATAACTGTCCTTAGGACCCTGATAACTGT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
The gene is on the opposite strand with respect to the reference and I am trying to emulate paired end reads. The first entry of R1.fastq is part of exon 2 and first entry of R2.fastq is part of exon 1. The result of running
tophat2 -p 2 -o . --library-type fr-unstranded /path/to/hg38/Bowtie2Index/genome R1.fastq R2.fastq
yields 2 discordant pair alignments, but no junctions. Replacing the first line of R2.fastq with its reverse complement yields 2 concordant pair alignments, but no junctions.
Question : How would I modify my data set to get tophat to recognize a junction?