I was given some TCGA BAM files and asked to perform a realignment with some specific requirements. While perusing the results of an alignment in IGV I noticed something strange. As far as I can tell, everything in the read data pop-up dialogs tells me that I'm looking at paired-end reads that mapped as pairs, except for the YT tag which is always UU.
The read names in a mapped pair are 100% identical and pulled from separate FASTQ files. I'm seeing this with every read I check, and I've spot checked reads from random places on five different chromosomes.
Here's the TopHat v2.0.9 command that I ran:
/usr/local/bin/tophat --output-dir /data/deedee/rnaseq/efb596b4 --max-multihits 2 -p 4 --b2-very-sensitive --library-type fr-unstranded /data/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome efb596b4_R1.fastq efb596b4_R2.fastq
Does anyone have any ideas about what's going on here? More background follows in case it's useful:
In the initial BAM file, the read names were a mess. They had /1 and /2 attached to the end of the read names, sometimes twice. I wrote a script to remove these /1 and /2 values from the ends of the read names. I used bedtools bamtofastq to convert these query-sorted, cleaned BAM files to a pair of FASTQ files. From there I ran the tophat command above.
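For anyone facing the same read-name problem, here is a minimal sketch of that kind of cleanup. It is not the original script: the function name strip_mate_suffix is made up for illustration, and it assumes the suffixes are a literal /1 or /2, possibly repeated, at the very end of the name.

```python
import re

# Assumption: mate suffixes are "/1" or "/2", repeated one or more times,
# and only at the very end of the read name.
_MATE_SUFFIX = re.compile(r"(?:/[12])+$")


def strip_mate_suffix(read_name: str) -> str:
    """Return read_name with any trailing /1 or /2 suffixes removed."""
    return _MATE_SUFFIX.sub("", read_name)
```

Anchoring the pattern at the end of the string means a /1 or /2 that appears in the middle of a read name is left alone.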
I wonder if this is an artifact of how the reads are aligned. Since the pairs are aligned separately, in part at least, I wonder if tophat just doesn't reset this auxiliary tag.
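One way to check this across a whole file rather than spot-checking in IGV is to tally the YT value against the pairing bits in the SAM FLAG field. In Bowtie 2, YT:Z:UU means the read was aligned as an unpaired read, while CP, DP and UP describe reads aligned as part of a pair. If every record has the paired (0x1) and proper-pair (0x2) bits set but still carries YT:Z:UU, that would be consistent with the tag being left over from the per-mate alignment step. The sketch below is only an illustration: the function name tally_yt is made up, and it assumes plain SAM text lines as input.

```python
from collections import Counter

PAIRED = 0x1        # SAM FLAG bit: read is paired in sequencing
PROPER_PAIR = 0x2   # SAM FLAG bit: read is mapped in a proper pair


def tally_yt(sam_lines):
    """Count SAM records by (is_paired, is_proper_pair, YT value)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        yt = "missing"
        # Optional tags start at the 12th column, formatted TAG:TYPE:VALUE.
        for tag in fields[11:]:
            if tag.startswith("YT:Z:"):
                yt = tag[5:]
                break
        key = (bool(flag & PAIRED), bool(flag & PROPER_PAIR), yt)
        counts[key] += 1
    return counts
```

The function accepts any iterable of lines, so it could be called on sys.stdin in a small script fed by samtools view on the accepted_hits.bam file that TopHat writes to the output directory.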
Very interesting suggestion. I'm going to pursue this further and see what I can find out. Thanks!
Please report back if that turns out to be the case (or not). I'd like to know as well!
I checked some output generated by a colleague and I'm seeing the same thing in those data as well. I bet your suggestion is correct. I went ahead and posted on the Tuxedo Tools message board to see if they can confirm.