I have aligned paired-end reads with TopHat2. In the resulting BAM file, some reads are flagged as "read mapped in proper pair" (their flags include bit 0x2) even though the two mates map to different chromosomes (!).
I called TopHat2 with --mate-inner-dist -139 and --mate-std-dev 50. Unless I misunderstand the definitions of these terms, could the negative mate-inner-dist have messed something up?
I think that a read "mapped in proper pair" is the same as "concordant alignment". The definition of the latter is:
A pair that aligns with the expected relative mate orientation and with the expected range of distances between mates is said to align "concordantly".
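Under that definition, two mates on different chromosomes have no distance between them at all, so a record with the proper-pair bit set and a mate on another reference looks inconsistent. A minimal sketch for spotting such records in SAM text follows. The function name is hypothetical, and the two example lines are abbreviated copies of the records from this post, with SEQ and QUAL replaced by "*" and the optional tags dropped.

```python
def is_proper_pair_across_chromosomes(sam_line):
    """Return True if a SAM record has the proper-pair bit (0x2) set
    while RNEXT names a different reference than RNAME."""
    # split() on any whitespace so tab- and space-separated copies both work
    fields = sam_line.split()
    flag = int(fields[1])
    rname, rnext = fields[2], fields[6]
    proper = bool(flag & 0x2)
    # "=" means same reference as RNAME, "*" means no mate information
    mate_elsewhere = rnext not in ("=", "*") and rnext != rname
    return proper and mate_elsewhere


# Abbreviated records (assumption: SEQ/QUAL replaced by "*", tags omitted)
rec_a = "A01056:33:HF3NFDSXY:1:2516:13657:30718\t435\t1\t91387362\t0\t117M\t21\t8218147\t0\t*\t*"
rec_b = "A01056:33:HF3NFDSXY:1:2516:13657:30718\t371\t21\t8218147\t0\t112M\t1\t91387362\t0\t*\t*"

for rec in (rec_a, rec_b):
    print(rec.split()[1], is_proper_pair_across_chromosomes(rec))
```

Both example records are reported as proper pairs whose mate sits on another chromosome.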
These are two records from the mapped file:
A01056:33:HF3NFDSXY:1:2516:13657:30718 435 1 91387362 0 117M 21 8218147 0 CCTGTGGTAACTTTTCTGACACCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTCACGGTCTGTATTCGTACTGAAAATCAAGATCAAGCGAGCTTTTGCC :FF:F:FFFF:FFFFFFFFFFFFFF:FF,FFF,FFFFFF:FFF:FFFFF:FF:FF:FFFFFFF:FFFFFFFFF:FFFFFFFFF,FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:117 YT:Z:UU NH:i:20 CC:Z:= CP:i:91387362 XS:A:- HI:i:2
A01056:33:HF3NFDSXY:1:2516:13657:30718 371 21 8218147 0 112M 1 91387362 0 GGGCAAAAGCTCGCTTGATCTTGATTTTCAGTACGAATACAGACCGTGAAAGCGGGGCCTCACGATCCTTCTGACCTTTTGGGTTTTAAGCAGGAGGTGTCAGAAAAGTTAC :F:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF,FFFFFFF:FFFF::FFFFFFFFFF:FFFFFFFFFFFFFFFFFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:112 YT:Z:UU NH:i:20 CC:Z:GL000220.1 CP:i:161594 XS:A:+ HI:i:2
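For reference, the two flag values above (435 and 371) can be broken into their individual bits. This is only a small sketch: the bit meanings follow the FLAG field of the SAM specification, while the short names and the helper function are made up for illustration.

```python
# FLAG bits as defined in the SAM specification; the short names are ad hoc.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


for value in (435, 371):
    print(value, decode_flag(value))
```

Both values have the proper-pair bit (0x2) set, and both also have the secondary-alignment bit (0x100) set, which fits with the NH:i:20 tag on these records.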
And this is the command used to generate the alignment:
tophat --mate-inner-dist -139 --mate-std-dev 50 -o align/Sample10 -G /.../Homo_sapiens/Ensembl/GRCh38/Annotation/Genes/genes.gtf -N 10 --read-gap-length 5 --read-edit-dist 15 --segment-length 20 --read-realign-edit-dist 3 --no-coverage-search --library-type fr-firststrand -p 32 /.../Homo_sapiens/Ensembl/GRCh38/Sequence/Bowtie2Index/genome processed/Sample10_R1_clean_pe.fastq.gz processed/Sample10_R2_clean_pe.fastq.gz,processed/Sample10_R1_clean_se.fastq.gz,processed/Sample10_R2_clean_se.fastq.gz
(The _pe files contain reads whose mate also survived cleaning. The _se files come from the same paired-end sequencing run, but only one read of each pair remained after the pre-processing cleaning step.)
FYI: https://twitter.com/lpachter/status/937055346987712512?lang=fr
Thanks, but I'd still be glad if anyone could help me with this issue.
In the majority of cases in this file, however, the problem is not the SAM flag but the YT:Z:UU tag.
In this run, TopHat received both paired-end (PE) and single-read (SR) input. About 98% of the reads were PE. Despite that, in a subsample of the file about 98% of the alignments carry the YT:Z:UU tag.
This thread was on a similar topic: "Is There An Explanation For This Tophat "Yt" Descriptor Discrepancy In My Sam Output?"
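A rough sketch of how the YT tag fraction could be tallied from SAM text, in case someone wants to reproduce the count. In the Bowtie 2 manual, YT:Z:UU means the read was not part of a pair, CP a concordant pair, DP a discordant pair, and UP a pair that failed to align either way. The function name and the example records below are hypothetical and heavily abbreviated; they are not taken from my file.

```python
from collections import Counter


def count_yt_tags(sam_lines):
    """Count YT:Z values over SAM alignment lines, skipping headers."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split()
        # Optional tags start at the 12th column
        yt = next((f[5:] for f in fields[11:] if f.startswith("YT:Z:")), "missing")
        counts[yt] += 1
    return counts


# Hypothetical, abbreviated records for illustration only
example = [
    "@HD\tVN:1.0",
    "r1\t435\t1\t100\t0\t10M\t21\t200\t0\t*\t*\tYT:Z:UU\tNH:i:20",
    "r2\t99\t1\t100\t50\t10M\t=\t300\t210\t*\t*\tYT:Z:CP",
    "r2\t147\t1\t300\t50\t10M\t=\t100\t-210\t*\t*\tYT:Z:CP",
]

counts = count_yt_tags(example)
total = sum(counts.values())
for tag, n in counts.items():
    print(tag, n, f"{n / total:.0%}")
```

The same function should work on lines read from the text output of samtools view, since it only relies on whitespace-separated columns.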