Entering edit mode
11.1 years ago
Pierre Lindenbaum
164k
EDIT: Cross posted on SE: http://seqanswers.com/forums/showthread.php?p=122835
I run tophat2 using the standard options.
$ tophat2 -v
TopHat v2.0.10
$ samtools view -H TOPHAT/accepted_hits.bam | grep PG
@PG ID:TopHat VN:2.0.10 CL:/commun/data/packages/tophat-2.0.10.Linux_x86_64/tophat -p 10 -G genes.gtf -o TOPHAT --rg-id g24 --rg-library 6VGWT3 --rg-sample 6VGWT3 --rg-description 6VGWT3 34 fastqs --rg-platform-unit 1 2 3 4 --rg-center Nantes --rg-platform Illumina mm10 6VGWT3_ATGTCA_L002_R1_002.fastq.gz,6VGWT3_ATGTCA_L004_R1_002.fastq.gz,6VGWT3_ATGTCA_L003_R1_002.fastq.gz,6VGWT3_ATGTCA_L002_R1_003.fastq.gz,6VGWT3_ATGTCA_L004_R1_003.fastq.gz,6VGWT3_ATGTCA_L003_R1_003.fastq.gz,6VGWT3_ATGTCA_L002_R1_004.fastq.gz,6VGWT3_ATGTCA_L004_R1_004.fastq.gz,6VGWT3_ATGTCA_L003_R1_004.fastq.gz,6VGWT3_ATGTCA_L004_R1_001.fastq.gz,6VGWT3_ATGTCA_L002_R1_001.fastq.gz,6VGWT3_ATGTCA_L003_R1_001.fastq.gz,6VGWT3_ATGTCA_L001_R1_002.fastq.gz,6VGWT3_ATGTCA_L001_R1_003.fastq.gz,6VGWT3_ATGTCA_L001_R1_004.fastq.gz,6VGWT3_ATGTCA_L001_R1_001.fastq.gz 6VGWT3_ATGTCA_L002_R2_002.fastq.gz,6VGWT3_ATGTCA_L004_R2_002.fastq.gz,6VGWT3_ATGTCA_L003_R2_002.fastq.gz,6VGWT3_ATGTCA_L002_R2_003.fastq.gz,6VGWT3_ATGTCA_L004_R2_003.fastq.gz,6VGWT3_ATGTCA_L003_R2_003.fastq.gz,6VGWT3_ATGTCA_L002_R2_004.fastq.gz,6VGWT3_ATGTCA_L004_R2_004.fastq.gz,6VGWT3_ATGTCA_L003_R2_004.fastq.gz,6VGWT3_ATGTCA_L004_R2_001.fastq.gz,6VGWT3_ATGTCA_L002_R2_001.fastq.gz,6VGWT3_ATGTCA_L003_R2_001.fastq.gz,6VGWT3_ATGTCA_L001_R2_002.fastq.gz,6VGWT3_ATGTCA_L001_R2_003.fastq.gz,6VGWT3_ATGTCA_L001_R2_004.fastq.gz,6VGWT3_ATGTCA_L001_R2_001.fastq.gz
I found some sam flags 115 !
115=
- read paired
- read mapped in proper pair
- read reverse strand
- mate reverse strand
- first in pair
.
$ samtools view -f 115 -F 256 dir/accepted_hits.bam | head -n 2
HWI-1KL149:61:D2C11ACXX:4:2204:6848:94129 115 chr1 24611547 1 70M2D31M chrM 10906 0 GTAGGCGATTAGTGATTTTAAATCTGTTTGGCGTAAGCAGATTGAGCTAGTTATAATTATTCCTCATAGGGAGAAGGATGAAGGGGTATGCTATATATTTT DDDDDBDDDDDEEEEEEFFFFFFFHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJIJJJJJJJJJJJJJJHHHHHFFFFFCCC AS:i:-11 XN:i:0 XM:i:0 XO:i:1 XG:i:2 NM:i:2 MD:Z:70^GA31 YT:Z:UU NH:i:4 CC:Z:chrM CP:i:10928 HI:i:2 RG:Z:g24
HWI-1KL149:61:D2C11ACXX:2:2109:12004:4228 115 chr1 24611549 1 101M chrM 10815 0 TGGCGATTAGTGATTTTAAATCTGTTTGGCGTAAGCAGATTGAGCTAGTTATAATTATTCCTCATAGGGAGAGAAGGATGAAGGGGTATGCTATATATTTT DDDDDDDDDDDEEEEEEEFFFFFHHHJJJJJJJJJJJJJJJJJJIJJJJIJJJJJJJJIJJJJIGJJJJJJJJJJIJJJJJJJJJJJJHHHHHFFFFFCCC AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0A100 YT:Z:UU NH:i:4 CC:Z:chrM CP:i:10928HI:i:2 RG:Z:g24
how can a read be "mapped in proper pair" and read reverse strand+ mate reverse strand ? what is the consequence for a tool like htseqcount ? Does it only count the reads in proper pair ?
Note the
HI:i:2
auxiliary flag. When I still used tophat2, I never had it output more than a single hit. I presume that this is the source of the issue. I suspect that if you grep for those reads, you'll find that the mate is forHI:i:1
, which would make sense when looking at the RNEXT, PNEXT and TLEN columns and theCC
andCP
auxiliary fields. This is my only guess as to why this might be the case (I'd post this as an answer if I actually knew!).Regarding htseq-count, just use the MAPQ threshold and set it above 4 to get rid of multimappers from tophat.
you have strange RNEXT, PNEXT and TLEN columns as well. why is the chromosome name there? shouldn't be another read name or
=
yes. That's strange but that's really the raw output for tophat2.
no idea here ? I'll ask SE...