Hi guys, I have mapped illumina paired end RNA-Seq data with latest tophat2 and bowtie2. I have 2 simple questions.
- All the time the bam files reports 100% mapping (samtools flagstats). Does tophat+bowtie pipeline really mapped 100% of the reads ?
- In one sample properly paired reads percentage is around 82% and in another 25%. Does it mean in the second case mapping is bad ?
1st case
177722025 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
177722025 + 0 mapped (**100.00%**:-nan%)
177722025 + 0 paired in sequencing
90536514 + 0 read1
87185511 + 0 read2
146263928 + 0 properly paired (**82.30%**:-nan%)
166719032 + 0 with itself and mate mapped
11002993 + 0 singletons (6.19%:-nan%)
8545608 + 0 with mate mapped to a different chr
383634 + 0 with mate mapped to a different chr (mapQ>=5)
2nd case
36171595 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
36171595 + 0 mapped (**100.00%**:-nan%)
36171595 + 0 paired in sequencing
18565905 + 0 read1
17605690 + 0 read2
9163074 + 0 properly paired (**25.33%**:-nan%)
29079258 + 0 with itself and mate mapped
7092337 + 0 singletons (19.61%:-nan%)
4041802 + 0 with mate mapped to a different chr
139464 + 0 with mate mapped to a different chr (mapQ>=5)
[gbogu@pizza bam]$ samtools flagstat es_rep3.bam
TopHat writes the unmapped reads in a separate BAM file.
Maybe the pipeline does not write alignments for unmapped reads?
@Jared : When I counted the no.of lines in right and left end fastq files, I had different numbers. In 2nd case for example, 153181608 lines were there. Does it mean my mapping percentage is only 23% (36171595/153181608) ? Is this expected ? Do you have any idea about general mapping percentage of paired end RNA-seq reads ?
I understood that. Thanx. I just checked ENCODE BAM files as well. If I count the number of mapped reads in BAM file by the total number of reads in FASTQ and multiply with 100, I am only getting ~30% mapping for almost all the files. Is this how one calculates mapping percentage ?
You should be extremely cautious with samtools flagstats. It does count mapping loci and not mappable mates. If one single mate maps to multiple loci on the genome, samtools falgstat counts it multiple times. A simple way to get the numer of mappable mates would be 'samtools view -f 0x40 file.bam | cut -f1 | sort | uniq | wc -l' for the first mate and 'samtools view -f 0x80 file.bam | cut -f1 | sort | uniq | wc -l' for the second mate. Sum up the two numbers and you end up with the number of mappable mates. When dividing this number with the number of input mates, you get your percentage of interest. To get the number of input mates, you can use 'cat input_1.fastq | paste - - - - | wc -l' and multiply it with 2, because of the second mate. Getting the number of mapped paired-mates is not so trivial... therefore I would recommend writing a small script.
@David: You should have posted your comment as answer. thanx. What is the purpose of paste --- ? I can just use wc -l file.fastq ? Isn't the same ?
Paste takes as many lines as you write the '-' character and returns them in one line (tab-sep). Basically you get one fastq-entry in one line, instead of four. Just try 'cat input.fastq | head' and 'cat input.fastq | paste - - - - | head' and see what I mean. When you do 'cat input.fastq | wc -l', you have to divide the number by four.
I use 'paste' a lot for modifying fastq-files in the command line. You don't need any counter to distinguish in what line you are (header, sequence, info, or quality), since they are just column 1-4 now.
what is the "new" Tophat2? O.,o
Doesn't
align_summary.txt
in the Tophat output file give this information?