Dear all,
I have paired end reads from my sample and the total number of raw reads is 132,938,316.
1) After applying trimmomatic, I have
R1 66,113,117 reads
R2 66,113,117 reads
total 132,226,234 reads
Output from Trimmomatic:
Input Read Pairs: 66469158
Both Surviving: 66113117 (99.46%)
Forward Only Surviving: 330055 (0.50%)
Reverse Only Surviving: 24010 (0.04%)
Dropped: 1976 (0.00%)
TrimmomaticPE: Completed successfully
Difference in reads pairs (raw-survived) => (66,469,158 - 66,113,117) = 356,041 (phred score <=20)
Difference in reads (Raw-Trimmed) => (132,938,316 - 132,226,234) = 712,082
2) After aligning using tophat
TopHat Output align summary:
Left reads:
Input : 66,113,117
Mapped : 54521571 (82.5% of input)
of these: 4509697 ( 8.3%) have multiple alignments (78742 have >20)
Right reads:
Input : 66,113,117
Mapped : 53610826 (81.1% of input)
of these: 4412079 ( 8.2%) have multiple alignments (78753 have >20)
81.8% overall read mapping rate.
Aligned pairs: 51538017
of these: 4275585 ( 8.3%) have multiple alignments
780763 ( 1.5%) are discordant alignments
76.8% concordant pair alignment rate.
TopHat Output Prep_reads_info:
left_min_read_len=20
left_max_read_len=101
left_reads_in =66,113,117
left_reads_out=66,079,886
right_min_read_len=20
right_max_read_len=101
right_reads_in =66,113,117
right_reads_out=66,070,354
Both left reads input and Left_read_in have 66,113,117 sample value,
Questions:
- What is the difference between
left_reads_out
(66,079,886) and left reads (54,521,571)? - What is the difference between
right_reads_out
(66,070,354) and right reads (53,610,826 )?
Why are these numbers different?
What is the difference between 81.8% overall read mapping rate vs 76.8% concordant pair alignment rate?
3) samtools flagstat accepted_hits.bam
131374951 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
131374951 + 0 mapped (100.00%:-nan%)
131374951 + 0 paired in sequencing
66297524 + 0 read1
65077427 + 0 read2
107575718 + 0 properly paired (81.88%:-nan%)
124948346 + 0 with itself and mate mapped
6426605 + 0 singletons (4.89%:-nan%)
9330692 + 0 with mate mapped to a different chr
810494 + 0 with mate mapped to a different chr (mapQ>=5)
Questions:
a) what is this QC-passed reads (131,374,951)?
b) What is this Read1 (66,297,524) and Read2 (65,077,427)?
4) samtools flagstat unmapped.bam
24,093,837 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
0 + 0 mapped (0.00%:-nan%)
24093837 + 0 paired in sequencing
11,591,546 + 0 read1
12,502,291 + 0 read2
0 + 0 properly paired (0.00%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
5) samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l
Unique Reads: 56,594,380
What about this number 56,594,380?
samtools view accepted_hits.bam | cut -f1 | sort | wc -l
131,374,951
samtools view unmapped.bam | cut -f1 | sort | uniq | wc -l
Unique Reads:14,575,100
6) I used htseq-count to get gene counts for accepted_hits.bam
Globin Genes Stranded-Yes(Sense) Stranded-reverse (Antisense)
HBB 36,472 7
HBA1 20,398 14
HBA2 144,174 79
HBG1 2,825 0
HBG2 2,638 0
HBD 4 0
HBE1 0 1
HBZ 0 0
HBQ1 3 6
MB 0 4
CYGB 2 354
NGB 2 319
Total Globin 206,518 784
Total reads 126,970,700 126,970,700
% of Globin_reads 0.1627 0.0006
Again, I used the accepted_hits.bam
as in the input for htseq-count.
Questions:
Why the total reads (126,970,700) is different from other numbers?
To get percentage of reads, do I need to divide either by total reads (126,970,700) or unique reads: (56,594,380)?
Your question(s) is too polluted and confusing. I will just point to the fact that if you did not sort "accepted_hits.bam" by name (samtools sort -n), htseq-count will could each read from a pair separately, effectively doubling your read counts for paired-end sequencing.
Dear h.mon,
Actually, I need some clarifications on each section of this pipeline. So I provided the read counts in each step I performed in my pipeline. From sequencing, I received 66,469,158 these many raw read pairs (R1 and R2). After trimming using trimmomatic, I received 66,113,117 trimmed reads pairs (R1 and R2). 356,041 read pairs don't have phred score >20. So they are removed. It is clear till this point.
As you suggested, now I sorted bam file based on name and provided the sorted to htseq-count. Below is the output
To get the percentage of reads for globin genes, is it correct to use following values? ((75,935)/(68,900,778))*100 = 0.1102%