I have some single cell sequencing data and I did some quick statistics on them to give feedback about the sequencing kits used (RepliG and PicoPlex). For some samples, I'm getting some discrepancies in the numbers and I was wondering if that's expected or is it an issue.
Here's some data for one of the sample:
samtools flagstat results
66783127 + 0 in total (QC-passed reads + QC-failed reads)
8141047 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
66760206 + 0 mapped (99.97% : N/A)
58642080 + 0 paired in sequencing
29321040 + 0 read1
29321040 + 0 read2
53618530 + 0 properly paired (91.43% : N/A)
58597314 + 0 with itself and mate mapped
21845 + 0 singletons (0.04% : N/A)
274662 + 0 with mate mapped to a different chr
190889 + 0 with mate mapped to a different chr (mapQ>=5)
According to samtools stats read length is 151 for 58,642,080 reads. So total bases = 151 * 58642080 = 8,854,954,080
Breadth and depth calculated based on samtools depth :
breadth = 35.44% depth = 26.49
Total bytes / sum of all depths as per samtools depth for all bases = 2,234,214,697
Total bytes / sum of all depths as per bedtools genomecov for all bases = 7,909,149,477
Breadth and depth as per bedtools coverage (depth is higher than before when calculated using samtools depth which is expected)
breadth = 35.34% depth = 123.304
As per my understanding, samtools depth only counts the reads with primary alignment and ignore secondary/supplementary alignments whereas bedtools genomecov counts everything that's in the bam file.
Now, my question is this, since there's a huge difference between the numbers in points 3 & 4 and point 4 is closer to point 1, does this mean this sample only has about 1/4th reads (designated as primary alignments) which are useful? I'm still trying to figure out if we need to eliminate the secondary/supplementary alignments and just keep primary ones or use all of them.
Plotting % targeted bases v read depth shows only about 5% bases with read depths 10+. Do you have any suggestions how this data can be improved or what is causing such issues? I see this in few single cell samples and not all of them.
TIA