Hey guys!
I am currently working on the analysis of NGS sequencing data and have some questions regarding the methodology for calculating base coverage using .bam files.
Initially, I merged the .bam files and utilized the samtools depth command to calculate the coverage. Then, I used a .bed file containing gene coordinates to identify the coverage for each gene. However, I later became concerned about whether this approach was appropriate, given that all samples were aligned to the same reference.
As an alternative, I calculated the coverage for each sample individually and then computed the average base coverage for each gene. However, the results were significantly different and much lower than expected compared to the previous approach.
I am particularly interested in understanding gene performance/coverage rather than coverage on a per-sample basis. I need base-by-base coverage information to determine which areas are not adequately covered. Given this context, I would like to know:
- Is it appropriate to merge .bam files and calculate coverage using samtools depth this way?
- Is calculating the average base coverage valid for understanding gene performance? What factors could lead to discrepancies in the results?
- Are there best practices for calculating base-by-base coverage for specific genes, rather than averaging across samples?
I appreciate any insights or suggestions you may have to share!
This is not making complete sense. If a base is covered by "m" reads in Sample_A and "n" reads in Sample_B then when you merge the BAM's the coverage should be additive. How did you merge the BAM's, you did not do any duplicate marking/removal correct?
Can you use
pandepth
(LINK) ormosdepth
(LINK) and check.