How can I check for very high coverage region in a bam file. I've several samples with good alignment rate (~90%) but when I count the reads for known genes (ensembl), I don't have good results compared to previous data. I think it's rRNA or other contamination. So how can I check for high coverage region in my data (more than 10000 reads per position) ?
I was going to suggest 'samtools depth' but that doesn't appear in the manual anymore, might still be functioning though (works on my cluster)... Essentially, depth used to give number of reads aligning to a region. So you would give it a BAM and the region and reference.fa and get your (per base) 'depth', then awk out those higher than 10000.
You could use the GATK DepthOfCoverage walker with a provided BED file. Usually these are per exon but any sort of intervals will do. I believe you can get per position coverage information from that tool as well. Otherwise for intervals you can set what sort of percentage/depth cutoffs you want to report.
You can generate a mpileup output with samtools. Then use a script to scan through the file to get the top X bases with the highest read mapping. You can use this top X list to narrow down the regions with highest coverage.
To get counts for each base, you can simply count the number of '.' or ',' in the 4th column of the mpileup output. A '.' (period) stands for read mapping to forward strand and ',' (comma) stands for read mapping to reverse strand.
Various relevant methods are discussed here: What is the fastest method to determine the number of positions in a BAM file with >N coverage?