Hi Biostars community,
I recently built a script around the commands from this question to obtain read-depth data for about 22,000 small regions (<=50 bp each). The BAM file is whole-exome NGS data from the 1000 Genomes Project:
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/other_exome_alignments/HG00096/exome_alignment/HG00096.mapped.illumina.mosaik.GBR.exome.20111114.bam
To choose my ~22,000 regions, I used the 1000 Genomes list of pulldown targets and used liftOver to convert them to hg18 coordinates, because that is the assembly the reads are aligned to in this paper (see supplement). I looked only at chromosome 22, and I split up any region larger than 50 bp.
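For what it's worth, the splitting step looks like this. This is a minimal sketch, not my exact script; the function names are mine, and it assumes plain 3-column BED input:

```python
def split_region(chrom, start, end, max_len=50):
    """Yield (chrom, start, end) windows of at most max_len bp,
    covering the half-open interval [start, end)."""
    for s in range(start, end, max_len):
        yield (chrom, s, min(s + max_len, end))

def split_bed(lines, max_len=50):
    """Split every interval in an iterable of BED-format lines
    (chrom, start, end in the first three tab-separated columns)."""
    out = []
    for line in lines:
        chrom, start, end = line.split()[:3]
        out.extend(split_region(chrom, int(start), int(end), max_len))
    return out
```

So a 120 bp target becomes windows of 50, 50, and 20 bp.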
My question: why are there only 878 regions with nonzero read depth? Is that to be expected?
Thanks for your advice.
The paper you are citing is the "pilot" paper, written six years ago. The more recent one is the phase 1 paper. 1000 Genomes uses hg19/GRCh37 throughout now.
The phase 3 paper is still forthcoming, but the data are already up on the FTP.
I switched to hg19 coordinates, but the problem persists. It would help to have a fast, well-tested method to obtain read depth from BAM files. Currently it takes me hours to process a single chromosome, and both of the APIs I tried (pysam and pybedtools) crashed either my Python kernel or my entire machine.
No need for pysam/pybedtools. Just run: samtools depth -b interval.bed -r 22 input.bam
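To expand on that: samtools depth writes one tab-separated line per position (chromosome, 1-based position, depth), so summarising it per region takes only a few lines of Python. A rough sketch, assuming 0-based half-open region coordinates as in BED (the function name is mine, and the linear scan over regions is only fine for small inputs; for ~22,000 regions you would want an interval index):

```python
from collections import defaultdict

def mean_depths(depth_lines, regions):
    """Mean depth per region from `samtools depth` output lines.

    depth_lines: iterable of "chrom\tpos\tdepth" strings (pos is 1-based).
    regions: list of (chrom, start, end) tuples, 0-based half-open.
    Returns {region_index: mean_depth} for regions with nonzero depth.
    """
    totals = defaultdict(int)
    for line in depth_lines:
        chrom, pos, depth = line.rstrip("\n").split("\t")
        pos, depth = int(pos), int(depth)
        for i, (rchrom, start, end) in enumerate(regions):
            # samtools depth positions are 1-based; BED starts are 0-based,
            # so position p falls in [start, end) when start < p <= end
            if rchrom == chrom and start < pos <= end:
                totals[i] += depth
    return {i: t / (regions[i][2] - regions[i][1]) for i, t in totals.items()}
```

Regions absent from the result dictionary are exactly your zero-depth regions, which should make the "only 878 nonzero" check easy to rerun.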
In fact, switching to hg19 coordinates has resolved the problem. Thank you also for the advice on samtools depth; I will try that.