One can do this with a mix of bedtools (version 2.18.2 or later) and basic scripting. The example below is based on the human genome, build hg19.
1. Make a BED file of windows (say 100Kb) across the genome.
bedtools makewindows -g hg19.chromsizes \
-w 100000 \
> hg19.100Kb.windows.bed
2. Get the observed BAM coverage for each window (your files must be position-sorted). The -c option counts the number of reads overlapping each window.
bedtools intersect -a hg19.100Kb.windows.bed \
-b aln.bam \
-c -sorted \
> hg19.100Kb.windows.counts.bedg
3. Append the GC content for each window. The sixth column of the output will be the GC content.
bedtools nuc -fi hg19.fa \
-bed hg19.100Kb.windows.counts.bedg \
| cut -f 1-4,6 \
> hg19.100Kb.windows.counts.gc.bedg
4. Now you need to normalize by GC. The best way to do this is in two passes. First, loop through the file above and keep a list of the depths observed for each GC bin. I would break the GC into bins with two-decimal precision. For example, while tracking this, you might observe:
GC = 0.55. Depths = [235, 342, 288, etc.]
GC = 0.65. Depths = [110, 134, 155, etc.]
The counts in each bin should be roughly normally distributed. Thus, once you have computed the mean and standard deviation of the counts for each GC bin, a second pass through the file lets you convert the observed count for each genomic window into a Z-score: the number of standard deviations the count lies from the mean count observed for all genomic windows with the same GC (the fifth column in the file we created in step 3). For example, Z = 3 indicates that the count is 3 standard deviations above the mean for that GC bin; Z = -3 means 3 standard deviations below. The Z-score is computed as: (observed_depth - mean_depth_for_GC) / stddev_for_GC.
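As a rough illustration, here is a minimal Python sketch of that two-pass logic. It assumes the file from step 3 has five tab-delimited columns (chrom, start, end, count, GC fraction) plus the header line that bedtools nuc adds; the input file name and the appended Z-score column are placeholders, not part of any bedtools output.

import sys
from collections import defaultdict
from math import sqrt

# Sketch of the two-pass GC normalization described above.
# Assumes five tab-delimited columns: chrom, start, end, read count, GC fraction.
infile = sys.argv[1] if len(sys.argv) > 1 else "hg19.100Kb.windows.counts.gc.bedg"

# Pass 1: collect the observed depths for each GC bin (two-decimal precision).
depths_by_gc = defaultdict(list)
with open(infile) as f:
    for line in f:
        if line.startswith("#"):   # skip the header line from bedtools nuc
            continue
        chrom, start, end, count, gc = line.rstrip("\n").split("\t")[:5]
        depths_by_gc[round(float(gc), 2)].append(float(count))

# Mean and standard deviation of the counts in each GC bin.
stats = {}
for gc_bin, depths in depths_by_gc.items():
    mean = sum(depths) / len(depths)
    sd = sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    stats[gc_bin] = (mean, sd)

# Pass 2: convert each window's count into a Z-score for its GC bin.
with open(infile) as f:
    for line in f:
        if line.startswith("#"):
            continue
        chrom, start, end, count, gc = line.rstrip("\n").split("\t")[:5]
        mean, sd = stats[round(float(gc), 2)]
        z = (float(count) - mean) / sd if sd > 0 else 0.0
        print("%s\t%s\t%s\t%s\t%s\t%.3f" % (chrom, start, end, count, gc, z))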
Here is a little script I wrote a few months ago for this. It uses the pybedtools package to combine Python with the bedtools concepts outlined above. Hopefully this will give you a sense of how to carry out the above strategy, or how to modify it to your specific needs.
https://github.com/arq5x/codachrom/blob/master/windowed_coverage.py
This looks quite slick and useful, thanks! However, I can't get the 'bedtools intersect' to work, because it does not accept a bam file as the second file (-b parameter). I am using bedtools version 2.17.0. Also, any ideas how to get normalized read counts instead of Z scores?
Sorry, you must use version 2.18.2 or later. Preferably 2.19, which was just released the other day. Will edit the answer.
I piped the output of bedtools bam2bed into the intersect instead, and it works.
Is it possible that bedtools nuc counts GC content from the FASTA, not from the BAM file? Do you have any idea how to calculate GC content from a BAM? Thank you for any advice.
Thank you for the commands, but I cannot run this one. I get the error below. I'm a bioinformatics student, so please advise.
Can I compute mappability from the BAM file in the same way as GC content?
I am wondering how I can remove GC-content bias with LOESS regression?