One can do this with a mix of bedtools (version 2.18.2 or later) and basic scripting. The example below is based on the human genome, build hg19.
1. Make a BED file of windows (say 100Kb) across the genome.
bedtools makewindows -g hg19.chromsizes \
-w 100000 \
> hg19.100Kb.windows.bed
2. Get the observed BAM coverage for each window (your files must be position-sorted). The -c option counts the number of reads overlapping each window.
bedtools intersect -a hg19.100Kb.windows.bed \
-b aln.bam \
-c -sorted \
> hg19.100Kb.windows.counts.bedg
3. Append the GC content for each window. The sixth column of the output will be the GC content.
bedtools nuc -fi hg19.fa \
-bed hg19.100Kb.windows.counts.bedg \
| cut -f 1-4,6 \
> hg19.100Kb.windows.counts.gc.bedg
4. Now you need to normalize by GC. The best way to do this is in two passes. First, loop through the file above and keep a list of the depths observed for each GC bin. I would break the GC into bins with two-decimal precision. For example, while tracking this, you might observe:
GC = 0.55. Depths = [235, 342, 288, etc.]
GC = 0.65. Depths = [110, 134, 155, etc.]
The counts in each bin should be roughly normally distributed. Thus, once you have computed the mean and standard deviation of the counts for each GC bin, a second pass through the file lets you convert the observed count for each genomic window into a Z-score: the number of standard deviations the count lies from the mean count observed for all genomic windows with the same GC (the fifth column in the file we created in step 3). For example, Z = 3 indicates that the count is 3 standard deviations above the mean for that GC bin; Z = -3 means 3 standard deviations below. The Z-score is computed as: (observed_depth - mean_depth_for_GC) / stddev_for_GC.
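As a rough illustration, here is a minimal Python sketch of that two-pass logic. It assumes the file from step 3 has five tab-delimited columns (chrom, start, end, count, GC fraction) plus the header line that bedtools nuc adds; the input file name and the appended Z-score column are placeholders, not part of any bedtools output.

import sys
from collections import defaultdict
from math import sqrt

# Sketch of the two-pass GC normalization described above.
# Assumes five tab-delimited columns: chrom, start, end, read count, GC fraction.
infile = sys.argv[1] if len(sys.argv) > 1 else "hg19.100Kb.windows.counts.gc.bedg"

# Pass 1: collect the observed depths for each GC bin (two-decimal precision).
depths_by_gc = defaultdict(list)
with open(infile) as f:
    for line in f:
        if line.startswith("#"):   # skip the header line from bedtools nuc
            continue
        chrom, start, end, count, gc = line.rstrip("\n").split("\t")[:5]
        depths_by_gc[round(float(gc), 2)].append(float(count))

# Mean and standard deviation of the counts in each GC bin.
stats = {}
for gc_bin, depths in depths_by_gc.items():
    mean = sum(depths) / len(depths)
    sd = sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    stats[gc_bin] = (mean, sd)

# Pass 2: convert each window's count into a Z-score for its GC bin.
with open(infile) as f:
    for line in f:
        if line.startswith("#"):
            continue
        chrom, start, end, count, gc = line.rstrip("\n").split("\t")[:5]
        mean, sd = stats[round(float(gc), 2)]
        z = (float(count) - mean) / sd if sd > 0 else 0.0
        print("%s\t%s\t%s\t%s\t%s\t%.3f" % (chrom, start, end, count, gc, z))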
Here is a little script I wrote a few months ago for this. It uses the pybedtools package to combine Python with the bedtools concepts outlined above. Hopefully this will give you a sense of how to carry out the above strategy, or how to modify it to your specific needs.
https://github.com/arq5x/codachrom/blob/master/windowed_coverage.py
This looks quite slick and useful, thanks! However, I can't get the 'bedtools intersect' to work, because it does not accept a bam file as the second file (-b parameter). I am using bedtools version 2.17.0. Also, any ideas how to get normalized read counts instead of Z scores?
Sorry, you must use version 2.18.2 or later. Preferably 2.19, which was just released the other day. Will edit the answer.
I piped the output of bedtools bam2bed into the intersect instead, and it works.
Is it possible that bedtools nuc counts GC content from the FASTA, not from the BAM file? Do you have any idea how to calculate GC content from a BAM? Thank you for any advice.
Thank you for the commands, but I cannot run this one. I get the error below. I'm a bioinformatics student, so please advise.
Can I compute mappability from the BAM file in the same way as GC content?
I am wondering how I can remove GC-content bias with LOESS regression?