Question

Calculating GC content from BAM file by BED file

3

Entering edit mode

10.0 years ago

Paul ★ 1.5k

Dear all,

I have bedg file created from bedtools -

chr1 1  10 num_of_reads

chr1 10 20 num_of_reads

chr1 20 30 num_of_reads

.

.

.

3th column is information about overlapping reads at this region. I would like to add to 4th column information about GC content in reads from my BAM file. Do you have any idea how to do that? Thank you for any help.

So result should be:

ch start stop num_of_reads cg_content

cg content bam bed bedtools • 13k views

ADD COMMENT • link updated 2.6 years ago by Harry ▴ 10 • written 10.0 years ago by Paul ★ 1.5k

1

Entering edit mode

It is important consider that such measurements will be biased by coverage

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Aaronquinlan 12k

0

Entering edit mode

Should the cg_content column be expressed as a percentage of bases which are G or C?

ADD REPLY • link 10.0 years ago by Dan D 7.4k

0

Entering edit mode

better would be just a number - cg content - I now how to calculate separately from bam -

awk '{ n=length($10); print $10, gsub(/[GCCgcs]/,"",$10)/n;}' your.sam

but I need to calculate according BED file.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Paul ★ 1.5k

4

Entering edit mode

10.0 years ago

Aaronquinlan 12k

You could combine the bedtools map tool with awk to do this. Specifically, the example below uses map to count the number of overlapping reads for each interval (-c 10 -o count) and concatenate all of the nucleotide sequences in each overlapping BAM alignment (-c 10 -o collapse). Thus the fourth column of the output is the count of overlapping reads and the fifth column is the concatenated sequence from each read. Awk is then used to compute the GC fraction from this concatenated sequence.

➜  cat test.bed
20    0    1000000
20    1000000    2000000
20    2000000    3000000
➜  bedtools map -a test.bed \
                 -b <(curl -s ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00096/alignment/HG00096.chrom20.ILLUMINA.bwa.GBR.low_coverage.20120522.bam) \
                 -c 10,10 \
                 -o count,concat \
| awk -v OFS="\t" '{n=length($5); gc=gsub("[gcGC]", "", $5); print $1,$2,$3,$4,gc/n}'

20    0    1000000    47199    0.45724
20    1000000    2000000    47929    0.445155
20    2000000    3000000    49902    0.445785

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Aaronquinlan 12k

0

Entering edit mode

Wau respect - it works - very elegant solution!!!

ADD REPLY • link 10.0 years ago by Paul ★ 1.5k

0

Entering edit mode

I tried it and there is some problem with bedtools, i received and right information for chr 1-9 but for chr 10-21 I received only zeros at all. Do you know were could be problem. Segmentation on 60kb.

ADD REPLY • link 10.0 years ago by Korsocius ▴ 260

1

Entering edit mode

In order for the map tool to work correctly, all files must use the same chromosome sorting order. Based on what you describe, it sounds like the order of chromosomes in your BED file differes from the BAM file. Our next release will be able to detect these changes automagically.

ADD REPLY • link 10.0 years ago by Aaronquinlan 12k

0

Entering edit mode

It is ok now. There was a bug in older version...

ADD REPLY • link 9.9 years ago by Korsocius ▴ 260

0

Entering edit mode

I am really amazed by the simplicity and awesomeness of this solution.

Just one change I would like to make it to work would be

| awk -v OFS="\t" '{n=length($6); gc=gsub("[gcGC]", "", $6); print $1,$2,$3,$4,gc/n}'

As $5 is the count of the number of reads that overlapped with the bed region

ADD REPLY • link 5.3 years ago by Gjain 5.8k

0

Entering edit mode

This doesn't seem to be the case.

Below is the output of the first portion of the one-liner before the AWK command.

chrX    301493  301678  2      ACGGCATACGCGATAACGCTATGTGTCTGGAGTTAAGAG
chrX    305864  306261  2      TGACTCGGCCTTAGCCTCCCGAGTAGCTGGGACTACAGA

Grabbing field #5 would be the correct move here. There isn't a field 6 at all.

ADD REPLY • link 2.6 years ago by Harry ▴ 10

2

Entering edit mode

10.0 years ago

Alex Reynolds 36k

Building off my previous answer (How To Calculate A Region‘ Gc Content In Reference Genome？？), you could first convert the BAM reads into FASTA with samtools and awk, then pipe that result to a second awk script that converts FASTA to a BED-like result that includes columns with per-FASTA element GC content:

$ samtools view reads.bam \
    | awk ' \
    BEGIN { \
        OFS="\t"; \
    } \
    { \
        print ">"$3"-"($4-1)"-"($4+$9)"\n"$10 \
    }' - \
    | awk ' \
    BEGIN { \
        FS=""; \
        cg=0; \
        t=0; \
    } \
    { \
        if ($1 != ">") { \
            for (i = 1; I <= NF; i++) { \
                t++; \
                if ($i ~ /[CGcg]/) { \
                    cg++;
                } \
            } \
        } \
        else { \
            if (t > 0) { \
                print e"\t"cg"\t"t"\t"(cg/t); \
                cg = 0; \
                t = 0; \
            } \
            h = substr($0, 2); \
            split(h, a, "-"); \
            e = a[1]"\t"a[2]"\t"a[3]; \
        } \
    } \
    END { \
        print e"\t"cg"\t"t"\t"(cg/t); \
    }' - \
    > answer.bed

You'll need to modify this (untested) pipeline to get the exact columns you want, but I think this should get you about 95% of the way there.

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Alex Reynolds 36k

0

Entering edit mode

Thank you Alex for comment and nice code, thats work perfectly fine, but output is in coordinate from my bam2fasta - I need calculate cg content in coordinate mention above. Probably I need to do bed intersection...

ADD REPLY • link 10.0 years ago by Paul ★ 1.5k

0

Entering edit mode

It looks like you modified your question since I answered it. However, you can take the result from this pipeline and do a bedmap --mean operation against a BED file of pre-made windows (say, 10 bases apart), to quickly and efficiently calculate the average GC content over each window.

ADD REPLY • link 10.0 years ago by Alex Reynolds 36k

1

Entering edit mode

10.0 years ago

EagleEye 7.6k

After converting your BAM file into FASTA. This gives the GC percentage and CpG density of each regions in the bed file providing the converted FASTA file as genome input to this tool, http://homer.salk.edu/homer/ngs/annotation.html Note: Use option -CpG.

ADD COMMENT • link 10.0 years ago by EagleEye 7.6k

Ram · Accepted Answer · 2014-12-15

3

Entering edit mode

10.0 years ago

Dan D 7.4k

Here's a python solution. You need to have the pysam library installed (pip install pysam):

import pysam
from sys import argv

bam_file = argv[1]
bed_file = argv[2]

bam = pysam.AlignmentFile(bam_file, 'rb')
with open(bed_file) as bf:
    for line in bf:
        line_parts = line.strip().split()
        chr = line_parts[0]
        start = int(line_parts[1])
        end = int(line_parts[2])
        read_data = bam.fetch(chr, start, end)
        total_bases = 0
        gc_bases = 0
        for read in read_data:
            seq = read.query_sequence
            total_bases += len(seq)
            gc_bases += len([x for x in seq if x == 'C' or x == 'G'])
        if total_bases == 0:
            gc_percent = 'No Reads'
        else:
            gc_percent = '{0:.2f}%'.format(float(gc_bases)/total_bases * 100)
        print '{0}\t{1}'.format(line.strip(), gc_percent)

I didn't understand your comment to my question. If you want to print the total number of GC bases in the reads of a coordinate set, just change the last line to:

print '{0}\t{1}'.format(line.strip(), gc_bases)

You run the program like this:

python [script_name] [bam_file] [bed_file]

It prints the modified BED data to STDOUT.

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Dan D 7.4k

0

Entering edit mode

Dear Dan, this is looks very nice, but I have error:

AttributeError: 'module' object has no attribute 'AlignmentFile'

Do you have any idea? Thank you for response..

ADD REPLY • link 10.0 years ago by Paul ★ 1.5k

0

Entering edit mode

You need to install the pysam library. You can download it from github and install it manually, or if you have the Python package manager `pip` installed, you can type `pip install pysam` from the command line to install it.

ADD REPLY • link 10.0 years ago by Dan D 7.4k

0

Entering edit mode

Thank you for help - it is installed. But when I try to run I have following error:

Traceback (most recent call last):
  File "cg_content.py", line 10, in <module>
    bam = pysam.AlignmentFile(bam_file.bam, 'rb')
AttributeError: 'module' object has no attribute 'AlignmentFile'

Do you know where could be a problem?

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Paul ★ 1.5k

0

Entering edit mode

That's strange. Are you sure it's properly installed? How did you install it? The AlignmentFile() method is central to the pysam library. If it's not available it implies that the installation of the library is somehow incomplete or incorrect.

ADD REPLY • link 10.0 years ago by Dan D 7.4k

0

Entering edit mode

Dear Dan, thank you so much - I did not installed pyrex and Sphinx - right now it looks it work perfectly fine. May I ask - in some intervals there is no reads = 0 - I have at the end error bug:

gc_percent = float(gc_bases)/total_bases * 100
ZeroDivisionError: float division by zero

Is there any solution how to avoid in python to divide by zero? Very very thank you for your time and help!!!

ADD REPLY • link 10.0 years ago by Paul ★ 1.5k

0

Entering edit mode

Oops, hahaha. That was clumsy of me. Check my updated answer for the fix!

ADD REPLY • link 10.0 years ago by Dan D 7.4k

0

Entering edit mode

Dan, thank you SOO MUCH - it is now perfect!

ADD REPLY • link 10.0 years ago by Paul ★ 1.5k

0

Entering edit mode

Make sure you are using the most current version of pysam

ADD REPLY • link 9.8 years ago by lkmklsmn ▴ 980

0

Entering edit mode

Just wanted to add a link to this website, which can help debugging AlignmentFile() issues:

ADD REPLY • link 8.8 years ago by dvanic ▴ 250

0

Entering edit mode

I am wondering why the GC content is calculated here for a BAM file and not for example from reference genome. Does this imply that if one has 100 samples, she has to calculate the GC content for each sample separately?

ADD REPLY • link 7.7 years ago by Dataman ▴ 380