Question

How To Calculate A Region‘ Gc Content In Reference Genome？？

1

Entering edit mode

12.0 years ago

siyu ▴ 150

how to calculate a region‘ GC content in reference genome？？thanks !! So I only have the region.

gc • 12k views

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 12.0 years ago by siyu ▴ 150

1

Entering edit mode

Do you have the sequence data already, or do you need that as well? What have you tried so far?

ADD REPLY • link 12.0 years ago by Alex Reynolds 36k

score 8 · Answer 1 · 2013-04-27

If you have your sequences in a FASTA file, you can use awk to calculate GC content. The following will calculate GC content over all input sequences, and ignore ambiguous bases (like N):

$ awk ' \
BEGIN { \
    FS=""; \
    cg=0; \
    t=0; \
} \
{ \
    if ($1 != ">") { \
        for (i = 1; i <= NF; i++) { \
            if ($i ~ /[ACTGactg]/) { \
                t++;
            } \
            if ($i ~ /[CGcg]/) { \
                cg++;
            } \
        } \
    } \
} \
END { \
    print cg"\t"t"\t"(cg/t); \
}' foo.fa

For example:

$ more foo.fa
>ctcf
CCGCGNGGNGGCAG

$ awk '...' foo.fa
11  12  0.916667

If you need GC content per-FASTA element, the END block's code can be copied into the main loop:

$ awk ' \
BEGIN { \
    FS=""; \
    cg=0; \
    t=0; \
} \
{ \
    if ($1 != ">") { \
        for (i = 1; i <= NF; i++) { \
            if ($i ~ /[ACTGactg]/) { \
                t++;
            } \
            if ($i ~ /[CGcg]/) { \
                cg++;
            } \
        } \
    } \
    else { \
        if (t > 0) { \
            print h"\t"cg"\t"t"\t"(cg/t); \
            cg = 0; \
            t = 0; \
        } \
        h = substr($0,2); \
    } \
} \
END { \
    print h"\t"cg"\t"t"\t"(cg/t); \
}' foo.fa

The output from this includes each element's header:

$ awk '...' foo.fa
ctcf    11  12  0.916667

To get a FASTA file, if your organism-of-interest has reference data on the UCSC Genome Browser, you can obtain sequence data for your regions-of-interest through the use of a custom track.

First, create a BED file containing your regions-of-interest and upload that file to a custom track via My Data > Custom Tracks. Then ask the browser to return the sequence data associated with these custom regions:

Select Tools > Table Browser
Select Custom Tracks from the group pull-down menu
Select your custom track from the table menu
Set the region radio button to genome
Set the output format to sequence
Click get output
Set any post-processing options and click get sequence

Save the resulting FASTA file to a place where you can find it. Then run it through one of the two awk scripts above.

Ram · Answer 2 · 2014-07-22

0

Entering edit mode

10.8 years ago

bioee • 0

determine the slide window size
calculate the gc content for each window size

Please refer to http://www.biologicscorp.com/tools/GCContent which plots using google chart and it is easy to know how to calculate.

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by bioee • 0