how to calculate a region‘ GC content in reference genome??thanks !! So I only have the region.
how to calculate a region‘ GC content in reference genome??thanks !! So I only have the region.
If you have your sequences in a FASTA file, you can use awk
to calculate GC content. The following will calculate GC content over all input sequences, and ignore ambiguous bases (like N
):
$ awk ' \
BEGIN { \
FS=""; \
cg=0; \
t=0; \
} \
{ \
if ($1 != ">") { \
for (i = 1; i <= NF; i++) { \
if ($i ~ /[ACTGactg]/) { \
t++;
} \
if ($i ~ /[CGcg]/) { \
cg++;
} \
} \
} \
} \
END { \
print cg"\t"t"\t"(cg/t); \
}' foo.fa
For example:
$ more foo.fa
>ctcf
CCGCGNGGNGGCAG
$ awk '...' foo.fa
11 12 0.916667
If you need GC content per-FASTA element, the END
block's code can be copied into the main loop:
$ awk ' \
BEGIN { \
FS=""; \
cg=0; \
t=0; \
} \
{ \
if ($1 != ">") { \
for (i = 1; i <= NF; i++) { \
if ($i ~ /[ACTGactg]/) { \
t++;
} \
if ($i ~ /[CGcg]/) { \
cg++;
} \
} \
} \
else { \
if (t > 0) { \
print h"\t"cg"\t"t"\t"(cg/t); \
cg = 0; \
t = 0; \
} \
h = substr($0,2); \
} \
} \
END { \
print h"\t"cg"\t"t"\t"(cg/t); \
}' foo.fa
The output from this includes each element's header:
$ awk '...' foo.fa
ctcf 11 12 0.916667
To get a FASTA file, if your organism-of-interest has reference data on the UCSC Genome Browser, you can obtain sequence data for your regions-of-interest through the use of a custom track.
First, create a BED file containing your regions-of-interest and upload that file to a custom track via My Data > Custom Tracks
. Then ask the browser to return the sequence data associated with these custom regions:
Tools > Table Browser
Custom Tracks
from the group
pull-down menutable
menuregion
radio button to genome
output format
to sequence
get output
get sequence
Save the resulting FASTA file to a place where you can find it. Then run it through one of the two awk
scripts above.
Please refer to http://www.biologicscorp.com/tools/GCContent which plots using google chart and it is easy to know how to calculate.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Do you have the sequence data already, or do you need that as well? What have you tried so far?