Does anyone know what the GC content of the different human chromosomes is?
**EDIT**
OK, so I felt bad about not actually answering your question, so here you go (generated by the method outlined below):
#Sequence GC content
chr1 0.43
chr2 0.40
chr3 0.40
chr4 0.38
chr5 0.40
chr6 0.40
chr7 0.41
chr8 0.40
chr9 0.43
chr10 0.42
chr11 0.42
chr12 0.41
chr13 0.40
chr14 0.43
chr15 0.44
chr16 0.45
chr17 0.46
chr18 0.40
chr19 0.48
chr20 0.44
chr21 0.43
chr22 0.49
chrX 0.40
chrY 0.46
chrM 0.44
**EDIT ENDS**
The GC content of human chromosomal DNA is very heterogeneous, which makes chromosome-wide statistics relatively meaningless. It has been shown that the human genome is a mosaic of GC-rich and GC-poor regions of around 300 kb in length, called isochores.
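To make the point about heterogeneity concrete, here is a minimal Python sketch that computes GC content in consecutive, non-overlapping windows. It is only an illustration on a made-up toy sequence: the function name `windowed_gc`, the window size, and the choice to ignore non-ACGT characters are my own assumptions, and it is not a description of how any particular tool computes its windows.

```python
def windowed_gc(seq, window):
    """Yield (start, gc_fraction) for consecutive non-overlapping windows.

    The fraction is G+C over A+C+G+T within the window (case-insensitive).
    It is None when the window contains no A/C/G/T at all (e.g. all N).
    """
    seq = seq.upper()
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        acgt = gc + chunk.count("A") + chunk.count("T")
        yield start, (gc / acgt if acgt else None)


# Made-up toy sequence: a GC-rich block, an AT-rich block, a gap, a mixed block.
toy = "GGGCGCGCCC" + "ATATTTAAAT" + "NNNNNNNNNN" + "ACGTACGTAC"

profile = list(windowed_gc(toy, 10))
for start, fraction in profile:
    print(start, fraction)

# Overall GC of the same sequence, again ignoring N.
upper = toy.upper()
overall_gc = (upper.count("G") + upper.count("C")) / sum(upper.count(b) for b in "ACGT")
print("overall", overall_gc)
```

On this toy input the overall value is 0.5 even though the individual windows range from 0.0 to 1.0, which is the sense in which a single chromosome-wide number hides the structure.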
You can plot these regions of varying content using the EMBOSS program `isochore`. For example, for chromosome 1:
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz
gunzip chr1.fa.gz
isochore -sequence chr1.fa -outfile chr1.isochore -graph png
Gives the following result:
You could also get the sequences of the individual chromosomes and work out their overall GC content, also using EMBOSS, this time with `geecee`:
geecee -sequence chr1.fa
Gives us an answer of 43% for Chromosome 1.
Hmm, yes, agreed: the treatment of 'N' will affect the results considerably. From the geecee documentation (http://emboss.sourceforge.net/apps/cvs/emboss/apps/geecee.html): 'It sums the number of G and C bases in the input sequence(s) and writes the result to file as the fraction (in the interval 0.0 to 1.0) of the length of the whole sequence.'
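A small Python sketch of why the treatment of 'N' matters: the same sequence gives different GC fractions depending on whether the denominator is the whole sequence length (as in the documentation quote above) or only the unambiguous A/C/G/T bases. The function names and the toy sequence are made up for illustration; this does not reproduce any tool's actual code.

```python
def gc_over_length(seq):
    """G+C divided by the full sequence length, N included.

    Assumes a non-empty sequence.
    """
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_over_acgt(seq):
    """G+C divided by the number of A/C/G/T bases only, N excluded.

    Assumes the sequence contains at least one A/C/G/T base.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt


# Toy sequence: 8 real bases (4 of them G/C) followed by 8 ambiguous bases.
toy = "GCGCATAT" + "N" * 8

print(gc_over_length(toy))  # 4 / 16 = 0.25
print(gc_over_acgt(toy))    # 4 / 8  = 0.5
```

Counting N in the denominator can only lower the fraction, so the more N a sequence contains, the further apart the two numbers drift.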
GRCh37/hg19/b37:
1 0.417439
2 0.402438
3 0.396943
4 0.382479
5 0.395163
6 0.396109
7 0.407513
8 0.401757
9 0.413168
10 0.415849
11 0.415657
12 0.40812
13 0.385265
14 0.408872
15 0.42201
16 0.447894
17 0.455405
18 0.39785
19 0.483603
20 0.441257
21 0.408325
22 0.479881
X 0.394963
Y 0.391288
MT 0.443626
Done by:
seqtk comp hs37m.fa.gz | awk '/^[0-9MXY]/{x=$4+$5;y=x+$3+$6;print $1"\t"x/y}'
ChrY has lots of ambiguous bases and that is why my result differs most on chrY in comparison to the EMBOSS result. EMBOSS is wrong, IMHO.
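For anyone without seqtk at hand, the same calculation, (C+G) / (A+C+G+T) per sequence, can be sketched in plain Python. This is a rough equivalent under my own assumptions, not a port of seqtk: it counts bases case-insensitively, uses the first word of each FASTA header as the sequence name, and the helper name `gc_per_sequence` is made up. It will also be much slower than seqtk on whole chromosomes.

```python
import io
from collections import Counter


def gc_per_sequence(handle):
    """Yield (name, gc_fraction) for each record in a FASTA file handle.

    gc_fraction is (C+G) / (A+C+G+T); N and other ambiguity codes are
    ignored. It is None for a record with no A/C/G/T at all.
    """
    def fraction(counts):
        gc = counts["G"] + counts["C"]
        acgt = gc + counts["A"] + counts["T"]
        return gc / acgt if acgt else None

    name = None
    counts = Counter()
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, fraction(counts)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            counts = Counter()
        elif name is not None:
            counts.update(line.upper())
    if name is not None:
        yield name, fraction(counts)


# Made-up two-record FASTA standing in for a real genome file.
demo = io.StringIO(">1 toy record\nGGCC\nAATT\n>Y toy record\nGCGNNNNN\natnn\n")

results = list(gc_per_sequence(demo))
for name, gc in results:
    print(f"{name}\t{gc}")
```

To run it on a real gzipped FASTA, the handle can come from `gzip.open(path, "rt")` instead of the `io.StringIO` demo.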
I was about to tell you, but then someone crashed the server. Here is how (using Biopieces):
read_fasta -i /home/DATA/downloads/Homo_sapiens/human_hg19.fasta.gz | analyze_gc | write_tab -ck SEQ_NAME,GC% -x
#SEQ_NAME GC%
gi|89161184|ref|AC_000044.1| Homo sapiens chromosome 1, alternate assembly Celera, whole genome shotgun sequence 40.77
Using bedtools nuc on hg19:
1 chr1 0.377295
2 chr2 0.394172
3 chr3 0.390478
4 chr4 0.375491
5 chr5 0.388130
6 chr6 0.387498
7 chr7 0.397821
8 chrX 0.384356
9 chr8 0.392218
10 chr9 0.351521
11 chr10 0.402901
12 chr11 0.403720
13 chr12 0.397843
14 chr13 0.319767
15 chr14 0.336276
16 chr15 0.336248
17 chr16 0.391037
18 chr17 0.436335
19 chr18 0.380423
20 chr20 0.416613
21 chrY 0.172677
22 chr19 0.456450
23 chr22 0.326388
24 chr21 0.297838
Funny that there are three different GC% answers for Chr1 ...
Depends on the genome build and version that you use - it's perfectly 'legit', as they say in Cockney London slang.
The truth of the matter is that we do not have an honest representation of the true GC content because the reference genome builds exclude / mask telomeric and centromeric regions, where GC content is high.
Thus, all values represented in this thread are based on the genome builds and are not reflective of the actual GC content, which would be larger and which would differ from individual to individual.