Hi, I have an assembly I created using bwa-mem. I want to determine the GC percentage for each chromosome itself, not the individual reads. For example, Chr1 has a GC percentage of ___. What is the best way to go about doing this?
Thank you!
Hi, I have an assembly I created using bwa-mem. I want to determine the GC percentage for each chromosome itself, not the individual reads. For example, Chr1 has a GC percentage of ___. What is the best way to go about doing this?
Thank you!
The main one I use is stats.sh from the bbmap package (conda installable).
GC is 0.428 = 43% in this example. It's very quick.
stats.sh MA_organelles1.fasta
A C G T N IUPAC Other GC GC_stdev
0.2854 0.2148 0.2137 0.2861 0.0000 0.0000 0.0000 0.4285 0.0273
Main genome scaffold total: 2
Main genome contig total: 2
Main genome scaffold sequence total: 0.710 MB
Main genome contig sequence total: 0.710 MB 0.000% gap
Main genome scaffold N/L50: 1/569.63 KB
Main genome contig N/L50: 1/569.63 KB
Main genome scaffold N/L90: 2/140.384 KB
Main genome contig N/L90: 2/140.384 KB
Max scaffold length: 569.63 KB
Max contig length: 569.63 KB
Number of scaffolds > 50 KB: 2
% main genome in scaffolds > 50 KB: 100.00%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 2 2 710,014 710,014 100.00%
100 KB 2 2 710,014 710,014 100.00%
250 KB 1 1 569,630 569,630 100.00%
500 KB 1 1 569,630 569,630 100.00%
There is also for finding divergent GC regions seqtk gc
seqtk gc
Usage: seqtk gc [options] <in.fa>
Options:
-w identify high-AT regions
-f FLOAT min GC fraction (or AT fraction for -w) [0.60]
-l INT min region length to output [20]
-x FLOAT X-dropoff [10.0]
If you're comfortable using Python, I've created a script that calculates the GC content and GC-skew for each contig, scaffold, or chromosome in a fasta file. This is specifically designed for generating data for a circos plot.
To use the script, make sure you have Biopython installed in your conda environment. Once that's set up, here's how the script functions:
Run the script in your terminal using the command:
python3 gc_content_skew-optimized.py -f file.fasta
You can adjust the window size (-w
) and step size (-s
) according to your preferences. the default was set to 1000bp
#!/usr/bin/python
## this script is to generate GC content
import sys
from Bio import SeqIO
from argparse import ArgumentParser
def gc_content_window(s):
gc = sum(s.count(x) for x in ['G', 'C', 'g', 'c', 'S', 's'])
return round(gc / len(s), 4)
def gc_skew_window(s):
g, c = s.lower().count('g'), s.lower().count('c')
return round((g - c) / (g + c), 4) if g + c > 0 else 0
def main():
parser = ArgumentParser(description="Calculate GC content and GC skew of input Fasta sequence")
parser.add_argument("-f", "--file", dest="filename", help="Input Fasta format file", metavar="FASTA", required=True)
parser.add_argument("-w", "--window", dest="WindowSize", help="WindowSize to calculate (default: 1000)", default=1000, type=int)
parser.add_argument("-s", "--step", dest="StepSize", help="StepSize for slide widows (default: 1000)", default=1000, type=int)
args = parser.parse_args()
with open("gc_content.tsv", "w") as gc_content_file, open("gc_skew.tsv", "w") as gc_skew_file:
for record in SeqIO.parse(args.filename, 'fasta'):
seq = record.seq
for i in range(0, len(seq), args.StepSize):
subseq = seq[i:i + args.WindowSize]
start, end = i + 1, min(i + args.StepSize, len(seq))
gc_content_file.write(f"{record.id}\t{start}\t{end}\t{gc_content_window(subseq)}\n")
gc_skew_file.write(f"{record.id}\t{start}\t{end}\t{gc_skew_window(subseq)}\n")
if __name__ == '__main__':
main()
Hopefully, this helps
How calculate CG content for each contig in an transcript assembly?
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
What does this do? It's not documented well and does not work for me in terms of, it produces no output:
Try tripling your example seq length, see 20bp min region
To answer your question, it picks out divergent GC regions over default 60% or fraction 0.60 with a mi length of 20bp, and produces a bed file of those regions.
So actually not what the question was I'm afraid.