Calculate GC content for entire chromosome
3
0
Entering edit mode
12 months ago

Hi, I have an assembly I created using bwa-mem. I want to determine the GC percentage for each chromosome itself, not the individual reads. For example, Chr1 has a GC percentage of ___. What is the best way to go about doing this?

Thank you!

bam GC assembly bwa-mem • 1.3k views
ADD COMMENT
3
Entering edit mode
12 months ago

The main one I use is stats.sh from the bbmap package (conda installable).

GC is 0.428 = 43% in this example. It's very quick.

stats.sh MA_organelles1.fasta

A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.2854  0.2148  0.2137  0.2861  0.0000  0.0000  0.0000  0.4285  0.0273

Main genome scaffold total:             2
Main genome contig total:               2
Main genome scaffold sequence total:    0.710 MB
Main genome contig sequence total:      0.710 MB        0.000% gap
Main genome scaffold N/L50:             1/569.63 KB
Main genome contig N/L50:               1/569.63 KB
Main genome scaffold N/L90:             2/140.384 KB
Main genome contig N/L90:               2/140.384 KB
Max scaffold length:                    569.63 KB
Max contig length:                      569.63 KB
Number of scaffolds > 50 KB:            2
% main genome in scaffolds > 50 KB:     100.00%


Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig  
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                      2               2         710,014         710,014   100.00%
 100 KB                      2               2         710,014         710,014   100.00%
 250 KB                      1               1         569,630         569,630   100.00%
 500 KB                      1               1         569,630         569,630   100.00%

There is also for finding divergent GC regions seqtk gc

https://github.com/lh3/seqtk

    seqtk gc

Usage: seqtk gc [options] <in.fa>
Options:
  -w         identify high-AT regions
  -f FLOAT   min GC fraction (or AT fraction for -w) [0.60]
  -l INT     min region length to output [20]
  -x FLOAT   X-dropoff [10.0]
ADD COMMENT
0
Entering edit mode

What does this do? It's not documented well and does not work for me in terms of, it produces no output:

cat foo.fa
>chr1
GCGCGCGCGATTATTATATATA

seqtk gc foo.fa
(empty...)
ADD REPLY
1
Entering edit mode

Try tripling your example seq length, see 20bp min region

  -l INT     min region length to output [20]

To answer your question, it picks out divergent GC regions over default 60% or fraction 0.60 with a mi length of 20bp, and produces a bed file of those regions.

So actually not what the question was I'm afraid.

ADD REPLY
1
Entering edit mode
12 months ago
ben@f ▴ 20

If you're comfortable using Python, I've created a script that calculates the GC content and GC-skew for each contig, scaffold, or chromosome in a fasta file. This is specifically designed for generating data for a circos plot.

To use the script, make sure you have Biopython installed in your conda environment. Once that's set up, here's how the script functions:

Run the script in your terminal using the command:

python3 gc_content_skew-optimized.py -f file.fasta

You can adjust the window size (-w) and step size (-s) according to your preferences. the default was set to 1000bp

#!/usr/bin/python

## this script is to generate GC content
import sys
from Bio import SeqIO
from argparse import ArgumentParser

def gc_content_window(s):
    gc = sum(s.count(x) for x in ['G', 'C', 'g', 'c', 'S', 's'])
    return round(gc / len(s), 4)

def gc_skew_window(s):
    g, c = s.lower().count('g'), s.lower().count('c')
    return round((g - c) / (g + c), 4) if g + c > 0 else 0

def main():
    parser = ArgumentParser(description="Calculate GC content and GC skew of input Fasta sequence")
    parser.add_argument("-f", "--file", dest="filename", help="Input Fasta format file", metavar="FASTA", required=True)
    parser.add_argument("-w", "--window", dest="WindowSize", help="WindowSize to calculate (default: 1000)", default=1000, type=int)
    parser.add_argument("-s", "--step", dest="StepSize", help="StepSize for slide widows (default: 1000)", default=1000, type=int)
    args = parser.parse_args()

    with open("gc_content.tsv", "w") as gc_content_file, open("gc_skew.tsv", "w") as gc_skew_file:
        for record in SeqIO.parse(args.filename, 'fasta'):
            seq = record.seq
            for i in range(0, len(seq), args.StepSize):
                subseq = seq[i:i + args.WindowSize]
                start, end = i + 1, min(i + args.StepSize, len(seq))
                gc_content_file.write(f"{record.id}\t{start}\t{end}\t{gc_content_window(subseq)}\n")
                gc_skew_file.write(f"{record.id}\t{start}\t{end}\t{gc_skew_window(subseq)}\n")

if __name__ == '__main__':
    main()

Hopefully, this helps

ADD COMMENT
1

Login before adding your answer.

Traffic: 1597 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6