Question

How can I get GC content information with the R package rentrez?

3.0 years ago

anran04100 • 0

I now have the taxid, organism name, accession number, and assembly level, as follows:

'Hahella chejuensis' 'GCF_000012985.1' 'representative genome' 'Complete Genome'

How can I retrieve the GC content of a genome using rentrez (for example with the entrez_search() function, or another one)?

Thanks

genome entrez_search GC content R rentreze • 1.0k views

updated 3.0 years ago by Mensur Dlakic ★ 29k • written 3.0 years ago by anran04100 • 0

It appears that this information is not available via Entrez Direct/rentrez.

It seems to be there in the assembly stats report for the genome, e.g. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/012/985/GCF_000012985.1_ASM1298v1/GCF_000012985.1_ASM1298v1_assembly_stats.txt If you only have a few genomes, you could look them up this way.
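If there are more than a handful of genomes, the report URL can be put together programmatically. The sketch below is only inferred from the single example URL above: it assumes the FTP layout splits the nine accession digits into three-digit folders, and that you already know the assembly name (ASM1298v1 here), which is not among the fields listed in the question. The function name `assembly_stats_url` is made up for illustration.

```python
def assembly_stats_url(accession: str, assembly_name: str) -> str:
    """Build the URL of an assembly's *_assembly_stats.txt report.

    Assumption: the directory layout follows the example URL in this thread
    (prefix, then the accession digits in groups of three).
    """
    prefix, rest = accession.split("_", 1)   # "GCF", "000012985.1"
    digits = rest.split(".")[0]              # "000012985" (version dropped)
    chunks = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    folder = f"{accession}_{assembly_name}"
    return (
        "https://ftp.ncbi.nlm.nih.gov/genomes/all/"
        f"{prefix}/{'/'.join(chunks)}/{folder}/{folder}_assembly_stats.txt"
    )


# Reproduces the URL quoted above for the Hahella chejuensis assembly
print(assembly_stats_url("GCF_000012985.1", "ASM1298v1"))
```

The resulting URL can then be opened in a browser or downloaded with any HTTP client to check what the report contains.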

ADD REPLY • link 3.0 years ago by GenoMax 149k

score 1 · Answer 1 · 2022-03-15

Here is a little Python script that will calculate a combined G/C content for all sequences in a single FASTA file:

import sys
from Bio import SeqIO

# Running totals over every record in the FASTA file
gc_total = 0
seq_total = 0

# Plain 'r' mode: the old 'rU' flag was removed in Python 3.11
with open(sys.argv[1], 'r') as FastaFile:
    for rec in SeqIO.parse(FastaFile, 'fasta'):
        seq = str(rec.seq).upper()
        # Count G, C and the ambiguity code S (G or C), as the older Bio.SeqUtils.GC helper did
        gc_total += seq.count('G') + seq.count('C') + seq.count('S')
        seq_total += len(seq)

# Length-weighted G/C percentage across all sequences
print(' G/C-content = %.2f' % (100.0 * gc_total / seq_total))

I suggest you save it as fasta_gc_mean.py and run it like this:

python fasta_gc_mean.py GCF_000012985.1_ASM1298v1_genomic.fna

For the genome file listed above it gives the following output:

 G/C-content = 53.87
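A note on why the script reports a combined value instead of averaging per-sequence percentages: when sequences differ in length (a chromosome plus plasmids, say), the two numbers can be far apart. The toy example below uses only the standard library and made-up sequences; the helper names `gc_percent` and `combined_gc_percent` are illustrative and not part of Biopython.

```python
def gc_percent(seq: str) -> float:
    """G/C percentage of a single sequence (case-insensitive)."""
    s = seq.upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s)


def combined_gc_percent(seqs) -> float:
    """Length-weighted G/C percentage over several sequences."""
    total_gc = sum(s.upper().count("G") + s.upper().count("C") for s in seqs)
    total_len = sum(len(s) for s in seqs)
    return 100.0 * total_gc / total_len


# Toy data: 4 bp at 100% G/C and 16 bp at 0% G/C
seqs = ["GGGG", "ATATATATATATATAT"]
per_seq = [gc_percent(s) for s in seqs]

print(sum(per_seq) / len(per_seq))  # unweighted mean of percentages: 50.0
print(combined_gc_percent(seqs))    # length-weighted value: 20.0
```

The length-weighted figure is the one that corresponds to the G/C content of the genome as a whole.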