How can I get GC content information with the R package rentrez?
So far I have the taxid, organism name, accession number, and assembly level, as follows:
'Hahella chejuensis' 'GCF_000012985.1' 'representative genome' 'Complete Genome'
How can I retrieve the GC content of a genome using rentrez (with entrez_search() or another function)?
Thanks
genome
entrez_search
GC
content
R
rentrez
Here is a little Python script that calculates the combined G/C content of all sequences in a single FASTA file:
    import sys
    from Bio import SeqIO

    # G and C are counted directly, so the script does not depend on
    # Bio.SeqUtils.GC, which newer Biopython releases no longer provide.
    gc_count = 0   # total number of G and C bases over all records
    seq_total = 0  # total sequence length over all records

    # Plain text mode: the old 'rU' mode is rejected by current Python 3.
    with open(sys.argv[1]) as fasta_file:
        for rec in SeqIO.parse(fasta_file, 'fasta'):
            seq = str(rec.seq).upper()
            gc_count += seq.count('G') + seq.count('C')
            seq_total += len(seq)

    # Length-weighted G/C percentage across the whole file
    print('G/C-content = %.2f' % (100.0 * gc_count / seq_total))
I suggest you save it as fasta_gc_mean.py and run it like this:
    python fasta_gc_mean.py GCF_000012985.1_ASM1298v1_genomic.fna
For the genome file listed above it gives the following output:
G/C-content = 53.87
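To make clear what "combined" means here, below is a minimal sketch on toy sequences, with no Biopython needed. The function name combined_gc_percent and the example sequences are made up for illustration. The point is that every base counts equally, so long sequences weigh more than short ones.

```python
def combined_gc_percent(seqs):
    """Length-weighted G/C percentage over a collection of sequences.

    Hypothetical helper for illustration: counts G and C in all
    sequences and divides by the total number of bases.
    """
    gc_count = sum(s.upper().count("G") + s.upper().count("C") for s in seqs)
    total = sum(len(s) for s in seqs)
    if total == 0:
        raise ValueError("no sequence data")
    return 100.0 * gc_count / total


# Toy data: a 4-base sequence that is all G/C and an 8-base one with none.
toy = ["GGCC", "ATATATAT"]
combined = combined_gc_percent(toy)

# For comparison: the plain average of the per-sequence percentages.
unweighted = sum(combined_gc_percent([s]) for s in toy) / len(toy)
```

Here combined is 4 G/C bases out of 12, about 33.3%, while the unweighted average of the two per-sequence values (100% and 0%) is 50%. For a genome with a chromosome plus small plasmids, the combined figure is usually the one you want.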
It appears that this information is not available via Entrez Direct or rentrez.
It does seem to be in the assembly report for the genome, e.g. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/012/985/GCF_000012985.1_ASM1298v1/GCF_000012985.1_ASM1298v1_assembly_stats.txt If you only have a few genomes, you could look them up this way.
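If you want to script the lookup, here is a rough sketch that builds the stats-file URL from an accession and an assembly name. The directory layout is inferred only from the single example URL above, so treat it as an assumption. The function names assembly_stats_url and fetch_assembly_stats are made up for illustration, and you need to know the assembly name (ASM1298v1 here) in addition to the accession.

```python
from urllib.request import urlopen

# Assumption: all assemblies follow the layout seen in the example URL.
FTP_ROOT = "https://ftp.ncbi.nlm.nih.gov/genomes/all"


def assembly_stats_url(accession, asm_name):
    """Build the assembly_stats.txt URL, e.g. for GCF_000012985.1 / ASM1298v1."""
    prefix, rest = accession.split("_", 1)       # "GCF", "000012985.1"
    digits = rest.split(".")[0]                  # "000012985"
    # Split the digits into groups of three: 000 / 012 / 985
    parts = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    folder = f"{accession}_{asm_name}"
    filename = f"{folder}_assembly_stats.txt"
    return "/".join([FTP_ROOT, prefix, *parts, folder, filename])


def fetch_assembly_stats(accession, asm_name):
    """Download the stats file and return its text (needs network access)."""
    with urlopen(assembly_stats_url(accession, asm_name)) as resp:
        return resp.read().decode("utf-8")


url = assembly_stats_url("GCF_000012985.1", "ASM1298v1")
```

Calling fetch_assembly_stats("GCF_000012985.1", "ASM1298v1") should return the file contents as text, which you can then search for the values you need. I have not relied on any particular line format inside the file, so check one by eye before parsing many of them.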