How can I get GC content information with the R package rentrez?
2.7 years ago
anran04100 • 0

I now have the taxid, orgname, accession number, and assembly level, as follows:

'Hahella chejuensis' 'GCF_000012985.1' 'representative genome' 'Complete Genome'

How can I retrieve the GC content of a genome using rentrez (for example with the entrez_search() function or another one)?

Thanks

genome entrez_search GC content R rentreze • 890 views

It appears that this information is not available via EntrezDirect/rentrez.

It seems to be there in the assembly report for the genome, e.g. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/012/985/GCF_000012985.1_ASM1298v1/GCF_000012985.1_ASM1298v1_assembly_stats.txt If you only have a few genomes, you could look them up this way.

2.7 years ago
Mensur Dlakic ★ 28k

Here is a little Python script that calculates a combined G/C content for all sequences in a single FASTA file:

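import sys
from Bio import SeqIO

gc_total = 0   # G/C (and ambiguous S) bases summed over all records
len_total = 0  # total number of bases summed over all records

# Plain 'r' mode: the old 'rU' mode was removed in Python 3.11.
with open(sys.argv[1], 'r') as fasta_file:
    for rec in SeqIO.parse(fasta_file, 'fasta'):
        seq = str(rec.seq).upper()
        # Same definition as Bio.SeqUtils.GC, which newer Biopython
        # releases deprecate in favour of gc_fraction.
        gc_total += sum(seq.count(base) for base in 'GCS')
        len_total += len(seq)

if len_total == 0:
    sys.exit('No sequences found in ' + sys.argv[1])

print(' G/C-content = %.2f' % (100.0 * gc_total / len_total))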

I suggest you save it as fasta_gc_mean.py and run it like this:

python fasta_gc_mean.py GCF_000012985.1_ASM1298v1_genomic.fna

For the genome file listed above it gives the following output:

 G/C-content = 53.87
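
To illustrate what "combined" means here, the following is a minimal sketch using made-up toy sequences rather than the genome file. It weights each sequence's G/C percentage by its length, so a short GC-rich plasmid does not count as much as a long chromosome. The function names gc_percent and combined_gc_percent are hypothetical helpers for this illustration only, not part of Biopython or rentrez.

```python
def gc_percent(seq):
    """Percent of G/C (and ambiguous S) bases in one sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(seq.count(base) for base in "GCS") / len(seq)


def combined_gc_percent(seqs):
    """Length-weighted G/C percent over several sequences."""
    total_len = sum(len(s) for s in seqs)
    if total_len == 0:
        return 0.0
    # Weight each per-sequence percentage by that sequence's length.
    weighted = sum(gc_percent(s) * len(s) for s in seqs)
    return weighted / total_len


# Toy data (hypothetical): one short GC-only record, one longer AT-only record.
toy = ["GGCC", "ATATATATATAT"]
per_seq = [gc_percent(s) for s in toy]

print(per_seq)                      # [100.0, 0.0]
print(sum(per_seq) / len(per_seq))  # 50.0, the unweighted mean
print(combined_gc_percent(toy))     # 25.0, the length-weighted value
```

The unweighted mean overstates the G/C content here because the 4-base record counts as much as the 12-base one; the length-weighted value is the same as counting G/C bases over all 16 bases at once.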