GC percent detection for several number of genes
6
0
Entering edit mode
7.7 years ago

Dear all,

I know that my question may seem so basic, but I need your advice.

I have about 400 genes and I need to know their GC%.

I couldn't find any online program that accepts a list of genes and returns their corresponding GC percent.

Can anybody suggest a way to do this?

I will appreciate any help in advance

Nazanin

GC% large number of genes • 5.5k views
ADD COMMENT
1
Entering edit mode

Does the %GC of a gene only include the exons, or also the UTRs and/or introns?

ADD REPLY
0
Entering edit mode

I have only the gene symbols of those 400 genes.

ADD REPLY
0
Entering edit mode

Does the desired %GC of a gene only include the exonic sequences, or also the sequences from UTRs and/or introns?

ADD REPLY
0
Entering edit mode

If you have any nucleotide sequence, one could just use geneboy, and not worry about if it's exonic, UTR, promoter or other. https://www.dnalc.org/resources/geneboy.html

ADD REPLY
0
Entering edit mode

Hi , You need to describe your data , do you have sequences of your genes ?

Best

ADD REPLY
0
Entering edit mode

no. I only have their names (gene symbol)

ADD REPLY
1
Entering edit mode

Ok , i think you need sequences. You can get it with gene symbol using https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=585799715_RKJMr531TPC0GMrB2Q4uMigVdh3K . You need make a custom track to get bed format. Then you can use samtools with the reference genome to get GC content :

bedtools nuc -fi hg19.fa -bed SNP_Regions.bed

ADD REPLY
1
Entering edit mode

That command says bedtools, not samtools :p

ADD REPLY
0
Entering edit mode

Thanks to correct me :)

ADD REPLY
0
Entering edit mode

Unfortunately the gene symbols belong to a bacterium not human.

Do you think I can use Ensembl bacteria?

ADD REPLY
2
Entering edit mode

Which bacterium is this? You may be able to get the records you need from the protein sequences from the genome and then it could simply be a matter of extracting genes you need and using a tool like GeeCee from EMBOSS to calculate the GC%.

ADD REPLY
0
Entering edit mode

I looked this BioMart for Bacteria Ensembl and if it's right there is no biomart to request what you want ... May be in NCBI ? How did you get your list ? is it for a specific species ?

ADD REPLY
2
Entering edit mode
7.7 years ago
GenoMax 148k

Since you are restricted to using windows here is a workable solution. As I started writing this I realized that if you are unfamilar with the command line (windows or otherwise) this is going to be tough. Some of these links may not work from your country and there is nothing we can do about that (you probably know what to do). But here goes.

  1. Get the sequence of Pesudomonas genes from here.
  2. Download BBMap suite from here. No installation is required. You will need to use 7-zip to uncompress the tar.gz file until you get a folder that says "bbmap". Hopefully you have Java on your machine since you will need it.If not install that.
    Save both of these downloads in the same directory.
  3. Open windows terminal and navigate to the folder where BBMap and the downloaded sequence file is present.
  4. We are going to run filterbyname program (/path_to/ part will need to be adjusted or removed if you have bbmap folder and the sequence file in current directory). java -cp /path/to/bbmap/current/driver/FilterReadsByName in=Pseudomonas_aeruginosa_PAO1_107.ffn names=your_identifiers include=t out=filtered.fa your_identifiers should be a plain text file with one gene identifier per line. This will get you ~110 sequences from your list that start with PA* in filtered.fa file.
  5. Use the web interface to Geecee program from EMBOSS at this site to get the GC%.

Rest of the sequences are in the file you downloaded but since their identifiers are buried in the fasta headers you could open the sequence file up, search for the names and copy and paste the sequence you need out into a new file and the repeat #5 above.

Option 2: If you can get python installed on your windows machine then you could potentially use the python code you received (or one from @jrj.healey in this thread) and just run it against the entire file from #1. You can then weed out the genes you need from that list.

ADD COMMENT
2
Entering edit mode
7.7 years ago

Dear all,

Finally I could calculate the GC% of all 320 genes.

Guess how?

There is an online tool (http://www.bioinformatics.nl/cgi-bin/emboss/infoseq) that you can upload a file that contains as many as fasta sequences. In a few seconds it calculates the GC% of all genes.

Thank you any way for your comments

Nazanin

ADD COMMENT
1
Entering edit mode
7.7 years ago
EagleEye 7.6k

1) Use biomart web to extract FASTA format sequence for your gene symbols,

http://www.ensembl.org/biomart/martview/c365a09ee6f427fb6b1c946a17fb1a72

Tutorial: https://www.ebi.ac.uk/training/online/course/ensembl-browsing-chordate-genomes/download-data-biomart

Note: From BioMart you can also get percentage GC content directly.

2) Check out this perl script. It works perfectly.

http://alrlab.research.pdx.edu/aquificales/scripts/get_gc_content.pl

ADD COMMENT
1
Entering edit mode

Just to clarify: Perl for windows will need to be installed for this perl script to work.

ADD REPLY
0
Entering edit mode

Forgot to mention.. thanks.

ADD REPLY
0
Entering edit mode
7.7 years ago
theobroma22 ★ 1.2k

Can you use R? The oligo package can do this for you, among many other ways. Also, in Excel.

ADD COMMENT
0
Entering edit mode

Does the oligo package start from gene names to get sequences and GC content?

ADD REPLY
0
Entering edit mode

OP said he has 400 "genes" and needs to calculate gc content, so yes the oligo package can be employed. OP didn't say he needed the sequences. My bad!

ADD REPLY
0
Entering edit mode
7.7 years ago
Joe 21k

Partial solution for you.

I don't know offhand how you can go about getting sequences for all your gene identifiers (I'm sure there's a way through NCBI though), but here's a script that'll calculate GC content (among other things) once you have all the nucleotide fastas:

#!/usr/bin/python

import os
import sys

with open(sys.argv[1],'r') as inFile:

    totalBP = 0
    gcBP = 0
    headerCount = 0

    for line in inFile:
        if line[0] == ">":
            headerCount += 1
         else:
            seqLine = line.strip().lower()
            totalBP += len(seqLine)
            gcBP += seqLine.count('g')
            gcBP += seqLine.count('c')

    print ("File %s" % str(sys.argv[1])) + ' GC: ' + str(float(gcBP) / totalBP)
ADD COMMENT
0
Entering edit mode
7.7 years ago
biomaster ▴ 180

You can send me the gene list or upload it somewhere, then I can do the calculation for you.

ADD COMMENT
0
Entering edit mode

This is not an appropriate answer for what the OP is asking. You may want to delete this.

Question is simply about a set of gene identifiers (no NGS/alignments involved as far as I see) for which OP wants to calculate GC%.

ADD REPLY
0
Entering edit mode

Sorry I was wrong, his question is simpler.

ADD REPLY
0
Entering edit mode

Thank u so much.

Here is the list of my genes( exactly 320).

ACCD ADK ALAS ALGZ AMIA AMTB AROB ATPA ATPC ATPD ATPE ATPF ATPH BACA CDHB CLPP CLS CMK COAE CPG2 CYOE CYSS DDLA DEF DGCA DNAA DNAE DNAG DNAX EFP ENO ERAS FABD FABF1 FABG FABZ FEPB FFH FOLA FOLC FOLK FOLP FRR FRUI FTSK FTSY GATA GATB GCP GCVT1 GIDA GIDB GLTX GLYA3 GMK GUAA GUAB GYRB HEMA HEME HEMK HEML HISS ILES INFA INFB ISPA ISPB KDPA KDPB KDPC KDPD KDPE KSGA LEPA LEUS LIG LIPC LPTA LYSA LYSS MAP METG MGTE MIAA MICA MIGA MODC MOEB MREC MTLR MURA NADD NRDA NUSA NUSB NUSG OSPR PA0017 PA0326 PA0372 PA0379 PA0386 PA0392 PA0394 PA0419 PA0450 PA0480 PA0489 PA0559 PA0733 PA0806 PA0832 PA0841 PA0842 PA0954 PA1007 PA1052 PA1055 PA1056 PA1059 PA1139 PA1164 PA1170 PA1303 PA1361 PA1386 PA1497 PA1507 PA1629 PA1653 PA1824 PA1964 PA1993 PA2098 PA2108 PA2157 PA2179 PA2526 PA2535 PA2613 PA2701 PA2751 PA2806 PA2822 PA2870 PA2914 PA2922 PA2959 PA2984 PA3048 PA3066 PA3086 PA3088 PA3139 PA3140 PA3175 PA3179 PA3197 PA3312 PA3470 PA3517 PA3567 PA3638 PA3641 PA3685 PA3741 PA3783 PA3799 PA3804 PA3806 PA3822 PA3854 PA3980 PA3982 PA4005 PA4068 PA4097 PA4122 PA4128 PA4153 PA4167 PA4178 PA4286 PA4672 PA4673 PA4746 PA4835 PA4886 PA4908 PA4950 PA4962 PA5127 PA5134 PA5195 PA5201 PA5228 PA5248 PA5281 PA5333 PA5384 PA5439 PA5473 PA5484 PA5529 PANB PANC PAND PBPC PCNB PELF PEPA PGK PGM PHES PHET PHNC PHOA PHOR PHOU PHR PNP POLA POTB POTC PPID PRFA PRFB PRFC PROS PRS PSTA PSTB PTPA PURF PURM PYKF PYRD PYRG QUEA RADA RECD RECF RECJ RECQ RECR RIMI RLUD RNC RNPA RPE RPIA RPLA RPLB RPLC RPLD RPLE RPLF RPLJ RPLK RPLL RPLM RPLN RPLO RPLP RPLQ RPLR RPLS RPLT RPLV RPLW RPLX RPMA RPMC RPMD RPMH RPMJ RPOA RPOB RPOC RPOD RPSA RPSD RPSE RPSH RPSI RPSJ RPSK RPSL RPSM RPSO RPSP RPSS RPST RPSU RUVA RUVB SECA SECY SERS STP1 SUCC TAG TGT THID THIG THRS THYA TIG TKTA TOPA TRPS TRXB1 TSF TYRS UPPS VALS XERD ZNUB

Thank u in advance to do this calculation for me

Nazanin

ADD REPLY
0
Entering edit mode

Hi,

If it is not too time consuming can you calculate the GC% of the following gene symbols:

Here is the list of my genes( exactly 320).

ACCD ADK ALAS ALGZ AMIA AMTB AROB ATPA ATPC ATPD ATPE ATPF ATPH BACA CDHB CLPP CLS CMK COAE CPG2 CYOE CYSS DDLA DEF DGCA DNAA DNAE DNAG DNAX EFP ENO ERAS FABD FABF1 FABG FABZ FEPB FFH FOLA FOLC FOLK FOLP FRR FRUI FTSK FTSY GATA GATB GCP GCVT1 GIDA GIDB GLTX GLYA3 GMK GUAA GUAB GYRB HEMA HEME HEMK HEML HISS ILES INFA INFB ISPA ISPB KDPA KDPB KDPC KDPD KDPE KSGA LEPA LEUS LIG LIPC LPTA LYSA LYSS MAP METG MGTE MIAA MICA MIGA MODC MOEB MREC MTLR MURA NADD NRDA NUSA NUSB NUSG OSPR PA0017 PA0326 PA0372 PA0379 PA0386 PA0392 PA0394 PA0419 PA0450 PA0480 PA0489 PA0559 PA0733 PA0806 PA0832 PA0841 PA0842 PA0954 PA1007 PA1052 PA1055 PA1056 PA1059 PA1139 PA1164 PA1170 PA1303 PA1361 PA1386 PA1497 PA1507 PA1629 PA1653 PA1824 PA1964 PA1993 PA2098 PA2108 PA2157 PA2179 PA2526 PA2535 PA2613 PA2701 PA2751 PA2806 PA2822 PA2870 PA2914 PA2922 PA2959 PA2984 PA3048 PA3066 PA3086 PA3088 PA3139 PA3140 PA3175 PA3179 PA3197 PA3312 PA3470 PA3517 PA3567 PA3638 PA3641 PA3685 PA3741 PA3783 PA3799 PA3804 PA3806 PA3822 PA3854 PA3980 PA3982 PA4005 PA4068 PA4097 PA4122 PA4128 PA4153 PA4167 PA4178 PA4286 PA4672 PA4673 PA4746 PA4835 PA4886 PA4908 PA4950 PA4962 PA5127 PA5134 PA5195 PA5201 PA5228 PA5248 PA5281 PA5333 PA5384 PA5439 PA5473 PA5484 PA5529 PANB PANC PAND PBPC PCNB PELF PEPA PGK PGM PHES PHET PHNC PHOA PHOR PHOU PHR PNP POLA POTB POTC PPID PRFA PRFB PRFC PROS PRS PSTA PSTB PTPA PURF PURM PYKF PYRD PYRG QUEA RADA RECD RECF RECJ RECQ RECR RIMI RLUD RNC RNPA RPE RPIA RPLA RPLB RPLC RPLD RPLE RPLF RPLJ RPLK RPLL RPLM RPLN RPLO RPLP RPLQ RPLR RPLS RPLT RPLV RPLW RPLX RPMA RPMC RPMD RPMH RPMJ RPOA RPOB RPOC RPOD RPSA RPSD RPSE RPSH RPSI RPSJ RPSK RPSL RPSM RPSO RPSP RPSS RPST RPSU RUVA RUVB SECA SECY SERS STP1 SUCC TAG TGT THID THIG THRS THYA TIG TKTA TOPA TRPS TRXB1 TSF TYRS UPPS VALS XERD ZNUB

Thank u in advance

Nazanin

ADD REPLY
3
Entering edit mode

So you prefer someone else to do it for you, rather than to learn something yourself with all the advice you got in this thread?

ADD REPLY
0
Entering edit mode

Biomaster suggest me to do it for me.

He also sent me the Python code, but unfortunately I am not familiar with Python.

By the way I could not find any practical solution in this thread.

Getting the sequences of all these 320 genes is so time consuming and if I want to do this, I also can calculate the GC% one by one by available software.

I am eager to learn the solution if you have any suggestion

Nazanin

ADD REPLY
0
Entering edit mode

You should start to answer the question from genomax2 and me which was : Which bacterium ? ...

ADD REPLY
0
Entering edit mode

The name of the bacterium is: Pseudomonas aeruginosa

ADD REPLY
1
Entering edit mode

I was looking for a simple solution online but i didn't find one. If you can code in some language the solution could be : 1 go to http://www.pseudomonas.com/strain/download and get Annotations and Sequence (GFF3) file and the full genome of Pseudomonas aeruginosa 2 Get your coordinate of yours genes and make a bed file 3 Get you %GC : bedtools nuc -fi genome_ref -bed your_bed_created_in_stape_2.bed

ADD REPLY
1
Entering edit mode

There is not a lot of coding involved in that ;-)

  1. Convert gff3 to bed
  2. Grep the lines you need (grep -w -f mygenes.txt myannotation.bed > myfilteredannotation.bed)
  3. bedtools
ADD REPLY
0
Entering edit mode

You also hadn't replied if you only need the GC of the coding sequences, or also of the intronic part. But since we now know it's a bacterium there is no need for that information either and the problem is far easier. Next time, be as informative as possible in your first post. Don't forget the organism.

You need a bed file of your genes of interest and you need the reference genome in fasta format. Is that something you can fix?

ADD REPLY
0
Entering edit mode

Since I want to relate the level of expression of the genes to their GC%, I think the GC% of coding regions would be more helpful. Unfortunately I have no access to Linux operating system and I do not think I can perform the procedure.

Thank you anyway for the help and time you put for my problem

ADD REPLY

Login before adding your answer.

Traffic: 1757 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6