Conversion of full gene description to gene symbols
6.1 years ago

Hi, I have searched Biostars but could not find an answer to my question. I am trying to convert 2000 gene descriptions (full gene names) to gene symbols (acronyms). As I am working with a non-model species, Gene ID did not help much. I also tried bioDBnet and DAVID, but they did not help much either. Are there any other online or offline tools I can use? Thanks in advance.

gene sequence • 4.7k views

Can you give some examples? And which organism is this?


For example, I have a full gene name, "sulfatase 1". The gene symbol for that is SULF1. I have 2000 such full names. The organism I am working on is salmon.

6.1 years ago
GenoMax 147k

You could use the NCBI Entrez Direct (EDirect) unix utilities. With the example you posted above (assuming you are working with Atlantic salmon):

$ esearch -db gene -query "sulfatase 1 [TITLE] AND Salmo salar [ORGN]" | esummary | grep -w "Name"
        <Name>sulf1</Name>

$ esearch -db gene -query "monooxygenase [TITLE] AND Salmo salar [ORGN]" | esummary | grep -w "Name"
        <Name>ywhaz</Name>
        <Name>coq6</Name>
        <Name>ywhah</Name>
        <Name>fmo5</Name>
        <Name>moxd1</Name>
        <Name>agmo</Name>
        <Name>coq6</Name>
        <Name>msmo1</Name>
        <Name>pam</Name>
        <Name>bcmo1</Name>

As long as the titles you have are specific, they should return a single name. Otherwise you may get more than one gene (as in the second example above).
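
The grep step above leaves the XML tags around each symbol. A minimal sketch for getting bare symbols, assuming esummary keeps one <Name> element per line as in the output shown; extract_names is a hypothetical helper name, not part of EDirect:

```shell
#!/bin/bash
# extract_names: hypothetical helper. Reads esummary XML on standard input
# and prints only the text inside each <Name>...</Name> element, one per line.
# Assumes at most one <Name> element per input line.
extract_names() {
    sed -n 's:.*<Name>\(.*\)</Name>.*:\1:p'
}

# Intended use (needs EDirect and network access, so it is left commented out):
# esearch -db gene -query "sulfatase 1 [TITLE] AND Salmo salar [ORGN]" | esummary | extract_names
```

EDirect's own xtract tool can probably do the same job (something like `xtract -pattern DocumentSummary -element Name`), but that variant is untested here.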


Hi again, thanks for the suggestion. When I tried to run this in a loop over the list of 1900 genes, it did not work the way I wanted.

#!/bin/bash
cat /home/softwares/genelist.txt |
while read line
do
   esearch -db gene -query "$line [TITLE] AND Salmo salar [ORGN]" | esummary | grep -w "Name" 2>&1 | tee log-gene.txt
done

It gave me the gene symbol for the first line only. But if I remove the TITLE and ORGN fields, it seems to work. What do you think is wrong here?


Please use code formatting for code; I have changed it for you this time.


If you provide additional examples I can take a look.

A few things to check: the example is for Atlantic salmon. If you are working with a different species, you need to replace the Latin name in the command accordingly. You should also add a sleep step to your loop so NCBI does not flag your IP, and you will want to sign up for an NCBI API key if you are running that many queries.
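
The points above could be wired into a script roughly as follows. This is only a sketch: build_query and QUERY_PAUSE are hypothetical names introduced for illustration, the one-second pause is an arbitrary conservative choice, and the key value is a placeholder. EDirect reads the key from the NCBI_API_KEY environment variable.

```shell
#!/bin/bash
# EDirect picks up an API key from the NCBI_API_KEY environment variable.
# The value below is a placeholder, so the line is left commented out.
# export NCBI_API_KEY="paste_your_key_here"

# Pause between queries, in seconds (an assumed, conservative default).
QUERY_PAUSE=${QUERY_PAUSE:-1}

# build_query: hypothetical helper. Trims whitespace around the gene
# description in $1 and wraps it in the [TITLE]/[ORGN] query used above,
# with the organism's Latin name taken from $2.
build_query() (
    desc=$(printf '%s' "$1" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
    printf '%s [TITLE] AND %s [ORGN]' "$desc" "$2"
)

# Inside a loop, each lookup would then be followed by a pause, e.g.:
#   esearch -db gene -query "$(build_query "$line" "Oncorhynchus kisutch")" ...
#   sleep "$QUERY_PAUSE"
```

Keeping the organism as a parameter means switching from Atlantic salmon (Salmo salar) to, say, coho salmon (Oncorhynchus kisutch) only changes one argument.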


Thanks again.

Here is a small example:

solute carrier family 28 member 3 
ran-binding protein 3 
DNA polymerase zeta catalytic subunit 
dynein heavy chain 8, axonemal 
activin A receptor type 1 
regulator of nonsense transcripts 1 
target of EGR1, member 1 (nuclear) 
methyltransferase like 9 
natural resistance-associated macrophage protein 2 
lymphoid-restricted membrane protein 
transmembrane protein 39B 
spindlin-Z

Let me know if this list helps.


Try the following:

$ while read i; do echo $i; esearch -db gene -query "$i [TITLE] AND Salmo salar [ORGN]" < /dev/null | esummary | grep -w "Name"; done < list
solute carrier family 28 member 3
        <Name>slc28a3</Name>
ran-binding protein 3
        <Name>ranb3</Name>
        <Name>LOC106600517</Name>
        <Name>LOC106573555</Name>
        <Name>LOC106569622</Name>
        <Name>LOC106568171</Name>
        <Name>LOC106563114</Name>
DNA polymerase zeta catalytic subunit
dynein heavy chain 8, axonemal
activin A receptor type 1
        <Name>acvr1</Name>
regulator of nonsense transcripts 1
        <Name>LOC106613627</Name>
        <Name>LOC106599024</Name>
        <Name>LOC106573910</Name>
        <Name>LOC106568944</Name>
target of EGR1, member 1 (nuclear)
methyltransferase like 9
        <Name>mettl9</Name>
natural resistance-associated macrophage protein 2
        <Name>LOC106583265</Name>
        <Name>LOC106572643</Name>
        <Name>LOC106567082</Name>
        <Name>LOC106565428</Name>
lymphoid-restricted membrane protein
        <Name>LOC106609163</Name>
        <Name>LOC106576138</Name>
transmembrane protein 39B
        <Name>tmem39b</Name>
spindlin-Z
        <Name>spinz</Name>

Thanks to @RamRS for the pointer that was essential.
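
Why the `< /dev/null` matters: commands inside a while-read loop inherit the loop's standard input, so a command that reads stdin can consume the rest of the list after the first iteration. The sketch below reproduces this without any network access; `cat > /dev/null` is only a stand-in for a stdin-reading command such as esearch, and count_iterations is a hypothetical name.

```shell
#!/bin/bash
# count_iterations: hypothetical demo. Feeds three lines to a while-read loop
# and prints how many iterations actually ran.
#   shared   -> the inner command reads the loop's stdin
#   detached -> the inner command's stdin is redirected from /dev/null
count_iterations() (
    mode=$1
    n=0
    while IFS= read -r line; do
        n=$((n + 1))
        if [ "$mode" = "shared" ]; then
            cat > /dev/null                # swallows the remaining lines
        else
            cat > /dev/null < /dev/null    # leaves the list untouched
        fi
    done <<'EOF'
a
b
c
EOF
    echo "$n"
)

count_iterations shared     # prints 1
count_iterations detached   # prints 3
```

This matches the symptom reported earlier in the thread, where only the first gene in the list produced a result.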


Addendum (also something I learned today, thank you @GenoMax):

while IFS= read -r line
do
    # commands here
done < in_file

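reads in_file line by line, but the commands in the loop body inherit the loop's standard input, so a command that reads stdin itself (as esearch evidently does) can swallow the remaining lines unless its input is redirected (e.g. with < /dev/null), whereas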

for line in $(cat in_file)
do
    # commands here
done

makes the shell split the file's contents on whitespace and feeds the resulting words one at a time to the commands in the loop body, whose standard input is left untouched.

If you need the for loop to split only on newlines, use:

OLD_IFS=$IFS
IFS=$'\n'
for line in $(cat in_file)
do
     #commands here
done
IFS=$OLD_IFS
unset OLD_IFS

If you don't wish to use a temporary variable to store $IFS, you can check out other options here: https://unix.stackexchange.com/a/92190/135331
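
To make the difference concrete, here is a small sketch that counts loop iterations for a two-line sample under each approach. The function names are hypothetical. Each function body runs in a subshell (parentheses instead of braces), which is one way to confine an IFS change without a temporary variable.

```shell
#!/bin/bash
# count_for: for-loop over $(...) with the default IFS; splits on any whitespace.
count_for() (
    n=0
    for item in $(printf '%s\n' "$1"); do n=$((n + 1)); done
    echo "$n"
)

# count_for_lines: same loop, but IFS is set to a newline only. The subshell
# body means the caller's IFS is never changed.
count_for_lines() (
    IFS='
'
    n=0
    for item in $(printf '%s\n' "$1"); do n=$((n + 1)); done
    echo "$n"
)

# count_while: while-read loop; one iteration per line.
count_while() (
    n=0
    while IFS= read -r line; do n=$((n + 1)); done <<EOF
$1
EOF
    echo "$n"
)

sample='sulfatase 1
methyltransferase like 9'

count_for "$sample"         # prints 5 (one per word)
count_for_lines "$sample"   # prints 2
count_while "$sample"       # prints 2
```

For gene descriptions, which almost always contain spaces, only the last two variants keep each description intact.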

6.1 years ago
Anima Mundi ★ 2.9k

Hello, in your shoes I probably would:

a) retrieve all NCBI sequences for your taxon in GenBank format

b) search for each gene description in your GenBank database (e.g. in the "Official Full Name" field)

c) fetch the corresponding gene symbol (i.e. in the "Official Symbol" field)

Hope this helps!
