Perl script to download gene sequences
1
0
Entering edit mode
6.5 years ago
dllopezr ▴ 130

Hi everyone!

This entrez command give me an outpout with the genes sequences of a gen/enzymes for various organism:

esearch -db gene -query "glutaminase-asparaginase [Gene/Protein Name] AND (bacteria [orgn] OR fungi [orgn] OR archaea [orgn]) AND alive [prop]" | efetch -format docsum | xtract -pattern GenomicInfoType -element ChrAccVer -element ChrStart -element ChrStop |xargs -n 3 sh -c 'efetch -db nuccore -id "$0" -seq_start "$1" -seq_stop "$2" -format fasta'

The output is similiar to:

>NC_030957.1:c4121890-4120582 Colletotrichum higginsianum
TGAGAGCTTCTTACTTGTCGACGCTGTTGTTGCCAGCTCTGGTAGCCCATGGTTTCGCCTCCCCAGTCGG
>NC_016603.1:c898826-897759 Acinetobacter pittii
TGTTGACTAAAACTGTTAAATCTTTAGGTTTAGCGATGGGCTTATTAG
>NC_002947.4:c2800289-2799201 Pseudomonas putida
TGAATGCCGCACTGAAAACCTTCGCCCCAAGCGCACTCGCCCTGCTGCTGATCCTGCCATCCAGCGCCTC

But I need to do this for several genes that i have in a first column of a table, like that:

GeneNameA     OtherColumn    OtherColumn

GeneNameB     OtherColumn    Other Colmn

I am searching for a Perl script that read the first column and pass each GeneName to this space of the entrez command : "X" [Gene/Protein Name], and create a multifasta files that contains the sequences for each Gene separetely.

My programming skills are poor yet and I am stuck in this part. I´ll will be grateful with your help!

gene perl ncbi sequence • 2.4k views
ADD COMMENT
1
Entering edit mode

I added (code) markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY
1
Entering edit mode
6.5 years ago

no need to go for PERL here, some simple bash will do:

IFS=$'\n'
for i in $( cut -f1 table ); do 
 esearch -db gene -query "$i [Gene/Protein Name] AND (bacteria [org .... >> fasta_file
done
unset IFS

table is your list and fasta_file the output file you want to create (one big multi fasta file with all requested sequences)

EDIT: added IFS declaration to avoid splitting on (unwanted) spaces in the search terms

ADD COMMENT
0
Entering edit mode

Thank you so much!

The code work but I forgot to say that my table in excel look like this:

Inositol pyrophosphate synthase  OtherColumn  OtherColumn

There are spaces in the firts column, and the code only read the first world "Inositol"

I translated the excel table to csv and tab separate file but don't work

Any Idea of how to fix it?

ADD REPLY
0
Entering edit mode

Save your file as comma delimited text and then replace cat table | cut -f1 with awk -F ',' '{print $1}' your_file and see if that does the trick.

ADD REPLY
0
Entering edit mode

or save as a tab delineated text file and then the original command should also work, but the one of genomax is probably more secure

ADD REPLY
0
Entering edit mode

Thanks genomax and lieven.sterck. But didn't work, the output is the same.

ADD REPLY
0
Entering edit mode

strange, it should not be.

Can you post the first lines or so of this csv/tab file?

ADD REPLY
0
Entering edit mode

Hi, I only test the script with the first row for now, In this case my $i must be Inositol pyrophosphate synthase, but the script do the query only with inositol, and the fasta created is inositol.fasta. I want the query to be over over the whole name and the file created with the name "Inositol_pyrophosphate_synthase.fasta" or similar. This is the file:

Inositol pyrophosphate synthase Inmovilización  Sintesis de polifosfatos    NA  16

this is the script:

#!/bin/sh
for i in $(  cat prueba.txt | cut -f1 ); do 
 esearch -db gene -query "$i [Gene/Protein Name] AND (bacteria [orgn] OR fungi [orgn] OR archaea [orgn]) AND alive [prop]" | efetch -format docsum | xtract -pattern GenomicInfoType -element ChrAccVer -element ChrStart -element ChrStop |xargs -n 3 sh -c 'efetch -db nuccore -id "$0" -seq_start "$1" -seq_stop "$2" -format fasta' >> "$i".fasta
done

I am triyng with tab separate files and comma separate values files but the output is the same ever. I also try with genomax variation but not work too.

ADD REPLY
0
Entering edit mode

are "Inositol" and "pyrophosphate" and "synthase" present in a single column in your excel list? if not neither option given will work I'm afraid

Post the result of head prueba.txt , so we can inspect it

ADD REPLY
0
Entering edit mode

Hi lieven.sterck

Yes in excel "Inositol" and "pyrophosphate" and "synthase" are present in the same column, and I translated this to csv or tab file. The header of prueba.txt is:

Inositol pyrophosphate synthase Inmovilización  Sintesis de polifosfatos    NA  16

Thank you for your help

ADD REPLY
0
Entering edit mode

OK, figured it out. The loop keep splitting on space regardless of the tab or csv file. Here's how to solve it, you need to disable the space splitting of the for loop or alternatively use a while loop in stead.

for loop: set IFS specifically on newlines:

IFS=$'\n'
    for i in $( cat table | cut -f1 ); do 
     j=${i// /_} 
     esearch -db gene -query "$i [Gene/Protein Name] AND (bacteria [org .... >> ${j}.fasta
    done
unset IFS

or using a while loop:

while read i ; do esearch -db gene -query "$i [Gene/Protein Name] AND (bacteria [org .... >> fasta_file ; done <(cat table | cut -f1 )

I also added a line in the for loop code to create nice fasta file names (same can be done in the while loop as well)

I'll amend my original answer as such too

ADD REPLY

Login before adding your answer.

Traffic: 1544 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6