Question

Perl script to download gene sequences

0

Entering edit mode

7.4 years ago

dllopezr ▴ 130

Hi everyone!

This entrez command give me an outpout with the genes sequences of a gen/enzymes for various organism:

esearch -db gene -query "glutaminase-asparaginase [Gene/Protein Name] AND (bacteria [orgn] OR fungi [orgn] OR archaea [orgn]) AND alive [prop]" | efetch -format docsum | xtract -pattern GenomicInfoType -element ChrAccVer -element ChrStart -element ChrStop |xargs -n 3 sh -c 'efetch -db nuccore -id "$0" -seq_start "$1" -seq_stop "$2" -format fasta'

The output is similiar to:

>NC_030957.1:c4121890-4120582 Colletotrichum higginsianum
TGAGAGCTTCTTACTTGTCGACGCTGTTGTTGCCAGCTCTGGTAGCCCATGGTTTCGCCTCCCCAGTCGG
>NC_016603.1:c898826-897759 Acinetobacter pittii
TGTTGACTAAAACTGTTAAATCTTTAGGTTTAGCGATGGGCTTATTAG
>NC_002947.4:c2800289-2799201 Pseudomonas putida
TGAATGCCGCACTGAAAACCTTCGCCCCAAGCGCACTCGCCCTGCTGCTGATCCTGCCATCCAGCGCCTC

But I need to do this for several genes that i have in a first column of a table, like that:

GeneNameA     OtherColumn    OtherColumn

GeneNameB     OtherColumn    Other Colmn

I am searching for a Perl script that read the first column and pass each GeneName to this space of the entrez command : "X" [Gene/Protein Name], and create a multifasta files that contains the sequences for each Gene separetely.

My programming skills are poor yet and I am stuck in this part. I´ll will be grateful with your help!

gene perl ncbi sequence • 3.0k views

ADD COMMENT • link updated 7.4 years ago by WouterDeCoster 48k • written 7.4 years ago by dllopezr ▴ 130

1

Entering edit mode

I added (code) markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY • link 7.4 years ago by WouterDeCoster 48k

score 1 · Answer 1 · 2018-06-09

1

Entering edit mode

7.4 years ago

lieven.sterck 16k

no need to go for PERL here, some simple bash will do:

IFS=$'\n'
for i in $( cut -f1 table ); do 
 esearch -db gene -query "$i [Gene/Protein Name] AND (bacteria [org .... >> fasta_file
done
unset IFS

table is your list and fasta_file the output file you want to create (one big multi fasta file with all requested sequences)

EDIT: added IFS declaration to avoid splitting on (unwanted) spaces in the search terms

ADD COMMENT • link 7.4 years ago by lieven.sterck 16k

0

Entering edit mode

Thank you so much!

The code work but I forgot to say that my table in excel look like this:

Inositol pyrophosphate synthase  OtherColumn  OtherColumn

There are spaces in the firts column, and the code only read the first world "Inositol"

I translated the excel table to csv and tab separate file but don't work

Any Idea of how to fix it?

ADD REPLY • link 7.4 years ago by dllopezr ▴ 130

0

Entering edit mode

Save your file as comma delimited text and then replace cat table | cut -f1 with awk -F ',' '{print $1}' your_file and see if that does the trick.

ADD REPLY • link 7.4 years ago by GenoMax 154k

0

Entering edit mode

or save as a tab delineated text file and then the original command should also work, but the one of genomax is probably more secure

ADD REPLY • link 7.4 years ago by lieven.sterck 16k

0

Entering edit mode

Thanks genomax and lieven.sterck. But didn't work, the output is the same.

ADD REPLY • link 7.4 years ago by dllopezr ▴ 130

0

Entering edit mode

strange, it should not be.

Can you post the first lines or so of this csv/tab file?

ADD REPLY • link 7.4 years ago by lieven.sterck 16k

0

Entering edit mode

Hi, I only test the script with the first row for now, In this case my $i must be Inositol pyrophosphate synthase, but the script do the query only with inositol, and the fasta created is inositol.fasta. I want the query to be over over the whole name and the file created with the name "Inositol_pyrophosphate_synthase.fasta" or similar. This is the file:

Inositol pyrophosphate synthase Inmovilización  Sintesis de polifosfatos    NA  16

this is the script:

#!/bin/sh
for i in $(  cat prueba.txt | cut -f1 ); do 
 esearch -db gene -query "$i [Gene/Protein Name] AND (bacteria [orgn] OR fungi [orgn] OR archaea [orgn]) AND alive [prop]" | efetch -format docsum | xtract -pattern GenomicInfoType -element ChrAccVer -element ChrStart -element ChrStop |xargs -n 3 sh -c 'efetch -db nuccore -id "$0" -seq_start "$1" -seq_stop "$2" -format fasta' >> "$i".fasta
done

I am triyng with tab separate files and comma separate values files but the output is the same ever. I also try with genomax variation but not work too.

ADD REPLY • link 7.4 years ago by dllopezr ▴ 130

0

Entering edit mode

are "Inositol" and "pyrophosphate" and "synthase" present in a single column in your excel list? if not neither option given will work I'm afraid

Post the result of head prueba.txt , so we can inspect it

ADD REPLY • link 7.4 years ago by lieven.sterck 16k

0

Entering edit mode

Hi lieven.sterck

Yes in excel "Inositol" and "pyrophosphate" and "synthase" are present in the same column, and I translated this to csv or tab file. The header of prueba.txt is:

Inositol pyrophosphate synthase Inmovilización  Sintesis de polifosfatos    NA  16

Thank you for your help

ADD REPLY • link 7.4 years ago by dllopezr ▴ 130

0

Entering edit mode

OK, figured it out. The loop keep splitting on space regardless of the tab or csv file. Here's how to solve it, you need to disable the space splitting of the for loop or alternatively use a while loop in stead.

for loop: set IFS specifically on newlines:

IFS=$'\n'
    for i in $( cat table | cut -f1 ); do 
     j=${i// /_} 
     esearch -db gene -query "$i [Gene/Protein Name] AND (bacteria [org .... >> ${j}.fasta
    done
unset IFS

or using a while loop:

while read i ; do esearch -db gene -query "$i [Gene/Protein Name] AND (bacteria [org .... >> fasta_file ; done <(cat table | cut -f1 )

I also added a line in the for loop code to create nice fasta file names (same can be done in the while loop as well)

I'll amend my original answer as such too

ADD REPLY • link 7.4 years ago by lieven.sterck 16k