Making a Python script for downloading genomes for OMA analysis

1

3.3 years ago

anasjamshed ▴ 140

I want to make a script that will be able to do the following:

1) Given a list of NCBI species IDs, download all genome assemblies for these species. Edit: please see comments for definition of task
2) Run OMA standalone on these downloaded genomes to infer hierarchical orthologous groups (HOGs).
3) Add GO annotations to all loci used in the OMA analysis.

My plan is to use Biopython to fetch the genomes for these species, then run pyham (https://lab.dessimoz.org/blog/2017/06/29/pyham) to infer hierarchical orthologous groups, and then use goatools (https://github.com/tanghaibao/goatools) to add GO annotations.

Is this possible using all three, or should I do something else?

orthology python OMA • 3.6k views

ADD COMMENT • link updated 3.3 years ago by patrickdm ▴ 250 • written 3.3 years ago by anasjamshed ▴ 140

1

You want to do orthologue identification with OMA and therefore the first task you describe above needs some correction to be successful:

Given a list of species, according to documentation you have to download a proteome annotation file in FASTA format for each. In particular, you do not need all or any assemblies per species, but the single representative proteome file. The filename should be the name of the genome.
Run OMA or other orthologue identification software on these files as described in the software's documentation.

I have a simple shell script that can automatically download the proteome of the representative genome, if it exists. For genomes where the gene prediction pipeline has not been run, however, it cannot give you anything.

ADD REPLY • link 3.3 years ago by Michael 55k
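
As a rough illustration of the file-naming point in the reply above, the following Python sketch copies one proteome FASTA per species into a single input directory and names each file after its genome. The `DB` directory name and the `.fa` extension are assumptions about what OMA standalone expects, so check them against the OMA documentation; `genome_filename` and `stage_proteomes` are hypothetical helper names.

```python
import re
import shutil
from pathlib import Path


def genome_filename(species_name: str, extension: str = ".fa") -> str:
    """Turn 'Daphnia pulex' into 'Daphnia_pulex.fa' (the extension is an assumption)."""
    # Collapse every run of characters that is not a letter or digit into one underscore.
    stem = re.sub(r"[^A-Za-z0-9]+", "_", species_name.strip()).strip("_")
    if not stem:
        raise ValueError("species name is empty")
    return stem + extension


def stage_proteomes(proteomes: dict[str, Path], db_dir: Path = Path("DB")) -> list[Path]:
    """Copy one proteome FASTA per species into db_dir, named after the genome."""
    db_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for species, fasta in proteomes.items():
        target = db_dir / genome_filename(species)
        shutil.copyfile(fasta, target)
        staged.append(target)
    return staged
```

Calling `stage_proteomes({"Daphnia pulex": Path("some_protein.faa")})` would leave a single `DB/Daphnia_pulex.fa`, which keeps the one-file-per-genome rule explicit.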

0

This might be helpful if you haven't come across it yet:

Expanding the Orthologous Matrix (OMA) programmatic interfaces: REST API and the OmaDB packages for R and Python

ADD REPLY • link 3.3 years ago by Wayne ★ 2.1k

0

I want to fetch genomes from NCBI, not from OMA.

ADD REPLY • link 3.3 years ago by anasjamshed ▴ 140

0

First, I want to supply NCBI species IDs to download the genomes.

ADD REPLY • link 3.3 years ago by anasjamshed ▴ 140

1

3.3 years ago

Michael 55k

The following shell script can download the genome assembly and proteome file from NCBI for a list of species names. It does a little more than is needed for step 1, but you will figure that out. Be careful: there is not much error checking, so if you have typos in the species list or a species doesn't have an annotated proteome, this may fail miserably. It also leaves the results of the Entrez queries around for your records.

You need to have the Entrez Direct command-line utilities (esearch, efetch, xtract) installed and on your PATH.

I am ignoring the python tag here because it is not important to make things work.

	#!/bin/bash

	set -u

	# usage: fetchAllGenomesByTaxon.sh Daphnia_pulex Lepeophtheirus_salmonis
	# either use quotes or underscores
	# This is just to show how to define the taxon list inline if you don't want to read taxa from the command line
	#TAXLIST=("Daphnia pulex" "Drosophila melanogaster" "Anopheles gambiae" "Pediculus humanus"
	#"Ixodes scapularis" "Apis mellifera" "Bombyx mori")
	#TAXLIST=("Strigamia maritima")
	# wget options as an array so each option is passed as a separate word
	WGET_OPTS=(-c --random-wait -t 40 -a wget.log)

	# arrays need bash (hence the shebang); ("$@") keeps multi-word names intact
	TAXLIST=("$@")
	for TAX in "${TAXLIST[@]}" ; do
		echo "getting genome for: $TAX"
		#mkdir -p "$TAX" # if you want to create a directory
		#cd "$TAX"
		# look up the genome record and keep the docsum for your records
		GENOME=$(esearch -db genome -query "${TAX}[Organism:exp]" |
			efetch -format docsum | tee "${TAX}.genome.esearch.docsum")
		ACC=$(echo "$GENOME" | xtract -pattern DocumentSummary -element Assembly_Accession)
		NAME=$(echo "$GENOME" | xtract -pattern DocumentSummary -element Assembly_Name)
		echo "authoritative genome: $ACC $NAME"
		# look up the assembly record to find its download directory
		RESULT=$(esearch -db assembly -query "$ACC" |
			efetch -format docsum | tee "${TAX}.assembly.esearch.docsum")
		FTPP=$(echo "$RESULT" | xtract -pattern DocumentSummary -element FtpPath_GenBank)
		TAXID=$(echo "$RESULT" | xtract -pattern DocumentSummary -element Taxid)
		echo "FtpPath: $FTPP"
		BASENAME=$(basename "$FTPP")
		FTPPATHG="$FTPP/${BASENAME}_genomic.fna.gz"
		FTPPATHP="$FTPP/${BASENAME}_protein.faa.gz"

		## get genome data
		echo "Downloading $FTPPATHG ..."
		wget "${WGET_OPTS[@]}" "$FTPPATHG"
		gunzip -f "$(basename "$FTPPATHG")"

		## get protein data
		echo "Downloading $FTPPATHP ..."
		# this may fail if the assembly has no annotated proteome
		if wget "${WGET_OPTS[@]}" "$FTPPATHP" ; then
			gunzip -f "$(basename "$FTPPATHP")"
		fi
		# cd ..
	done

ADD COMMENT • link 3.3 years ago by Michael 55k
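
For readers who do want Python, the download half of the script above could be sketched roughly as follows, using only the standard library. It mirrors the `FTPPATHG`/`FTPPATHP` construction and the `wget` plus `gunzip -f` steps. The function names are hypothetical, and switching `ftp://` to `https://` assumes the same directory is served over HTTPS on the same host.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlsplit


def assembly_urls(ftp_path: str) -> dict[str, str]:
    """Mirror FTPPATHG / FTPPATHP from the shell script for one assembly directory."""
    base = ftp_path.rstrip("/")
    name = base.rsplit("/", 1)[-1]  # same role as `basename "$FTPP"`
    # Assumption: the directory is also reachable over HTTPS on the same host.
    if base.startswith("ftp://"):
        base = "https://" + base[len("ftp://"):]
    return {
        "genome": f"{base}/{name}_genomic.fna.gz",
        "proteome": f"{base}/{name}_protein.faa.gz",
    }


def download_and_gunzip(url: str, dest_dir: Path = Path(".")) -> Path:
    """Download url into dest_dir and decompress it, like wget followed by gunzip -f."""
    gz_path = dest_dir / Path(urlsplit(url).path).name
    out_path = gz_path.with_suffix("")  # drop the trailing .gz
    urllib.request.urlretrieve(url, gz_path)
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    gz_path.unlink()
    return out_path
```

As in the shell version, the proteome download raises an error when the assembly has no annotated proteins, so a caller would want to catch `urllib.error.URLError` (which includes HTTP errors) around that second call.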

0

I need to do it in either Python or R.

ADD REPLY • link 3.3 years ago by anasjamshed ▴ 140

2

Why, if it works? After you download everything, the next step is to invoke OMA via the command line. Whether you wrap this process in Python or R makes no difference. Of course, you can write code similar to the above in Python or R. For R, there is the package biomartr, which can download genomic data from different sources. For Python, there should be a solution in Biopython, and there is a related question on Biostars: Download NCBI genome sequences from Python

Possibly, someone else can help you with such an implementation, but it won't be substantially easier or less error-prone than using my script.

ADD REPLY • link 3.3 years ago by Michael 55k
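
To make the "wrapping makes no difference" point concrete, here is a minimal sketch of calling a command-line tool such as OMA standalone from Python with `subprocess`. The executable name `oma` and the idea of running it from a working directory that holds the input files are assumptions; no OMA options are shown because they should be taken from the OMA documentation. Both function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_oma_command(executable: str = "oma", extra_args: tuple[str, ...] = ()) -> list[str]:
    """Assemble the argument list; the executable name is an assumption, not a documented default."""
    return [executable, *extra_args]


def run_oma(workdir: Path, executable: str = "oma", extra_args: tuple[str, ...] = ()) -> int:
    """Run the tool inside workdir and return its exit code without raising on failure."""
    completed = subprocess.run(
        build_oma_command(executable, extra_args),
        cwd=workdir,   # run where the input files live
        check=False,   # let the caller decide how to handle a non-zero exit code
    )
    return completed.returncode
```

The Python layer adds nothing beyond starting the process and reporting its exit code, which is why the choice of wrapper language matters so little here.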

0

Is it possible to download these species/genomes through Python:

1)https://www.ncbi.nlm.nih.gov/datasets/genomes/?taxon=50557&utm_source=genome&utm_medium=referral&utm_campaign=KnownItemSensor:taxname

2)https://www.ncbi.nlm.nih.gov/data-hub/taxonomy/938171/

3)https://omabrowser.org/All/oma-species.txt

ADD REPLY • link 3.3 years ago by anasjamshed ▴ 140

2

It is definitely possible but futile for OMA analysis. There is no genome annotation for Abrostola tripartita, hence no proteome, and the other links point to multiple taxa. I am not sure why you insist on Python (guessing 'assignment' or 'order from your boss'), but if you need such a Python solution, I cannot help you. I am bumping this post to allow others to see it and possibly help out, but I personally think it is best to approach the problem in a solution-oriented way, not a tool-centric one.

ADD REPLY • link 3.3 years ago by Michael 55k

0

Is this doable in R?

ADD REPLY • link 3.3 years ago by anasjamshed ▴ 140

0

Yes definitely :)

ADD REPLY • link 3.3 years ago by Michael 55k

0

I mean, using R for any of the genomes in the links?

ADD REPLY • link 3.3 years ago by anasjamshed ▴ 140

1

If you want to infer OMA HOGs, you will need the protein sequences for all your genomes. Either you restrict yourself to genomes that already have annotated protein sequences available, or you first need to infer them yourself. There are tons of tools and pipelines for that, but it won't be very easy to do.

Michael's script is very helpful for downloading the genomes and also the protein sequences, if available. In my view, you shouldn't insist on it being a Python script. His code makes use of the Entrez command-line tools from NCBI, which is perfect. Biopython also has a wrapper for Entrez, so you could rewrite Michael's script in Python if you (or your boss) insist.

To download the genomes from OMA, you also have an export function ( https://omabrowser.org/export ) where you can select your genomes of interest and export a tarball including oma standalone and the precomputed All-vs-All homology search files.

Cheers Adrian

ADD REPLY • link 3.3 years ago by Adrian Altenhoff ★ 1.1k
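
A minimal sketch of the Biopython route mentioned above, assuming Biopython is installed: `Bio.Entrez` is used to search the NCBI assembly database and read back the same `FtpPath_GenBank` and `Taxid` fields that the shell script extracts with `xtract`. The function names, the `retmax` default, and the choice to query the assembly database directly by organism are illustrative assumptions, not a port of the script's exact logic.

```python
def fetch_assembly_summaries(organism: str, email: str, retmax: int = 5) -> list:
    """Return assembly document summaries for an organism via Bio.Entrez (needs network)."""
    # Imported here so the helper below can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.esearch(db="assembly", term=f"{organism}[Organism:exp]", retmax=retmax)
    try:
        ids = Entrez.read(handle)["IdList"]
    finally:
        handle.close()
    if not ids:
        return []
    handle = Entrez.esummary(db="assembly", id=",".join(ids), report="full")
    try:
        record = Entrez.read(handle, validate=False)
    finally:
        handle.close()
    return list(record["DocumentSummarySet"]["DocumentSummary"])


def summarize(doc) -> dict:
    """Pick out the two fields the shell script reads: FtpPath_GenBank and Taxid."""
    return {
        "ftp_path": str(doc.get("FtpPath_GenBank", "")),
        "taxid": str(doc.get("Taxid", "")),
    }
```

An empty `ftp_path` in the result is worth checking for before building download URLs, since not every record is guaranteed to carry one.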

0

[..] you also have an export function ( https://omabrowser.org/export ) where you can select your genomes of interest and export a tarball including oma standalone and the precomputed All-vs-All homology search files

More on this in How to build phylogenetic species trees with OMA - (Protocol 2). Hth.

ADD REPLY • link 3.3 years ago by patrickdm ▴ 250
