Question

Retrieve upstream sequences from a gene of interest for many species

1

Entering edit mode

10.1 years ago

Polychaete ▴ 10

Let's say I want to find the upstream sequence for ACTA1 for every species of, say, birds that have such data available. There are 45 whole bird genomes along with annotations available here. In total, there are around 50 bird species with genome assemblies available, but they don't seem to be all in one place. Anyway, how can one get the upstream sequence from some gene, say ACTA1, for all 45 species available on gigadb, or for all the birds available on some other database (e.g. NCBI has 28 species: go to http://www.ncbi.nlm.nih.gov/gene and search (ortholog_gene_58[group]) AND "birds"[porgn:__txid8782]).

I prefer answers that use the R programming language, but other approaches are welcome.

genome R • 3.5k views

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.1 years ago by Polychaete ▴ 10

0

Entering edit mode

Genomes are available here

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.1 years ago by onuralp ▴ 190

0

Entering edit mode

Thanks for pointing out another place where many (though not all) bird genomes are available!

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.1 years ago by Polychaete ▴ 10

Ram · Answer 1 · 2014-11-04

You can use the location data on the Entrez Gene results page to format a E-fetch query in R. For the first record with NC_006090.3 (39337939..39339803, complement), you can get the gene sequence using

url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_006090&strand=2&seq_start=39337939&seq_stop=39339803&rettype=fasta&retmode=text"

#readLines(url) OR with Biostrings package
readDNAStringSet(url)

  A DNAStringSet instance of length 1
    width seq                                                                                                                               names               
[1]  1865 ATGTGTGACGAGGACGAGACCACCGCGCTCGTGTGCGACAACGGCTCCGGCCTGGTGAAGGCT...TGGATCACAAAGCAGGAGTACGATGAAGCCGGCCCATCCATTGTCCACCGTAAATGCTTCTAA gi|358485509:c393...

To get the upstream region on the minus strand, use end to end + 100 (or however many upstream bases)

url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_006090&strand=2&seq_start=39339803&seq_stop=39339903&rettype=fasta&retmode=text"

The second record from chimney swifts does not have location data listed in the results table, but its on the HTML page as NW_009954494 (369738..371606)

url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NW_009954494&strand=1&seq_start=369738&seq_stop=371606&rettype=fasta&retmode=text"

On the plus strand, use start-100 to start

url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NW_009954494&strand=1&seq_start=369638&seq_stop=369738&rettype=fasta&retmode=text"

Ram · Answer 2 · 2014-11-17

You can get the Location data from the esummary results (all except tinamou). Since you have to use the history server to pass results from esearch to esummary, you can use an R package like rentrez or call the E-utilities command line tools below

doc <- xmlParse(system('esearch -query "(ortholog_gene_58[group]) AND birds[porgn:__txid8782])" -db gene | esummary', intern=TRUE))

 xmlToDataFrame(doc["//LocationHistType"])
  AnnotationRelease  AssemblyAccVer      ChrAccVer ChrStart  ChrStop
1                102 GCF_000002315.3    NC_006090.3 39339802 39337938
2                100 GCF_000696875.1 NW_009926046.1    48075    61716
3                100 GCF_000747805.1 NW_009954494.1   369737   371605
4                100 GCF_000692075.1 NW_009902071.1   425342   423455