Retrieve upstream sequences from a gene of interest for many species
2
1
Entering edit mode
10.1 years ago
Polychaete ▴ 10

Let's say I want to find the upstream sequence for ACTA1 for every species of, say, birds that have such data available. There are 45 whole bird genomes along with annotations available here. In total, there are around 50 bird species with genome assemblies available, but they don't seem to be all in one place. Anyway, how can one get the upstream sequence from some gene, say ACTA1, for all 45 species available on gigadb, or for all the birds available on some other database (e.g. NCBI has 28 species: go to http://www.ncbi.nlm.nih.gov/gene and search (ortholog_gene_58[group]) AND "birds"[porgn:__txid8782]).

I prefer answers that use the R programming language, but other approaches are welcome.

genome R • 3.5k views
ADD COMMENT
0
Entering edit mode

Genomes are available here

ADD REPLY
0
Entering edit mode

Thanks for pointing out another place where many (though not all) bird genomes are available!

ADD REPLY
2
Entering edit mode
10.1 years ago
Chris S. ▴ 340

You can use the location data on the Entrez Gene results page to format a E-fetch query in R. For the first record with NC_006090.3 (39337939..39339803, complement), you can get the gene sequence using

url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_006090&strand=2&seq_start=39337939&seq_stop=39339803&rettype=fasta&retmode=text"

#readLines(url) OR with Biostrings package
readDNAStringSet(url)

  A DNAStringSet instance of length 1
    width seq                                                                                                                               names               
[1]  1865 ATGTGTGACGAGGACGAGACCACCGCGCTCGTGTGCGACAACGGCTCCGGCCTGGTGAAGGCT...TGGATCACAAAGCAGGAGTACGATGAAGCCGGCCCATCCATTGTCCACCGTAAATGCTTCTAA gi|358485509:c393...

To get the upstream region on the minus strand, use end to end + 100 (or however many upstream bases)

url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_006090&strand=2&seq_start=39339803&seq_stop=39339903&rettype=fasta&retmode=text"

The second record from chimney swifts does not have location data listed in the results table, but its on the HTML page as NW_009954494 (369738..371606)

url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NW_009954494&strand=1&seq_start=369738&seq_stop=371606&rettype=fasta&retmode=text"

On the plus strand, use start-100 to start

url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NW_009954494&strand=1&seq_start=369638&seq_stop=369738&rettype=fasta&retmode=text"
ADD COMMENT
0
Entering edit mode

This is nice. Are you extracting the location data by manually looking through the HTML page, or are you accessing all that information (record id, start, stop, strand) by scraping the page?

ADD REPLY
1
Entering edit mode
10.1 years ago
Chris S. ▴ 340

You can get the Location data from the esummary results (all except tinamou). Since you have to use the history server to pass results from esearch to esummary, you can use an R package like rentrez or call the E-utilities command line tools below

doc <- xmlParse(system('esearch -query "(ortholog_gene_58[group]) AND birds[porgn:__txid8782])" -db gene | esummary', intern=TRUE))

 xmlToDataFrame(doc["//LocationHistType"])
  AnnotationRelease  AssemblyAccVer      ChrAccVer ChrStart  ChrStop
1                102 GCF_000002315.3    NC_006090.3 39339802 39337938
2                100 GCF_000696875.1 NW_009926046.1    48075    61716
3                100 GCF_000747805.1 NW_009954494.1   369737   371605
4                100 GCF_000692075.1 NW_009902071.1   425342   423455
ADD COMMENT

Login before adding your answer.

Traffic: 2699 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6