Extract Sequence From The Genome?
14.1 years ago
Sam ▴ 90

Hello,

I have a report, generated by some analysis tools, that gives me chromosome start and end locations. Is there a tool that can quickly take these start/end locations and return the corresponding sequence from the human genome?

Thanks

sequence retrieval • 24k views

Duplicate of this.


Thank you - we are in the process of correcting older links now and this comment was helpful.

14.1 years ago
Pascal ▴ 130

Hi,

If you need sequences from many positions, I would recommend setting up biopieces. It requires downloading and indexing an entire genome, but once that is done you can extract many sequences very quickly.

http://code.google.com/p/biopieces/
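
To illustrate why a one-time index makes repeated lookups fast, here is a minimal Python sketch of the general idea. It is not biopieces and does not use its commands; the names `build_index` and `fetch` are made up for this example. It assumes a plain-text FASTA file in which every line of a sequence, except the last, has the same length, and it takes coordinates as 1-based and inclusive.

```python
from typing import Dict, NamedTuple


class IndexEntry(NamedTuple):
    length: int      # total number of bases in the sequence
    offset: int      # byte offset of the first base in the file
    line_bases: int  # bases per full line
    line_width: int  # bytes per full line, including the line ending


def build_index(fasta_path: str) -> Dict[str, IndexEntry]:
    """Scan the FASTA file once and record where each sequence starts."""
    index: Dict[str, IndexEntry] = {}
    name = None
    length = offset = line_bases = line_width = 0
    pos = 0  # byte position of the start of the current line
    with open(fasta_path, "rb") as handle:
        for raw in handle:
            if raw.startswith(b">"):
                if name is not None:
                    index[name] = IndexEntry(length, offset, line_bases, line_width)
                # Sequence name is the first word after ">"
                name = raw[1:].split()[0].decode()
                length = line_bases = line_width = 0
                offset = pos + len(raw)
            else:
                bases = len(raw.rstrip(b"\r\n"))
                if line_bases == 0:
                    line_bases, line_width = bases, len(raw)
                length += bases
            pos += len(raw)
    if name is not None:
        index[name] = IndexEntry(length, offset, line_bases, line_width)
    return index


def fetch(fasta_path: str, index: Dict[str, IndexEntry],
          chrom: str, start: int, end: int) -> str:
    """Return bases start..end (1-based, inclusive) by seeking into the file."""
    entry = index[chrom]
    if not 1 <= start <= end <= entry.length:
        raise ValueError("coordinates fall outside the sequence")

    def byte_pos(i: int) -> int:
        # i is a 0-based base index; skip one line ending per full line
        return entry.offset + (i // entry.line_bases) * entry.line_width + i % entry.line_bases

    first, last = byte_pos(start - 1), byte_pos(end - 1)
    with open(fasta_path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first + 1)
    # Remove the line endings that fall inside the requested span
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()
```

The index is built once per genome file; after that, each call to `fetch` reads only the bytes covering the requested region, which is what makes extracting many regions cheap.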


Thanks for bringing this up, I was not aware of it. It reminded me of the old SEALS package.


www.biopieces.org

14.1 years ago

The Extract Genomic DNA tool under the 'Fetch Sequences' menu in Galaxy will do this. Remember to set the correct human assembly build when you upload your data, and it will work automatically.

Galaxy is a great tool for working on coordinate based data and well worth learning.
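
If your report is not already in an interval format, you will probably need to convert it before uploading. Below is a small Python sketch that turns (chromosome, start, end) rows into BED lines. The function name `to_bed_lines` is made up for this example, and it assumes your report uses 1-based inclusive coordinates while BED uses 0-based, half-open coordinates. Whether chromosome names need a "chr" prefix depends on the build you select, so that is left as an option.

```python
from typing import Iterable, List, Tuple

# (chromosome, start, end); assumed to be 1-based and inclusive
Region = Tuple[str, int, int]


def to_bed_lines(regions: Iterable[Region], add_chr_prefix: bool = True) -> List[str]:
    """Convert 1-based inclusive regions to tab-separated BED lines."""
    lines = []
    for chrom, start, end in regions:
        if start < 1 or end < start:
            raise ValueError(f"invalid region: {chrom}:{start}-{end}")
        name = str(chrom)
        if add_chr_prefix and not name.startswith("chr"):
            name = "chr" + name
        # BED start is 0-based, so subtract 1; BED end is exclusive, so it
        # equals the 1-based inclusive end unchanged.
        lines.append(f"{name}\t{start - 1}\t{end}")
    return lines


if __name__ == "__main__":
    for line in to_bed_lines([("1", 100000, 110000), ("Y", 1, 90)]):
        print(line)
```

Save the printed lines to a file and upload it with the matching genome build selected.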


Yeah, if you are new to Galaxy, you can see our introductory tutorial (http://www.openhelix.com/galaxy). There are also some exercises that could get you started on using it.


damn someone beat me to recommending Galaxy ;)

14.1 years ago
Rm 8.3k

You can use Entrez Programming Utilities

For example, to retrieve "Homo sapiens chromosome Y" from nucleotide 1 to 90 on the reverse strand:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AC_000156&rettype=fasta&seq_start=1&seq_stop=90&strand=2

The URL breaks down as follows:

Base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?

Database: db=nucleotide

Sequence or chromosome ID: id=AC_000156

Format: rettype=fasta

Sequence start: seq_start=1

Sequence end: seq_stop=90

Strand, forward (1) or reverse (2): strand=2

>gnl|ASM:GCF_000000025|Y:c90-1 Homo sapiens chromosome Y, alternate assembly HuRef, whole genome shotgun sequence
CACCTGTAATCCCAGCACTTTGGGACACCGAGGTGGACAGATCACCTGAGGTCAGGAGTTCGAGACCAGC
CTGGCCAACTTGGTGAAACC

EFetch: Retrieves records in the requested format from a list of one or more unique identifiers. http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html
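
If you want to build these requests from a script, something like the following Python sketch should work. It only uses the parameters shown above (db, id, rettype, seq_start, seq_stop, strand). The function names `efetch_url` and `fetch_fasta` are made up for this example, and the `https` scheme is an assumption on my part, since the URLs in this answer use `http`.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: same host and path as the example URL above, but over https
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str, start: int, stop: int, reverse: bool = False) -> str:
    """Build an EFetch URL for a sub-sequence of a nucleotide record."""
    params = {
        "db": "nucleotide",
        "id": accession,
        "rettype": "fasta",
        "seq_start": start,
        "seq_stop": stop,
        "strand": 2 if reverse else 1,  # 1 = forward, 2 = reverse
    }
    return EFETCH_URL + "?" + urlencode(params)


def fetch_fasta(accession: str, start: int, stop: int, reverse: bool = False) -> str:
    """Download the region as FASTA text (needs network access)."""
    with urlopen(efetch_url(accession, start, stop, reverse)) as response:
        return response.read().decode()


if __name__ == "__main__":
    # Same request as the example above: chromosome Y, bases 1-90, reverse strand
    print(efetch_url("AC_000156", 1, 90, reverse=True))
```

If you have many regions, loop over them and pause between calls so you do not hammer the NCBI servers.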

14.1 years ago
Joachim ★ 2.9k

If you would like to retrieve the sequences automatically via a script or program, you can also use Ensembl's DAS server. Note that coordinates differ between genome assemblies, and Ensembl currently uses GRCh37. However, you can access Ensembl's archives to query older versions of the genome.

Anyway, you can retrieve sequences by fetching a URL like:

http://www.ensembl.org/das/Homo_sapiens.GRCh37.reference/sequence?segment=1:100000,110000

This will give you the sequence from base pairs 100000 to 110000 on chromosome 1. The abbreviated output is formatted as follows:

<DASSEQUENCE>
<SEQUENCE id="1" start="100000" stop="110000" version="1.0">
cactaagcacacagagaataatgtctagaatctgagtgccatgttatcaaattgtactga
gactcttgcagtcacacaggctgacatgtaagcatcgccatgcctagtacagactctccc
...
</SEQUENCE>
</DASSEQUENCE>
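
To pull the sequence out of a response like this in a script, a Python sketch along the following lines should do. It assumes the response has the structure shown above (SEQUENCE elements with id, start and stop attributes); the function name `parse_das_sequence` is made up for this example. It only parses text you have already downloaded, and it makes no claim about whether the DAS server is still online.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Union


def parse_das_sequence(xml_text: str) -> List[Dict[str, Union[str, int]]]:
    """Extract every SEQUENCE element from a DAS sequence response."""
    root = ET.fromstring(xml_text)
    results = []
    for node in root.iter("SEQUENCE"):
        # The bases are wrapped over several lines; strip all whitespace
        bases = "".join((node.text or "").split())
        results.append({
            "id": node.get("id"),
            "start": int(node.get("start")),
            "stop": int(node.get("stop")),
            "sequence": bases,
        })
    return results


if __name__ == "__main__":
    example = """<DASSEQUENCE>
<SEQUENCE id="1" start="100000" stop="110000" version="1.0">
cactaagcacacagagaataatgtctagaatctgagtgccatgttatcaaattgtactga
gactcttgcagtcacacaggctgacatgtaagcatcgccatgcctagtacagactctccc
</SEQUENCE>
</DASSEQUENCE>"""
    for record in parse_das_sequence(example):
        print(record["id"], record["start"], record["stop"], len(record["sequence"]))
```

The example string is the abbreviated output from above with the "..." line removed, so the sequence it contains is shorter than the requested region.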

Thanks for all the feedback! Galaxy seems to be just what I've been looking for.

14.1 years ago
Scott ▴ 10

The next update of NCBI2R (http://ncbi2r.wordpress.com) will include this feature as an R function called GetSequence, although that update won't be released until next week. It works by downloading the sequence for an accession number, and it can also handle chromosome- and position-based queries against the current build of the genome.

Disclaimer: it's my package. Caveat: that version isn't released just yet. I'm hoping to release it next week, along with some other new functions, in a new upgrade of the NCBI2R package.

6.9 years ago
klues009 • 0

Alternatively, I have run into issues doing this in R with the biomaRt package, so here's a workaround function for Ensembl:

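# Fetch a genomic region from Ensembl's export page as a DNAString.
# Requires the Biostrings package; vectorized over all arguments.
getSeq_ensembl <-
  Vectorize(
    function(chromosome, start, end, strand, species = "Homo_sapiens") {
      url <- paste0("https://useast.ensembl.org/", species, "/Export/Output/Location?db=core;flank3_display=0;flank5_display=0;output=fasta;r=",
                    chromosome, ":", start, "-", end, ";strand=", strand,
                    ";utr5=yes;cdna=yes;intron=yes;utr3=yes;peptide=yes;coding=yes;genomic=unmasked;exon=yes;_format=Text")
      fasta <- readLines(url)
      # Drop the ">" header and blank lines, then join the wrapped sequence
      # lines so that regions longer than one FASTA line are returned in full
      # (read.csv(url)[1,1] would keep only the first sequence line).
      bases <- fasta[nzchar(fasta) & !startsWith(fasta, ">")]
      Biostrings::DNAString(paste(bases, collapse = ""))
    },
    vectorize.args = c("chromosome", "start", "end", "strand", "species")
  )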