Question

Extract Sequence From The Genome?

4

14.2 years ago

Sam ▴ 90

Hello,

I have a report generated by some analysis tools that gives me chromosome start and end locations. Is there a tool that can quickly take these start/end locations and return the corresponding sequence from the human genome?

Thanks

sequence retrieval • 24k views

Written 14.2 years ago by Sam ▴ 90 • updated 7.0 years ago by klues009 • 0

0

Duplicate of this.

Reply • written 14.2 years ago by Pierre Lindenbaum 164k • updated 5.3 years ago by Ram 44k

0

The duplicate is now titled "How To Get The Sequence Of A Genomic Region From Ucsc?".

Reply • 6.1 years ago by hsiaoyi0504 ▴ 70

0

Thank you - we are in the process of correcting older links now and this comment was helpful.

Reply • 5.3 years ago by Ram 44k

score 6 · Answer 1 · 2010-10-26

14.2 years ago

Pascal ▴ 130

Hi,

If you need sequences from many positions, I would recommend setting up Biopieces. It requires downloading and indexing an entire genome, but after that you can extract many sequences very quickly.

http://code.google.com/p/biopieces/
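
For anyone curious what local extraction boils down to once the genome file is downloaded, here is a minimal Python sketch. It is not Biopieces and uses none of its commands; `read_fasta` and `extract` are hypothetical helper names. It assumes 1-based inclusive coordinates and a FASTA small enough to hold in memory, so a full human genome would call for a proper on-disk index instead.

```python
# Illustrative sketch only: these helpers are hypothetical and unrelated to Biopieces.
from typing import Dict, Iterable, List

# Translation table for complementing DNA (upper and lower case, N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Load FASTA records into memory, keyed by the first word of each header."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


def extract(genome: Dict[str, str], chrom: str, start: int, end: int, strand: str = "+") -> str:
    """Return bases start..end (1-based, inclusive); reverse-complement for '-'."""
    if start < 1 or end < start or end > len(genome[chrom]):
        raise ValueError("coordinates fall outside the sequence")
    seq = genome[chrom][start - 1:end]
    if strand == "-":
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq


if __name__ == "__main__":
    # Toy record standing in for a real chromosome file.
    toy = read_fasta([">chrT toy record", "ACGTAC", "GTTTGA"])
    print(extract(toy, "chrT", 3, 8))       # GTACGT
    print(extract(toy, "chrT", 3, 8, "-"))  # ACGTAC
```

With a real file you would pass an open file handle to `read_fasta` and then loop over the start/end pairs from your report.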

0

Thanks for bringing this up, I was not aware of it. It reminded me of the old SEALS package.

Reply • 14.2 years ago by Alastair Kerr 5.3k

0

www.biopieces.org

Reply • 13.6 years ago by Martin A Hansen 3.0k

score 6 · Answer 2 · 2010-10-26

14.2 years ago

Alastair Kerr 5.3k

The Extract Genomic DNA tool under the 'Fetch Sequences' menu in Galaxy will do this. Remember to set the correct human assembly build when you upload your data and it will work automatically.

Galaxy is a great tool for working on coordinate based data and well worth learning.

1

Yeah, if you are new to Galaxy you can see our introduction tutorial on that: http://www.openhelix.com/galaxy There are also some exercises that could get you started on using it.

Reply • 14.2 years ago by Mary 11k

0

damn someone beat me to recommending Galaxy ;)

Reply • 14.2 years ago by Will 4.6k

score 3 · Answer 3 · 2010-10-26

You can use the Entrez Programming Utilities (E-utilities).

For example, to retrieve nucleotides 1 to 90 of "Homo sapiens chromosome Y" on the reverse strand:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AC_000156&rettype=fasta&seq_start=1&seq_stop=90&strand=2

The parts of that URL are:

Base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?
Database: db=nucleotide
Sequence or chromosome ID: id=AC_000156
Format: rettype=fasta
Start nucleotide: seq_start=1
End nucleotide: seq_stop=90
Strand, forward (1) or reverse (2): strand=2

>gnl|ASM:GCF_000000025|Y:c90-1 Homo sapiens chromosome Y, alternate assembly HuRef, whole genome shotgun sequence
CACCTGTAATCCCAGCACTTTGGGACACCGAGGTGGACAGATCACCTGAGGTCAGGAGTTCGAGACCAGC
CTGGCCAACTTGGTGAAACC

EFetch: Retrieves records in the requested format from a list of one or more unique identifiers. http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html
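
If you want to script this, a rough Python sketch of the same request might look like the following. It uses only the parameters shown above; the function names are made up for illustration, the `https` scheme is my assumption (the URL above uses `http`), and I have not checked whether `AC_000156` still resolves today.

```python
# Sketch only: helper names are hypothetical; the query parameters come from the answer above.
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: https endpoint; the answer's example uses http.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(seq_id, start, stop, reverse=False):
    """Build an EFetch URL for a sub-sequence (strand=1 forward, strand=2 reverse)."""
    params = {
        "db": "nucleotide",
        "id": seq_id,
        "rettype": "fasta",
        "seq_start": start,
        "seq_stop": stop,
        "strand": 2 if reverse else 1,
    }
    return EFETCH + "?" + urlencode(params)


def fasta_sequence(text):
    """Join the sequence lines of a FASTA response, skipping '>' header lines."""
    return "".join(
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.startswith(">")
    )


def fetch_region(seq_id, start, stop, reverse=False, timeout=30):
    """Download the region and return the bare sequence (needs network access)."""
    with urlopen(efetch_url(seq_id, start, stop, reverse), timeout=timeout) as response:
        return fasta_sequence(response.read().decode("ascii"))
```

Calling `efetch_url("AC_000156", 1, 90, reverse=True)` reproduces the query string from the example, and `fetch_region` with the same arguments would download and clean up the record. For many regions, add a pause between requests so you stay within NCBI's usage limits.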

Ram · Answer 4 · 2010-10-26

If you would like to retrieve the sequences automatically via a script or program, you can also use Ensembl's DAS server. Note that coordinates differ between assemblies, and Ensembl currently uses GRCh37. However, you can access Ensembl's archives to query older versions of the genome.

Anyway, you can retrieve sequences by fetching a URL like:

http://www.ensembl.org/das/Homo_sapiens.GRCh37.reference/sequence?segment=1:100000,110000

This will give you the sequence from base pairs 100000 to 110000 on chromosome 1. The abbreviated output is formatted as follows:

<DASSEQUENCE>
<SEQUENCE id="1" start="100000" stop="110000" version="1.0">
cactaagcacacagagaataatgtctagaatctgagtgccatgttatcaaattgtactga
gactcttgcagtcacacaggctgacatgtaagcatcgccatgcctagtacagactctccc
...
</SEQUENCE>
</DASSEQUENCE>
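
To pull the bases out of a response shaped like the one above, something along these lines should do. This is a Python sketch with a hypothetical function name, and it assumes the response contains only the `DASSEQUENCE`/`SEQUENCE` elements and attributes shown in the example.

```python
# Sketch only: das_sequences is a hypothetical helper; element and attribute
# names are taken from the example output above.
import xml.etree.ElementTree as ET


def das_sequences(xml_text):
    """Return (id, start, stop, bases) for each SEQUENCE element in a DAS response."""
    root = ET.fromstring(xml_text)
    records = []
    for node in root.iter("SEQUENCE"):
        # The bases are wrapped over several lines; drop all whitespace.
        bases = "".join((node.text or "").split())
        records.append((node.get("id"), int(node.get("start")), int(node.get("stop")), bases))
    return records


if __name__ == "__main__":
    # Toy response with made-up coordinates and bases, same structure as above.
    toy = (
        "<DASSEQUENCE>\n"
        '<SEQUENCE id="1" start="5" stop="12" version="1.0">\n'
        "acgt\n"
        "ttga\n"
        "</SEQUENCE>\n"
        "</DASSEQUENCE>"
    )
    for seq_id, start, stop, bases in das_sequences(toy):
        # The number of bases should equal stop - start + 1.
        print(seq_id, start, stop, bases, len(bases) == stop - start + 1)
```

Comparing the length with `stop - start + 1` is a cheap way to catch a truncated download.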

score 1 · Answer 5 · 2010-10-26

The next update of NCBI2R (http://ncbi2r.wordpress.com) will have that feature as an R function called GetSequence. However, that update won't be released until next week. It works by downloading the sequence for an accession number, and it can also handle chromosome and position based queries based on the current build of the genome.

Disclaimer: it's my package. Caveat: that version isn't released just yet. I'm hoping to release it next week along with some other new functions in an upgrade of the NCBI2R package.

score 0 · Answer 6 · 2018-01-13

Alternatively, I have run into issues while doing this in R with the biomaRt package, so here's a workaround function for Ensembl:

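# Workaround: fetch a region from Ensembl's export page and return it as a DNAString.
# Requires the Biostrings package; the function is vectorized over all of its arguments.
getSeq_ensembl =
  Vectorize(
    function(chromosome, start, end, strand, species = "Homo_sapiens"){
      # Build the export URL for the requested region and strand
      url = paste0("https://useast.ensembl.org/", species, "/Export/Output/Location?db=core;flank3_display=0;flank5_display=0;output=fasta;r=",
                   chromosome, ":", start, "-", end, ";strand=", strand,
                   ";utr5=yes;cdna=yes;intron=yes;utr3=yes;peptide=yes;coding=yes;genomic=unmasked;exon=yes;_format=Text")
      # read.csv() consumes the first line as a header, so [1,1] is the first line after it.
      # If the sequence wraps over several lines, only that first line is returned.
      Biostrings::DNAString(read.csv(url)[1,1])
    },
    vectorize.args = c("chromosome", "start", "end", "strand", "species")
  )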