Question

Extract surrounding genome sequences given certain locations using R

1

Entering edit mode

8.3 years ago

DVA ▴ 640

Hello,

I wonder what is the best way/package to extract ~3k bps regions surrounding a certain given location in R. E.g. the location can be Human Chromosome 4 Position 50000. Then I'd like to have the sequence from chr4:47000-53000 of hg19/hg38.

Thank you.

R package genome sequence • 8.5k views

ADD COMMENT • link updated 8.3 years ago by ddiez ★ 2.0k • written 8.3 years ago by DVA ▴ 640

0

Entering edit mode

Do you have to do this with R specifically?

ADD REPLY • link 8.3 years ago by Joe 22k

0

Entering edit mode

Yes (unfortunately).

ADD REPLY • link 8.3 years ago by DVA ▴ 640

6

Entering edit mode

8.3 years ago

ddiez ★ 2.0k

An option is to use the BSgenome packages and the method getSeq.

library(BSgenome.Hsapiens.NCBI.GRCh38)
# this gives you the object Hsapiens, which you can inspect (output not included).
Hsapiens

# get the 3000 bases upstream of position chr4:47000.
getSeq(Hsapiens, "4", start = 47000 - 3000, end = 47000 - 1)
3000-letter "DNAString" instance
seq: TTTTCCTTTAAAATTGGGGTAATAGCCCATATCCTCCACATCTCAGAAGGTTCATTTTTATTGGTCCAAAGGAAG...CTGAGTTCTGTGAGTATTTTAATCAAATTCTCAAACTTGAGAAAGGGGTTATGGGAGTCCTCACTTCTTAGTAGC

ADD COMMENT • link 8.3 years ago by ddiez ★ 2.0k

0

Entering edit mode

Thank you very much.

ADD REPLY • link 8.3 years ago by DVA ▴ 640

0

Entering edit mode

I wrote a small script to extract sequences of multiple SNPs. getSeq function doesn't seem to support it. Any alternate solutions please?

ADD REPLY • link 6.5 years ago by bioinfo456 ▴ 150

0

Entering edit mode

suppressMessages(library(GenomicFeatures))
suppressMessages(library(BSgenome.Hsapiens.UCSC.hg19))

gr=GRanges(seqnames = paste0('chr',seq(1,3)),IRanges(start=c(1,100000,100000),end=c(2,100001,100001)))
tmp=Views(Hsapiens,gr)
sequence=DNAStringSet(as.character(tmp))

ADD REPLY • link 5.8 years ago by rob.tesf • 0

score 3 · Accepted Answer · 2017-01-17

3

Entering edit mode

8.3 years ago

Chun-Jie Liu ▴ 280

You can use biomaRt

GENES = useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host = "useast.ensembl.org")

YouRange<- getBM(attributes = c("chromosome_name", "start_position", "end_position"), filters = c("chromosome_name","start","end"), values = list(chromosome_name = "4", start = 50000-3000, end = 50000 + 3000), mart = GENES)

You can use getSequence get sequences.

getSequence(chromosome = 4, start = 50000-3000, end = 50000+3000,type='hgnc_symbol',seqType = 'gene_exon_intron', mart = GENES)

getBM helps you get the annotation information from Biomart, getSequence helps you get the sequence you need.

This is a simple example for example question, you can check details of biomaRt to solve your specific question.

ADD COMMENT • link 8.3 years ago by Chun-Jie Liu ▴ 280

0

Entering edit mode

Thank you! I'm going to take a look right now:)

ADD REPLY • link 8.3 years ago by DVA ▴ 640

0

Entering edit mode

Your code gives me an empty data.frame. I had to specify chromosome as "4" instead of "chr4".

ADD REPLY • link 8.3 years ago by ddiez ★ 2.0k

0

Entering edit mode

Thanks for your correctness. This is a simple example for using biomaRt. The previous version of Ensembl support 'chr' character. You can choose the Ensembl release or genome build by useMart or useEnsembl.

The Ensembl give the tutorial.

ADD REPLY • link 8.3 years ago by Chun-Jie Liu ▴ 280

0

Entering edit mode

No problem. I never know what type of nomenclature I have to use myself, since sometimes it seems to be "chr4" and sometimes "4". I though Ensembl was using "chr4" whereas UCSF style was "4" but it seems I was mistaken...

ADD REPLY • link 8.3 years ago by ddiez ★ 2.0k