Get Flanking Sequence Given A List Of Positions
4
6
Entering edit mode
13.7 years ago
Stephen 2.8k

Hi.

I have a list of ~1000 polymorphic positions (mapped to hg19/b37) spanning a gene region on chromosome 12. Only about half of these are in dbSNP - the rest are either novel or sequencing errors. I want to genotype these SNPs in a second sample, and to design the assays I need flanking sequence.

My question is - given a list of positions, only half of which are in dbSNP, how can I get flanking sequence?

I read the answers to a similar question from Kiriya. I've tried using Biomart, UCSC, and Galaxy, but get no results returned when I enter the positions in the formats required by those tools.

Thanks.

snp sequence biomart ucsc galaxy • 17k views
ADD COMMENT
0
Entering edit mode

how big is the gene region? Can't you just download a subsequence of fasta chr12 and parse that?

ADD REPLY
8
Entering edit mode
13.7 years ago
Adrian Cortes ▴ 550

I use R for this:

library('BSgenome.Hsapiens.UCSC.hg19')
chr <- 'chr19'
position <- 59900243
alleles <- '[T/C]'
offset <- 60

seq <- paste(getSeq(Hsapiens,chr,position-offset,position-1),
             alleles,
             getSeq(Hsapiens,chr,position+1,position+offset),
             sep='')

I hope that helps.

ADD COMMENT
1
Entering edit mode

This worked wonderfully. For anyone else who hasn't used bioconductor before, you'll need to run a few lines of code to get this working. source("http://www.bioconductor.org/biocLite.R"); source("http://www.bioconductor.org/biocLite.R"); biocLite("BSgenome"); biocLite("BSgenome.Hsapiens.UCSC.hg19"); library('BSgenome.Hsapiens.UCSC.hg19')

ADD REPLY
0
Entering edit mode

Adrian, thanks for the tip. This look like it will work. I'll give this a try after downloading the 850 MB human genome with biocLite("BSgenome.Hsapiens.UCSC.hg19").

ADD REPLY
7
Entering edit mode
13.7 years ago
brentp 24k

You can use bedtools to do this. Say you have a file positions.bed that looks like

  chr11 1234    1234
  chr11 91234   91234
  chr11 191234  191234
  chr11 391234  391234

and the corresponding fasta from your genome (here, chr11.fa) then the following script:

FLANK_BP=25
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
    "select chrom, size from hg19.chromInfo"  > hg19.genome
flankBed  -i positions.bed -g hg19.genome -b $FLANK_BP > flank.bed

fastaFromBed -fi chr11.fa -bed flank.bed -fo flank.fasta

will create flank.fasta with the sequences you are looking for. You can get different up/downstream distances from flankBed, so check out the options on that as well.

ADD COMMENT
0
Entering edit mode

Bedtools API has changed. Use :

  bedtools getfasta -fi chr11.fa -bed flank.bed -fo flank.fasta
ADD REPLY
4
Entering edit mode
13.7 years ago
Joachim ★ 2.9k

Please try BioMart again, because I just toyed around with the Ensembl Gene 61 mart and it does return parts of the flanking sequence.

  • go to http://www.biomart.org/biomart/martview
  • select 'ENSEMBL GENES 61' as database
  • select 'Homo sapiens genes' as dataset
  • click on 'Attributes' and then on the 'Sequences' radio-button
  • for the example I selected 'Flank (Gene)'
  • in 'Upstream flank' I entered 10 just as an example
  • click on 'Filters'
  • select chromosome 12
  • as base pairs I entered 131780000 and 131783000 (which spans over ENSG00000248703, 131780658 to 131782517)
  • click on 'Results'

The FASTA result is:

>ENSG00000248703|ENST00000508505
AAATCCCTGC

If you click on 'XML', you are given a small XML-document that you can use to query the mart programmatically (see http://www.biomart.org/user-docs.pdf). For this example, the XML looks like this:

<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE Query> [?]

<Dataset name = "hsapiens_gene_ensembl" interface = "default" >
    <Filter name = "upstream_flank" value = "10"/>
    <Filter name = "chromosome_name" value = "12"/>
    <Filter name = "end" value = "131783000"/>
    <Filter name = "start" value = "131780000"/>
    <Attribute name = "ensembl_gene_id" />
    <Attribute name = "ensembl_transcript_id" />
    <Attribute name = "gene_flank" />
</Dataset>

[?]

If you click on 'Perl', BioMart generates some Perl code for you as an alternative to running the XML-queries. It should be fairly easy to modify the Perl code to your needs.

ADD COMMENT
1
Entering edit mode

Hm. The strands are encoded as '1' and '-1'. Can you please try these values? I get a result back when I use '-1' instead of '+' in your example.

ADD REPLY
0
Entering edit mode

Thanks for the tip. I followed your instructions but I'm trying to use the "Multiple Chromosomal Regions" checkbox instead of the Chromosome/Base Pair check box. It says to enter (Chr:Start:End:Strand). When I input this:

12:131780658:131780658:+ 12:131782517:131782517:+

I get no results returned. I need to provide my assay designer with a list of positions, and the flanking region around each SNP:

sequence chrom position build GTCTC[T/C]CAGC 19 59900243 36 cctgc[A/G]tcagcc 19 59751810 36 cacatct[T/C]gtacag 19 59814896 36 CCACA[T/C]AGAC 19 59976309 36

One SNP per line.

ADD REPLY
0
Entering edit mode

That didn't format correctly. I need to give my assay designer something that looks like [?]this[?]:

http://pastebin.com/765gMUi6

ADD REPLY
0
Entering edit mode

Thanks for the tip. I followed your instructions but I'm trying to use the "Multiple Chromosomal Regions" checkbox instead of the Chromosome/Base Pair check box. It says to enter (Chr:Start:End:Strand). When I input this: 12:131780658:131780658:+ [?] 12:131782517:131782517:+ I get no results returned. I need to provide my assay designer with a list of positions, and the flanking region around each SNP:

http://pastebin.com/765gMUi6

ADD REPLY
0
Entering edit mode

Ah, that worked, but I'm still getting only a single flanking sequence back, not flanking info for each of the sites I enter.

ADD REPLY
0
Entering edit mode

True. If you are still interested in using BioMart, can you please raise this issue on the BioMart mailing list? Thanks. http://www.biomart.org/contact.html

ADD REPLY
0
Entering edit mode

Unfortunately BioMart will only return values anchored to the gene, if you go through Ensembl Genes, or variation, if you start with database Ensembl variations. When you enter a coordinate in Ensembl Genes, the filter will be applied for genes in that region. The flanking sequence will be exported for the gene, not for the original sequence position you entered.

You can use the Ensembl API to do this query- if you're interestedm it's the Core API (Perl)

http://www.ensembl.org/info/docs/api/core/index.html

http://www.ensembl.org/info/docs/api/core/core_tutorial.html

ADD REPLY
2
Entering edit mode
13.7 years ago
Bert Overduin ★ 3.7k

This is also pretty straightforward with the Ensembl Perl API:

#!/usr/bin/perl

use strict;
use warnings;

use Bio::EnsEMBL::Registry;

my $reg = 'Bio::EnsEMBL::Registry';

$reg->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

my $slice_adaptor = $reg->get_adaptor( 'human', 'core', 'slice');

my $chromosome = '7';
my $position = 24966446;
my $offset = 60;

my $slice = $slice_adaptor->fetch_by_region('chromosome', $chromosome, $position-$offset, $position+$offset);

print $slice->seq;
ADD COMMENT

Login before adding your answer.

Traffic: 2113 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6