BioMart: http://www.ensembl.org/biomart/martview/
We tend to advise that you retrieve data for around 500 genes per query, so you could split your list of 2000 genes into 4 blocks of 500 genes.
Your gene IDs will be the 'Filters' and then you can choose the required upstream sequence length as the Attribute.
Step 2:
Click 'Filters' in the menu on the left hand side.
Expand 'Gene' panel.
Check the 'Input external references ID list' box to add the filter and add the list of gene IDs into the text box (or upload a file with the list of gene IDs). Also, be sure to select the correct format of gene IDs that you have used as the input from the drop-down list.
Step 3:
Click 'Attributes' in the menu on the left hand side.
Click the 'Sequences' radio-button option.
Expand 'Sequences' panel.
Check the 'Flank (Gene)' option.
Check the 'Upstream flank' box and input the desired length into the text box.
N.B. You can also add useful information to the sequence header by selecting options in the 'Header Information' panel.
Step 4:
Click the 'Results' button in the top left-hand corner to view and download the results.
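If you would rather script this than click through the interface, the steps above correspond roughly to a BioMart XML query. Below is a minimal sketch that only builds the XML string. The filter and attribute names (`ensembl_gene_id`, `upstream_flank`, `gene_flank`) are what I recall the 'XML' button on the results page producing, and the dataset name and gene IDs are placeholders, so please check all of them against the XML generated for your own query.

```python
import xml.etree.ElementTree as ET


def build_flank_query(gene_ids, dataset, upstream=2000):
    """Build a BioMart XML query for the upstream flank of each gene.

    Assumptions (verify with the 'XML' button in BioMart):
    filter names 'ensembl_gene_id' and 'upstream_flank',
    attribute name 'gene_flank'.
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "FASTA",
        "header": "0",
        "uniqueRows": "0",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Step 2: the gene IDs are the filter (comma-separated list).
    ET.SubElement(ds, "Filter", {"name": "ensembl_gene_id", "value": ",".join(gene_ids)})
    # Step 3: 'Flank (Gene)' with an upstream flank of the desired length.
    ET.SubElement(ds, "Filter", {"name": "upstream_flank", "value": str(upstream)})
    ET.SubElement(ds, "Attribute", {"name": "gene_flank"})
    # Header information: put the gene ID in each FASTA header.
    ET.SubElement(ds, "Attribute", {"name": "ensembl_gene_id"})
    return ET.tostring(query, encoding="unicode")


# Placeholder dataset name and IDs, for illustration only.
xml_query = build_flank_query(["GENE_ID_1", "GENE_ID_2"], "your_species_gene_ensembl")
print(xml_query)
```

The resulting string would then be submitted to the BioMart web service as the `query` parameter, one block of gene IDs at a time.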
The promoter region is somewhat arbitrarily defined as anywhere from 200-300 bases to 2000 bases upstream of the annotated TSS (transcription start site) from Ensembl/RefSeq. So you can extract the 2 kb region upstream of the TSS and treat that as the promoter region. CAGE-seq is the best way to define a TSS; have a look at this paper https://www.ncbi.nlm.nih.gov/pubmed/24002785 and its data.
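To make "upstream of the TSS" concrete: the direction depends on the strand. Here is a small sketch of the coordinate arithmetic, assuming 1-based inclusive coordinates and strand encoded as +1/-1 (as in Ensembl). The function name and the example positions are made up for illustration.

```python
def promoter_interval(tss, strand, upstream=2000):
    """Return (start, end), 1-based inclusive, of the region upstream of a TSS.

    strand is +1 for forward-strand genes and -1 for reverse-strand genes.
    The TSS base itself is not included.
    """
    if strand == 1:
        # Upstream means smaller coordinates; clamp at the chromosome start.
        return max(1, tss - upstream), tss - 1
    if strand == -1:
        # For reverse-strand genes the TSS is the gene's largest coordinate,
        # so upstream means larger coordinates. The chromosome end is not
        # checked here because the chromosome length is not known.
        return tss + 1, tss + upstream
    raise ValueError("strand must be +1 or -1")


print(promoter_interval(10_000, 1))    # (8000, 9999)
print(promoter_interval(20_000, -1))   # (20001, 22000)
```

Note that for a reverse-strand gene the extracted sequence also has to be reverse-complemented to read 5' to 3' relative to the gene.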
Yes, but I have to do this for 2000 genes, so I cannot do it manually. Thank you for the paper.
Hi Seigfried,
You are right that regulatory feature annotation is only available for human and mouse in Ensembl at the moment. However, regulatory features are annotated independently of genes and are not directly linked to them anyway.
If you are interested in retrieving the 2 kb upstream sequence, though, you have a number of options, depending on your preferences:
BioMart: http://www.ensembl.org/biomart/martview/
We tend to advise that you retrieve data for around 500 genes per query, so you could split your list of 2000 genes into 4 blocks of 500 genes.
Your gene IDs will be the 'Filters' and then you can choose the required upstream sequence length as the Attribute.
BioMart tutorial and recorded demo are in the following documentation pages: http://www.ensembl.org/info/data/biomart/index.html
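Splitting the list into blocks of 500 is easy to do in a few lines. A minimal sketch follows; the gene IDs are generated placeholders, and you would read your own list from a file instead.

```python
def chunks(items, size=500):
    """Yield consecutive blocks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


gene_ids = [f"GENE{i:04d}" for i in range(2000)]  # placeholder IDs
blocks = list(chunks(gene_ids))
print(len(blocks), [len(b) for b in blocks])  # 4 [500, 500, 500, 500]
```

Each block can then be pasted (or uploaded) into the BioMart ID list filter as a separate query.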
Perl API: Use the slice adaptor to retrieve slices for your gene IDs: https://www.ensembl.org/info/docs/api/core/core_tutorial.html#slices
Then use the expand() method to define your upstream region of interest: http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1Slice.html#ad16f93a7bf30d48820f421012616d56b
REST API: http://rest.ensembl.org/documentation/info/sequence_id_post
Use the POST endpoint with the optional expand_5prime parameter.
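As a rough illustration of the REST route, the sketch below builds a POST request to `sequence/id` with `expand_5prime` and trims each returned sequence to the upstream part. It does not send anything by itself. Two assumptions to verify against the endpoint documentation: that the response is a list of records with `id` and `seq` fields, and that the sequence is returned in the gene's own orientation, so the 5' flank sits at the start. The helper names and gene IDs are placeholders.

```python
import json
import urllib.parse
import urllib.request

SERVER = "https://rest.ensembl.org"


def build_sequence_request(gene_ids, upstream=2000):
    """Build (but do not send) a POST request for genomic sequence plus 5' flank."""
    params = urllib.parse.urlencode({"type": "genomic", "expand_5prime": upstream})
    body = json.dumps({"ids": list(gene_ids)}).encode("utf-8")
    return urllib.request.Request(
        f"{SERVER}/sequence/id?{params}",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def upstream_only(records, upstream=2000):
    """Keep only the first `upstream` bases of each record.

    Assumes the sequence is oriented along the gene, so the flank comes first.
    """
    return {rec["id"]: rec["seq"][:upstream] for rec in records}


req = build_sequence_request(["GENE_ID_1", "GENE_ID_2"])  # placeholder IDs
print(req.get_method(), req.full_url)
```

To actually run it, you would pass `req` to `urllib.request.urlopen`, parse the JSON response, and hand the resulting list to `upstream_only`. The documentation lists a maximum number of IDs per POST, so 2000 genes would need to be sent in batches.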
I hope this helps you retrieve the data you need.
Best wishes,
Ben
Ensembl Helpdesk
Hello Ben,
Thank you for your reply.
I tried BioMart before, but I couldn't find a way to select sequences 2 kb upstream of the TSS.
In your post, you mention this specific point: 'choose the required upstream sequence length as the Attribute.' Where can I find this option?
As a trial run, I set the Coordinate attribute to Start -1, End -2000, like this: http://asia.ensembl.org/biomart/martview/515dd4c62e0c878ec268ed9894ad5c16
But this gives the gene coordinates with respect to the genome, which is not what I want. Could you please guide me?