How can I retrieve the 100 bp sequences upstream of the 5' UTR in a reference genome, for instance human?
Here's my approach, although please note there are much easier ways to do this using the programming language R and the associated packages. If you feel like trying that, have a look at this question for a general guide.
Go to the UCSC website and click on 'Table Browser' in the left-hand column. This will take you to a page where you can download various genome annotations. Enter your genome of interest, what assembly you require, and what track you prefer. Change output format to 'BED - browser extensible data' and then click on the 'get output' button. A new page will load and from here you can choose to create one BED record per feature - in your case click the '5' UTR exons' button. Then click the 'get BED' button to download a file of all the 5' UTR exons.
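You now need to get the coordinates of the 100 bp upstream of each feature. To do this I advise you to use bedtools, a command-line tool for working with BED files. This approach requires a file listing the size of each chromosome in your genome; you can use the fetchChromSizes script from UCSC for this purpose. In the flank command, -l 100 -r 0 takes 100 bp on the left side only, and -s makes "left" follow the strand, so for minus-strand features the flank lies past the end coordinate rather than before the start.
fetchChromSizes hg38 > hg38.txt
bedtools flank -i exons.bed -g hg38.txt -l 100 -r 0 -s > upstream.bed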
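To make the strand logic concrete, here is a minimal Python sketch of what a strand-aware upstream flank amounts to on BED coordinates (0-based, half-open). The function name upstream_flank and its arguments are made up for illustration and are not part of bedtools; the chromosome size is passed in by hand and only serves to clip the interval.

```python
def upstream_flank(chrom, start, end, strand, size, chrom_size):
    """Return the BED interval covering `size` bp upstream of a feature.

    Coordinates are 0-based, half-open, as in BED files.
    """
    if strand == "-":
        # Upstream of a minus-strand feature lies past its end coordinate.
        flank_start = end
        flank_end = min(chrom_size, end + size)
    else:
        # Upstream of a plus-strand feature lies before its start coordinate.
        flank_start = max(0, start - size)
        flank_end = start
    return (chrom, flank_start, flank_end, strand)


# Illustrative feature on chr1 (hg38 chr1 is 248,956,422 bp long).
print(upstream_flank("chr1", 1000, 2000, "+", 100, 248956422))  # ('chr1', 900, 1000, '+')
print(upstream_flank("chr1", 1000, 2000, "-", 100, 248956422))  # ('chr1', 2000, 2100, '-')
```

Clipping at 0 and at the chromosome size is the reason the chromosome sizes file is needed at all.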
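Now you need to download the reference genome you specified on the UCSC website (I normally download them from Illumina's iGenomes page). Make sure the chromosome names in the FASTA file match those in your BED file. With that done, you can use the following bedtools command to retrieve your sequences; the -s option reverse-complements the sequence for features on the minus strand, so every sequence reads 5' to 3' relative to its gene:
bedtools getfasta -s -fi hg38.fasta -bed upstream.bed -fo sequences.fasta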
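As a rough illustration of what the extraction step does, the sketch below slices an interval out of a sequence and reverse-complements it for minus-strand features, which is the behaviour bedtools getfasta provides with its -s option. The toy genome and the helper names reverse_complement and extract are assumptions made for the example; a real genome would be read from the FASTA file.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract(genome, chrom, start, end, strand):
    """Slice a 0-based, half-open interval; flip it for minus-strand features."""
    seq = genome[chrom][start:end]
    return reverse_complement(seq) if strand == "-" else seq


# Hypothetical 10 bp "chromosome" standing in for a real reference genome.
toy_genome = {"chrT": "AACCGGTTAC"}
print(extract(toy_genome, "chrT", 2, 6, "+"))   # CCGG
print(extract(toy_genome, "chrT", 6, 10, "-"))  # GTAA
```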
Go to the UCSC Genome Browser and open Tools -> Table Browser. Select your parameters, change the output format to 'sequence', and click 'get output'. On the next page, ask for 100 bp upstream and keep everything else as it is.
If you have many sequences to retrieve, you can use slices from the Ensembl API tools:
http://www.ensembl.org/info/docs/api/core/core_tutorial.html#coordinates
Upstream from what? The TSS of each gene?