retrieve 1000 bp upstream sequences
8.7 years ago
spaul8505 ▴ 20

How can I retrieve the 100 bp upstream sequences from a reference genome, for instance human? By upstream I mean upstream of the 5' UTR.

RNA-Seq • 6.1k views

Upstream from what? The TSS of each gene?

8.7 years ago
James Ashmore ★ 3.5k

Here's my approach, although please note there are much easier ways to do this using the programming language R and the associated packages. If you feel like trying that, have a look at this question for a general guide.

Download coordinates of 5' UTR exons

Go to the UCSC website and click on 'Table Browser' in the left-hand column. This will take you to a page where you can download various genome annotations. Enter your genome of interest, what assembly you require, and what track you prefer. Change output format to 'BED - browser extensible data' and then click on the 'get output' button. A new page will load and from here you can choose to create one BED record per feature - in your case click the '5' UTR exons' button. Then click the 'get BED' button to download a file of all the 5' UTR exons.

Get coordinates of 1000 bp region upstream of 5' UTR exons

You now need to get the 1000 bp upstream coordinates. To do this I advise you to use bedtools, which is a command-line tool to query BED files. This approach will require you to have a file which lists the size of each chromosome in your genome. You can use the fetchChromSizes script from UCSC for this purpose.

fetchChromSizes hg38 > hg38.txt
bedtools flank -l 1000 -i exons.bed -g hg38.txt > upstream.bed
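
One caveat on the flank step: unless strandedness is requested, `-l` is counted in forward-strand coordinates, so for minus-strand genes the flank lands on the wrong side of the exon. bedtools flank has a `-s` option that defines `-l` and `-r` relative to each feature's strand. As a minimal sketch of the arithmetic involved, the awk-based function below computes the strand-aware upstream interval itself. The function name `upstream_flank` is hypothetical, and it assumes BED6 input with the strand in column 6 and a two-column chromosome sizes file such as hg38.txt.

```shell
# Hypothetical helper (not part of bedtools): print the N bp upstream of each
# BED6 feature, taking the strand in column 6 into account.
# Usage: upstream_flank 1000 hg38.txt < exons.bed > upstream.bed
upstream_flank() {
    awk -v n="$1" 'BEGIN { FS = OFS = "\t" }
        # First file: chromosome sizes (name, length)
        NR == FNR { size[$1] = $2; next }
        # Standard input: BED6 records
        {
            if ($6 == "-") { start = $3; end = $3 + n }   # upstream lies to the right
            else           { start = $2 - n; end = $2 }   # upstream lies to the left
            # Clip to the chromosome boundaries
            if (start < 0) start = 0
            if (($1 in size) && end > size[$1]) end = size[$1]
            # Skip features that have no room left for a flank
            if (end > start) { $2 = start; $3 = end; print }
        }' "$2" -
}
```

Features with no strand information are treated as plus-strand here, which is an assumption of this sketch rather than documented bedtools behaviour.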

Get sequences of upstream regions

Now you need to download the reference genome you specified on the UCSC website (I normally download them from Illumina's iGenomes page). With that done, you can use the following bedtools command to retrieve your sequences:

bedtools getfasta -fi hg38.fasta -bed upstream.bed -fo sequences.fasta
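
Note that getfasta reports forward-strand sequence by default, so upstream regions of minus-strand genes come out in the opposite orientation to the gene. bedtools getfasta has a `-s` option that reverse-complements features on the antisense strand. To illustrate what that operation does, here is a small sketch; the function name `revcomp` is hypothetical, and it assumes bare sequences on standard input, one per line, with no FASTA header lines.

```shell
# Hypothetical helper: reverse-complement bare DNA sequences, one per line.
# Characters other than A, C, G, T (for example N) are reversed but left as-is.
revcomp() {
    rev | tr 'ACGTacgt' 'TGCAtgca'
}
```

For example, `printf 'AACG\n' | revcomp` reverses the sequence to GCAA and then complements each base.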

The current version of bedtools also requires the -r option:

bedtools flank -l 1000 -r 0 -i exons.bed -g hg38.txt > upstream.bed
8.7 years ago

Go to the UCSC Genome Browser and open Tools -> Table Browser. Select your parameters, change the output format to 'sequence', and click 'get output'. On the next page, add 100 bp upstream and keep everything else as it is.


This solution does not make sense.

8.7 years ago
Krisr ▴ 470

If you have many sequences to retrieve, you can use the slice tools in the Ensembl API.

http://www.ensembl.org/info/docs/api/core/core_tutorial.html#coordinates
