Wondering about easiest way to get a BED file of regions +/- 2kb around TSSs through table browser?
Wondering about easiest way to get a BED file of regions +/- 2kb around TSSs through table browser?
If you want to do this on the command line, you can do the following to get hg38
refSeq annotations (adjust for your genome of interest):
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz | gunzip -c - | awk 'BEGIN{ OFS="\t" }{ print $3, $5, $6, $13, $10, $4 }' - | sort-bed - > refGene.bed
Then split them by strand and pad around the stranded-start position of the annotation:
$ awk '($6 == "+") { print $0 }' refGene.bed | awk 'BEGIN{ OFS="\t" }($2 > 2000){ print $1, ($2 - 2000), ($2 + 2000), $4, $5, $6 }' > refGene.tss.for.padded.bed
$ awk '($6 == "-") { print $0 }' refGene.bed | awk 'BEGIN{ OFS="\t" }($3 > 2000){ print $1, ($3 - 2000), ($3 + 2000), $4, $5, $6 }' > refGene.tss.rev.padded.bed
$ bedops --everything refGene.tss.for.padded.bed refGene.tss.rev.padded.bed > refGene.tss.padded.bed
Generally, that could give problems if a padded TSS ends up going outside chromosomal bounds. You could filter them out with a mix of Kent and BEDOPS tools:
$ fetchChromSizes hg38 | awk '{ print $1"\t0\t"$2; }' | sort-bed - > hg38.bounds.bed
$ bedops --element-of 100% refGene.tss.padded.bed hg38.bounds.bed > refGene.tss.padded.filtered.bed
For 2k of padding, this is probably not a major issue. I don't believe one would generally find many genes (and their TSSs) within 2k of the tips of chromosomes. Nonetheless, this would deal with this scenario in a general way.
The general idea is to choose BED as the output format, and from the "get output" page add some padding bases to the "upstream" and "downstream" options. Unfortunately you can't add both upstream and downstream bases at the same time, so you'll have to merge the files when you're done. The txStart field of the gene track you're interested in will be the TSS.
Alternatively, you could choose the selected fields from primary and related tables option, and the select the chrom, txStart, and txEnd fields, and use awk or something similar to add/subtract 2kb to each position, but this gets somewhat tricky for negative strand genes, so the above option is recommended instead.
Please be aware that UCSC has a publicly archived mailing list that you can search to see if your question has been previously answered. Here is an example of a similar question from our archive: https://groups.google.com/a/soe.ucsc.edu/forum/#!searchin/genome/find$20TSS/genome/567C63DhAo0/9p5vA5PktAYJ
The general link to the mailing list is here, where you can search for and ask UCSC Genome Browser specific questions: https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome
Thanks,
ChrisL from the UCSC Genome Browser
Actually the awk script in the thread below does this really nicely with a simple BED file of all refSeq genes.
A: How to get promoter coordinates of hg19 from UCSC genome browser ?
In addition to the answers above, another way to do this is with bedtools slop. You'll need the bed file containing the TSS from UCSC table browser, as well as a genome file. This will add the requested amount of base pairs to each end, accounting for chromosome sizes, etc. using the genome file.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Very helpful thanks! I am trying to modify this with a subset of TSSs from genes I have highly expressed in an RNAseq experiment, what would be the most straightforward way to do that?
Once you make the file
refGene.tss.padded.bed
, you could filter this withbedops --element-of 1
. Say you have asort-bed
-sorted set of TSSs from your RNAseq experiment, which is calledrnaseq.tss.bed
:The file
filtered.refGene.tss.padded.bed
will contain a subset of TSSs fromrefGene.tss.padded.bed
that overlap the set of RNAseq-derived TSSs.Thanks again! One final question, in this particular instance if I want to use the refGene.tss.padded.bed file as a drop out, for instance take a set of intervals and drop out refGene.tss.padded.bed intervals what would be the most straightforward way to do that? Just want a final bed file without those intersections of "promoters".
The option
--not-element-of
or-n
may be what you're after. It works like the inverse of--element-of
. That is, given:The file
answer.bed
will contain elements fromrefGene.tss.padded.bed
that do not overlapintervals.bed
by one or more bases. This is the least stringent overlap criterion.If you want to filter out only those elements that overlap completely, use
--not-element-of 100%
or-n 100%
instead of-n 1
. This case is the most stringent overlap criterion — all bases of an element ofrefGene.tss.padded.bed
must overlap an element inintervals.bed
to be filtered out.Its strange but when I do this backwards and forwards with my refGene.tss.padded.bed and target "intervals.bed" I get that answer.bed is a full overlap, for instance if my intervals.bed has 1000 intervals and refGene.tss.padded.bed has 39702 intervals and I do the following:
Tried it with other data I have and the same happens. I should mention that I know some of those 1000 intervals fall within +/- 2kb of TSS. Am I doing something incorrectly? Thanks!
I would suggest checking that both inputs are sorted per
sort-bed
. I would also do a similarbedops -e 1
andwc -l
test. You should get the inverse result, ie if there are 1000 elements and you expect that 800 of them do not overlap another set, then the remaining 200 should, if that makes sense.If I have a list of common gene names such as the following:
Can the approach be modified to only get the TSS information for those particular genes? I really like the way bedops does it here but couldn't figure this out. Thank you!
Can you tell me what assembly you're working with? I'm guessing mouse, but I don't know which one. Let me know and I'll paste in some scripts that might help you.
Here's one way you might get mouse gene TSSs for
mm10
orGRCm38
:Then to filter them:
this should also work for gencode gtf files as well
Yes I am using mm10, this is very helpful thank you!
One final question on this, I am trying to see if these particular peaks are within a certain distance of the TSSs this code allows me to isolate, lets say 1MB on either side but if I wanted to to adjust in the downstream direction also for the length of the gene is there a way that closest-feature can accomplish this?
I'm sorry, but I don't understand your question. Can you describe a little more what operation you're trying to do, or what the adjustment is?
Basically Id like to take this set of filtered TSSs and use
closest-features
to filter a peak list for those peaks that fall within lets say 1 mb of this TSS list, however adjusting the downstream measure by the length of the gene harboring the TSS. Does that sort of make sense? Basically I guess its a modification of the--dist
flag in closest-features.Removing
--closest
and using the--dist
operator will report distances for the nearest elements upstream and downstream of the padded TSS.You could pipe the result of that to
awk
or a Python script, usingsplit()
or similar function to chop up each result line, and use the distance measure with the size of the gene to apply your logic, to decide what to do with that result.The output of
--dist
uses the the|
delimiter, useful for downstream parsing. You can use the--delim <delimiter>
operation to change this delimiter.Note that adding
--no-overlaps
can help to avoid reporting an overlapping element as a "nearest neighbor". You might want to be careful of those guys.There are a couple files to play with at the bottom of the documentation page, which can help with a bit more of a concrete demo: http://bedops.readthedocs.io/en/latest/content/reference/set-operations/closest-features.html
Very helpful thank you!
Is there a way to set the
--delim
to establish the distance as a separate column?You could use
\t
(tab) to make distance its own column in a tab-delimited file.Yes I actually tried this (--delim \t) but comes up looking like the following:
So its taking the last column in my peak file (FDR) and adding on the distance. Kind of odd.
Try adding ticks:
--delim '\t'
.The gencode retrieval you suggest here is very helpful, I inputted a .txt file of 380 common gene names and received back a
gencode.vM18.tss.filtered.bed
that had 359 IDs. Is there a way to easily identify which were not converted and maybe a way to add those as well? Or do you do this manually? I assume its a lot of riken type things but unsure. Thanks again!You could extract the list of gene names that were matched:
Then
grep -v
to find genes fromgenes.txt
that do not have matches inmatches.txt
:Then maybe explore
not-found.txt
to see what tweaks are needed, if any can be done, to relax your search parameters.