Table browser +/- 2Kb of TSS export
3
1
Entering edit mode
7.6 years ago
rbronste ▴ 420

Wondering about easiest way to get a BED file of regions +/- 2kb around TSSs through table browser?

UCSC table broswer • 7.0k views
ADD COMMENT
5
Entering edit mode
7.6 years ago

If you want to do this on the command line, you can do the following to get hg38 refSeq annotations (adjust for your genome of interest):

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz | gunzip -c - | awk 'BEGIN{ OFS="\t" }{ print $3, $5, $6, $13, $10, $4  }' - | sort-bed - > refGene.bed

Then split them by strand and pad around the stranded-start position of the annotation:

$ awk '($6 == "+") { print $0 }' refGene.bed | awk 'BEGIN{ OFS="\t" }($2 > 2000){ print $1, ($2 - 2000), ($2 + 2000), $4, $5, $6  }' > refGene.tss.for.padded.bed
$ awk '($6 == "-") { print $0 }' refGene.bed | awk 'BEGIN{ OFS="\t" }($3 > 2000){ print $1, ($3 - 2000), ($3 + 2000), $4, $5, $6  }' > refGene.tss.rev.padded.bed
$ bedops --everything refGene.tss.for.padded.bed refGene.tss.rev.padded.bed > refGene.tss.padded.bed

Generally, that could give problems if a padded TSS ends up going outside chromosomal bounds. You could filter them out with a mix of Kent and BEDOPS tools:

$ fetchChromSizes hg38 | awk '{ print $1"\t0\t"$2; }' | sort-bed - > hg38.bounds.bed
$ bedops --element-of 100% refGene.tss.padded.bed hg38.bounds.bed > refGene.tss.padded.filtered.bed

For 2k of padding, this is probably not a major issue. I don't believe one would generally find many genes (and their TSSs) within 2k of the tips of chromosomes. Nonetheless, this would deal with this scenario in a general way.

ADD COMMENT
0
Entering edit mode

Very helpful thanks! I am trying to modify this with a subset of TSSs from genes I have highly expressed in an RNAseq experiment, what would be the most straightforward way to do that?

ADD REPLY
0
Entering edit mode

Once you make the file refGene.tss.padded.bed, you could filter this with bedops --element-of 1. Say you have a sort-bed-sorted set of TSSs from your RNAseq experiment, which is called rnaseq.tss.bed:

$ sort-bed rnaseq.tss.unsorted.bed > rnaseq.tss.bed
$ bedops --element-of 1 refGene.tss.padded.bed rnaseq.tss.bed > filtered.refGene.tss.padded.bed

The file filtered.refGene.tss.padded.bed will contain a subset of TSSs from refGene.tss.padded.bed that overlap the set of RNAseq-derived TSSs.

ADD REPLY
0
Entering edit mode

Thanks again! One final question, in this particular instance if I want to use the refGene.tss.padded.bed file as a drop out, for instance take a set of intervals and drop out refGene.tss.padded.bed intervals what would be the most straightforward way to do that? Just want a final bed file without those intersections of "promoters".

ADD REPLY
1
Entering edit mode

The option --not-element-of or -n may be what you're after. It works like the inverse of --element-of. That is, given:

$ bedops -n 1 refGene.tss.padded.bed intervals.bed > answer.bed

The file answer.bed will contain elements from refGene.tss.padded.bed that do not overlap intervals.bed by one or more bases. This is the least stringent overlap criterion.

If you want to filter out only those elements that overlap completely, use --not-element-of 100% or -n 100% instead of -n 1. This case is the most stringent overlap criterion — all bases of an element of refGene.tss.padded.bed must overlap an element in intervals.bed to be filtered out.

ADD REPLY
0
Entering edit mode

Its strange but when I do this backwards and forwards with my refGene.tss.padded.bed and target "intervals.bed" I get that answer.bed is a full overlap, for instance if my intervals.bed has 1000 intervals and refGene.tss.padded.bed has 39702 intervals and I do the following:

bedops -n 1 refGene.tss.padded.bed intervals.bed > answer.bed
wc -l answer.bed
39702 answer.bed

Tried it with other data I have and the same happens. I should mention that I know some of those 1000 intervals fall within +/- 2kb of TSS. Am I doing something incorrectly? Thanks!

ADD REPLY
0
Entering edit mode

I would suggest checking that both inputs are sorted per sort-bed. I would also do a similar bedops -e 1 and wc -l test. You should get the inverse result, ie if there are 1000 elements and you expect that 800 of them do not overlap another set, then the remaining 200 should, if that makes sense.

ADD REPLY
0
Entering edit mode

If I have a list of common gene names such as the following:

Kcnip1
Nat9
Zfp523
Akap1
Kif19a
St18
Mettl1
Uba7
Pde3b
Gstm7
H13
Slc20a1
Lrig2
Elovl2
Zfp185
Trmt112
Egfr
Klhdc1
Cand2
Macrod1
Kcns1
Slc35a5
Anp32b

Can the approach be modified to only get the TSS information for those particular genes? I really like the way bedops does it here but couldn't figure this out. Thank you!

ADD REPLY
1
Entering edit mode

Can you tell me what assembly you're working with? I'm guessing mouse, but I don't know which one. Let me know and I'll paste in some scripts that might help you.

ADD REPLY
1
Entering edit mode

Here's one way you might get mouse gene TSSs for mm10 or GRCm38:

$ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M18/gencode.vM18.annotation.gff3.gz \
    | gunzip --stdout - \
    | awk '$3 == "gene"' - \
    | convert2bed -i gff - \
    | awk -vwindow=1 -vOFS="\t" '($6=="+"){ print $1, ($2 - window), $2, $4, ".", $6, $7, $8, $9, $10 }($6=="-"){ print $1, $3, ($3 + window), $4, ".", $6, $7, $8, $9, $10 }' \
    > gencode.vM18.tss.bed

Then to filter them:

$ grep -w -F -i -f genes.txt gencode.vM18.tss.bed > gencode.vM18.tss.filtered.bed
ADD REPLY
1
Entering edit mode

this should also work for gencode gtf files as well

ADD REPLY
0
Entering edit mode

Yes I am using mm10, this is very helpful thank you!

ADD REPLY
0
Entering edit mode

One final question on this, I am trying to see if these particular peaks are within a certain distance of the TSSs this code allows me to isolate, lets say 1MB on either side but if I wanted to to adjust in the downstream direction also for the length of the gene is there a way that closest-feature can accomplish this?

ADD REPLY
1
Entering edit mode

I'm sorry, but I don't understand your question. Can you describe a little more what operation you're trying to do, or what the adjustment is?

ADD REPLY
0
Entering edit mode

Basically Id like to take this set of filtered TSSs and use closest-features to filter a peak list for those peaks that fall within lets say 1 mb of this TSS list, however adjusting the downstream measure by the length of the gene harboring the TSS. Does that sort of make sense? Basically I guess its a modification of the --dist flag in closest-features.

ADD REPLY
1
Entering edit mode

Removing --closest and using the --dist operator will report distances for the nearest elements upstream and downstream of the padded TSS.

You could pipe the result of that to awk or a Python script, using split() or similar function to chop up each result line, and use the distance measure with the size of the gene to apply your logic, to decide what to do with that result.

The output of --dist uses the the | delimiter, useful for downstream parsing. You can use the --delim <delimiter> operation to change this delimiter.

Note that adding --no-overlaps can help to avoid reporting an overlapping element as a "nearest neighbor". You might want to be careful of those guys.

There are a couple files to play with at the bottom of the documentation page, which can help with a bit more of a concrete demo: http://bedops.readthedocs.io/en/latest/content/reference/set-operations/closest-features.html

ADD REPLY
0
Entering edit mode

Very helpful thank you!

Is there a way to set the --delim to establish the distance as a separate column?

ADD REPLY
1
Entering edit mode

You could use \t (tab) to make distance its own column in a tab-delimited file.

ADD REPLY
0
Entering edit mode

Yes I actually tried this (--delim \t) but comes up looking like the following:

0.000758255664811437t-42960

So its taking the last column in my peak file (FDR) and adding on the distance. Kind of odd.

ADD REPLY
1
Entering edit mode

Try adding ticks: --delim '\t'.

ADD REPLY
0
Entering edit mode

The gencode retrieval you suggest here is very helpful, I inputted a .txt file of 380 common gene names and received back a gencode.vM18.tss.filtered.bed that had 359 IDs. Is there a way to easily identify which were not converted and maybe a way to add those as well? Or do you do this manually? I assume its a lot of riken type things but unsure. Thanks again!

ADD REPLY
0
Entering edit mode

You could extract the list of gene names that were matched:

$ awk -F";" '{sub(/gene_name=/,""); print $4;}' gencode.vM18.tss.filtered.bed > matches.txt

Then grep -v to find genes from genes.txt that do not have matches in matches.txt:

$ grep -v -w -F -i -f matches.txt genes.txt > not-found.txt

Then maybe explore not-found.txt to see what tweaks are needed, if any can be done, to relax your search parameters.

ADD REPLY
1
Entering edit mode
7.6 years ago
genecats.ucsc ▴ 580

The general idea is to choose BED as the output format, and from the "get output" page add some padding bases to the "upstream" and "downstream" options. Unfortunately you can't add both upstream and downstream bases at the same time, so you'll have to merge the files when you're done. The txStart field of the gene track you're interested in will be the TSS.

Alternatively, you could choose the selected fields from primary and related tables option, and the select the chrom, txStart, and txEnd fields, and use awk or something similar to add/subtract 2kb to each position, but this gets somewhat tricky for negative strand genes, so the above option is recommended instead.

Please be aware that UCSC has a publicly archived mailing list that you can search to see if your question has been previously answered. Here is an example of a similar question from our archive: https://groups.google.com/a/soe.ucsc.edu/forum/#!searchin/genome/find$20TSS/genome/567C63DhAo0/9p5vA5PktAYJ

The general link to the mailing list is here, where you can search for and ask UCSC Genome Browser specific questions: https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome

Thanks,

ChrisL from the UCSC Genome Browser

ADD COMMENT
0
Entering edit mode

Actually the awk script in the thread below does this really nicely with a simple BED file of all refSeq genes.

A: How to get promoter coordinates of hg19 from UCSC genome browser ?

ADD REPLY
0
Entering edit mode
7.6 years ago
mmmmcandrew ▴ 200

In addition to the answers above, another way to do this is with bedtools slop. You'll need the bed file containing the TSS from UCSC table browser, as well as a genome file. This will add the requested amount of base pairs to each end, accounting for chromosome sizes, etc. using the genome file.

ADD COMMENT

Login before adding your answer.

Traffic: 2085 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6