Question

Annotating bed file with gene names

1

Entering edit mode

8.7 years ago

onspotproductions ▴ 150

I generated a bed file form some splicing data and attempted to use both bed tools intersect and bedtools annotate with a master list from UCSC. I have not been able to get any annotations. Below are examples of the data.

chr21   40801057    40801137        0   -   40801057    40801137
chr21   34953608    34953806        0   -   34953608    34953806
chr21   45370568    45370755        0   +   45370568    45370755
chr21   37692487    37692601        0   +   37692487    37692601
chr21   38528957    38529208        0   +   38528957    38529208
chr21   35191545    35191627        0   +   35191545    35191627
chr21   30565815    30566581        0   +   30565815    30566581

From UCSC

chr1    11873   14409   uc001aaa.3  0   +   11873   11873   0   3   354,109,1189,   0,739,1347,
chr1    11873   14409   uc010nxr.1  0   +   11873   11873   0   3   354,52,1189,    0,772,1347,
chr1    11873   14409   uc010nxq.1  0   +   12189   13639   0   3   354,127,1007,   0,721,1529,
chr1    14361   16765   uc009vis.3  0   -   14361   14361   0   4   468,69,147,159, 0,608,1434,2245,
chr1    16857   17751   uc009vjc.1  0   -   16857   16857   0   2   198,519,    0,375,

RNA-Seq bed annotate gene • 17k views

ADD COMMENT • link updated 8.7 years ago by chen ★ 2.5k • written 8.7 years ago by onspotproductions ▴ 150

0

Entering edit mode

Can you just select the data and press Ctrl+k

ADD REPLY • link 8.7 years ago by venu 7.1k

0

Entering edit mode

Why not? maybe the data doesn't intersect! Also, what is the full command that you used?

ADD REPLY • link 8.7 years ago by Sukhi Singh 11k

0

Entering edit mode

likely possible that your project data might not correspond to that of the UCSC data but for others to give an idea please try to format the question as venu said and then we can play with the to bed format files. But having said that are both the bed files generated from the same master source? Please mention what is the first bed input and the second bed input after formatting and then if they are properly tab delimited the intersect should run fine and give some output unless there is no intersection.

ADD REPLY • link 8.7 years ago by ivivek_ngs ★ 5.2k

0

Entering edit mode

Both are from UCSC hg19. The first was manually created using output data from MISO, specifically events name column. This was then separated into tab columns that conformed with bed format. I primarily need to add gene symbols to this data because we are only provided with the exons that underwent splicing and not which genes were splice regulated.

ADD REPLY • link 8.7 years ago by onspotproductions ▴ 150

0

Entering edit mode

First of all your input file examples will not have any intersection as the bed intersect is dependent on first 3 tab columns of both input files and in your example I do not see any match as the data is having chr 21 in one case and in the other you have chr 1, try to see in both the files with a grep if you have all chromosomes or not and then do the bedtools intersect using -wa -wb to get the other desired columns. However you do not have gene symbols in any of the file so you cannot have them with bedtools intersect, in any case once you have the intersection you can annotate the intersected bed file with ensemble or you can download the hg19 UCSC bed file with gene symbols and awk the file having chr, start, end , refseq_id, gene symbols and then intersect with bedtools of the first file and see if there is any intersection or not.

ADD REPLY • link 8.7 years ago by ivivek_ngs ★ 5.2k

0

Entering edit mode

What was the output of bedtools intersect? An empty file, or intersection without the annotation?

ADD REPLY • link 8.7 years ago by harold.smith.tarheel ★ 5.0k

0

Entering edit mode

Run the dos2unix utility on both the files. Make sure there is no header in the file downloaded from UCSC. Then try using bedtools intersect.

ADD REPLY • link 8.7 years ago by Sandeep ▴ 260

score 4 · Answer 1 · 2016-04-19

4

Entering edit mode

8.7 years ago

Alex Reynolds 36k

Sort the two inputs:

$ sort-bed my-regions.unsorted.bed > my-regions.bed
$ sort-bed ucsc-regions.unsorted.bed > ucsc-regions.bed

Then map the UCSC regions to your regions of interest with bedmap. If you only need gene IDs from the UCSC regions, use the --echo-map-id-uniq option:

$ bedmap --echo --echo-map-id-uniq --delim '\t' my-regions ucsc-regions.bed > answer.bed

ADD COMMENT • link 8.7 years ago by Alex Reynolds 36k

0

Entering edit mode

This worked for what I wanted to do thank you.

ADD REPLY • link 8.7 years ago by onspotproductions ▴ 150

score 2 · Answer 2 · 2016-04-20

Locating genes by positions is very easy by using OpenGene, which queries the gene name by chromosome position from gencode database.

Here gives an example with only 3 code lines:

using OpenGene, OpenGene.Reference

index = gencode_load("GRCh37")

gencode_locate(index, "chr5", 149526621)

# it will return
# 1-element Array{Any,1}:
#  Dict{ASCIIString,Any}("gene"=>"PDGFRB","number"=>1,"transcript"=>"ENST00000261799.4","type"=>"intron")

You can loop the lines in the bed file, get their genes and make a new bed file with gene names located

score 1 · Answer 3 · 2016-04-19

1

Entering edit mode

8.7 years ago

ablanchetcohen ★ 1.2k

I always use the script annotatePeaks.pl from Homer. when I want to annotate a BED file with the gene annotation. Simple, quick, flexible and effective. I've tried other solutions but always go back to annotatePeaks.pl.

http://homer.salk.edu/homer/ngs/annotation.html

ADD COMMENT • link 8.7 years ago by ablanchetcohen ★ 1.2k

0

Entering edit mode

I didn't think about using homer, I used it before and it worked well. It did exactly what I need it to. Thank you.

ADD REPLY • link 8.7 years ago by onspotproductions ▴ 150