Annotating bed file with gene names
3
1
Entering edit mode
8.7 years ago

I generated a bed file form some splicing data and attempted to use both bed tools intersect and bedtools annotate with a master list from UCSC. I have not been able to get any annotations. Below are examples of the data.

chr21   40801057    40801137        0   -   40801057    40801137
chr21   34953608    34953806        0   -   34953608    34953806
chr21   45370568    45370755        0   +   45370568    45370755
chr21   37692487    37692601        0   +   37692487    37692601
chr21   38528957    38529208        0   +   38528957    38529208
chr21   35191545    35191627        0   +   35191545    35191627
chr21   30565815    30566581        0   +   30565815    30566581

From UCSC

chr1    11873   14409   uc001aaa.3  0   +   11873   11873   0   3   354,109,1189,   0,739,1347,
chr1    11873   14409   uc010nxr.1  0   +   11873   11873   0   3   354,52,1189,    0,772,1347,
chr1    11873   14409   uc010nxq.1  0   +   12189   13639   0   3   354,127,1007,   0,721,1529,
chr1    14361   16765   uc009vis.3  0   -   14361   14361   0   4   468,69,147,159, 0,608,1434,2245,
chr1    16857   17751   uc009vjc.1  0   -   16857   16857   0   2   198,519,    0,375,
RNA-Seq bed annotate gene • 17k views
ADD COMMENT
0
Entering edit mode

Can you just select the data and press Ctrl+k

ADD REPLY
0
Entering edit mode

Why not? maybe the data doesn't intersect! Also, what is the full command that you used?

ADD REPLY
0
Entering edit mode

likely possible that your project data might not correspond to that of the UCSC data but for others to give an idea please try to format the question as venu said and then we can play with the to bed format files. But having said that are both the bed files generated from the same master source? Please mention what is the first bed input and the second bed input after formatting and then if they are properly tab delimited the intersect should run fine and give some output unless there is no intersection.

ADD REPLY
0
Entering edit mode

Both are from UCSC hg19. The first was manually created using output data from MISO, specifically events name column. This was then separated into tab columns that conformed with bed format. I primarily need to add gene symbols to this data because we are only provided with the exons that underwent splicing and not which genes were splice regulated.

ADD REPLY
0
Entering edit mode

First of all your input file examples will not have any intersection as the bed intersect is dependent on first 3 tab columns of both input files and in your example I do not see any match as the data is having chr 21 in one case and in the other you have chr 1, try to see in both the files with a grep if you have all chromosomes or not and then do the bedtools intersect using -wa -wb to get the other desired columns. However you do not have gene symbols in any of the file so you cannot have them with bedtools intersect, in any case once you have the intersection you can annotate the intersected bed file with ensemble or you can download the hg19 UCSC bed file with gene symbols and awk the file having chr, start, end , refseq_id, gene symbols and then intersect with bedtools of the first file and see if there is any intersection or not.

ADD REPLY
0
Entering edit mode

What was the output of bedtools intersect? An empty file, or intersection without the annotation?

ADD REPLY
0
Entering edit mode

Run the dos2unix utility on both the files. Make sure there is no header in the file downloaded from UCSC. Then try using bedtools intersect.

ADD REPLY
4
Entering edit mode
8.7 years ago

Sort the two inputs:

$ sort-bed my-regions.unsorted.bed > my-regions.bed
$ sort-bed ucsc-regions.unsorted.bed > ucsc-regions.bed

Then map the UCSC regions to your regions of interest with bedmap. If you only need gene IDs from the UCSC regions, use the --echo-map-id-uniq option:

$ bedmap --echo --echo-map-id-uniq --delim '\t' my-regions ucsc-regions.bed > answer.bed
ADD COMMENT
0
Entering edit mode

This worked for what I wanted to do thank you.

ADD REPLY
2
Entering edit mode
8.7 years ago
chen ★ 2.5k

Locating genes by positions is very easy by using OpenGene, which queries the gene name by chromosome position from gencode database.

Here gives an example with only 3 code lines:

using OpenGene, OpenGene.Reference

index = gencode_load("GRCh37")

gencode_locate(index, "chr5", 149526621)

# it will return
# 1-element Array{Any,1}:
#  Dict{ASCIIString,Any}("gene"=>"PDGFRB","number"=>1,"transcript"=>"ENST00000261799.4","type"=>"intron")

You can loop the lines in the bed file, get their genes and make a new bed file with gene names located

ADD COMMENT
1
Entering edit mode
8.7 years ago
ablanchetcohen ★ 1.2k

I always use the script annotatePeaks.pl from Homer. when I want to annotate a BED file with the gene annotation. Simple, quick, flexible and effective. I've tried other solutions but always go back to annotatePeaks.pl.

http://homer.salk.edu/homer/ngs/annotation.html

ADD COMMENT
0
Entering edit mode

I didn't think about using homer, I used it before and it worked well. It did exactly what I need it to. Thank you.

ADD REPLY

Login before adding your answer.

Traffic: 1565 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6