Annotate .bed file with gene names
3
3
Entering edit mode
10.2 years ago
stevenlang123 ▴ 210

Hi y'all,

I have a .bed file file currently formatted like this:

chr     start   stop
1    14520    14812
1    65409    65725
1    65731    66073
1    69381    69700
1    721281    722042
1    752816    753135

I would like to get something that looks like this (where for overlapping exons, use the exon boundary, unless this boundary was <10bp, in which case expand the probes to include at minimum 10bp of sequence):

chr start   stop    name
1   69090   70008   OR4F5
1   565876  566576  
1   801642  802733  
1   861321  861393  SAMD11
1   865534  865716  SAMD11

Is there a way that I can use UCSC or another tool to accomplish this?

Assembly next-gen genome • 20k views
ADD COMMENT
5
Entering edit mode
10.2 years ago

Grab exons, e.g. via GENCODE:

$ wget ftp://ftp.sanger.ac.uk/pub/gencode/release_15/gencode.v15.annotation.gtf.gz
$ gunzip --stdout gencode.v21.annotation.gtf.gz \
    | gtf2bed - \
    | grep "exon" \
    > gencode.exons.bed

Then use bedmap to map exon IDs to your regions of interest (roi.bed):

$ bedmap --echo --echo-map-id-uniq roi.bed gencode.exons.bed > answer.bed

If you need to, use awk to pre-process the exons file based on your criteria:

$ awk '{if (($3-$2) > 10) { print $0 } else { print $1"\t"$2"\t"($2+10)"\t"$4}}' gencode.exons.bed > expanded.exons.bed

Then map on that result.

ADD COMMENT
0
Entering edit mode

Improving this with a pipe from curl to gunzip (without writing gtf.gz):

curl -s "ftp://ftp.sanger.ac.uk/pub/gencode/release_21/gencode.21.annotation.gtf.gz" |
   gunzip -c |
   gtf2bed - |
   <your code>

:-)

ADD REPLY
0
Entering edit mode

Thanks for the feedback. Sometimes it is useful to keep the annotations file, but keep the compressed version to save disk space, and extracting it only as needed to do analyses. Particularly, network access to download a large file can be a costly part of analyses, in terms of time, especially repeating it unnecessarily. In any case, there are lots of ways to use wget or curl to follow either approach.

ADD REPLY
4
Entering edit mode
10.2 years ago
Manvendra Singh ★ 2.2k

You can download the bed files of GTF from UCSC carrying gene names in separate column.

Once you got the bed file (or download GTF and convert it into bed format).

then you should extend your bed file with 10 basepairs by some awk command

awk '{ print $1,$2-10,$3+10}' OFS="\t" your_file.bed > new_file.bed

(if you have strand info then should do it taking strands care of)

then intersect with downloaded file

bedtools intersect -a new_file.bed -b downloaded_file.bed -f 1 > results.bed
ADD COMMENT
0
Entering edit mode

Thank you very much!!

ADD REPLY
0
Entering edit mode

If you separated gtf and bed files for each strand (+ and -) is there a way to merge the resulting files together? Also what is there are multiple features that match your new_file? I am interested in also getting 'exon, intron, 3'UTR, 5'UTR' features too.

ADD REPLY
0
Entering edit mode
7.5 years ago
windsur ▴ 20

Is it possible do the same thing from a bam file? I generated the bed file using bamtobed but I do not know what option should I choice to get the name of the gene and the exon. thanks!

ADD COMMENT

Login before adding your answer.

Traffic: 3832 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6