Question

Annotate .bed file with gene names

3

Entering edit mode

10.7 years ago

stevenlang123 ▴ 210

Hi y'all,

I have a .bed file file currently formatted like this:

chr     start   stop
1    14520    14812
1    65409    65725
1    65731    66073
1    69381    69700
1    721281    722042
1    752816    753135

I would like to get something that looks like this (where for overlapping exons, use the exon boundary, unless this boundary was <10bp, in which case expand the probes to include at minimum 10bp of sequence):

chr start   stop    name
1   69090   70008   OR4F5
1   565876  566576  
1   801642  802733  
1   861321  861393  SAMD11
1   865534  865716  SAMD11

Is there a way that I can use UCSC or another tool to accomplish this?

Assembly next-gen genome • 21k views

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.7 years ago by stevenlang123 ▴ 210

Ram · Answer 1 · 2014-10-27

5

Entering edit mode

10.7 years ago

Alex Reynolds 36k

Grab exons, e.g. via GENCODE:

$ wget ftp://ftp.sanger.ac.uk/pub/gencode/release_15/gencode.v15.annotation.gtf.gz
$ gunzip --stdout gencode.v21.annotation.gtf.gz \
    | gtf2bed - \
    | grep "exon" \
    > gencode.exons.bed

Then use bedmap to map exon IDs to your regions of interest (roi.bed):

$ bedmap --echo --echo-map-id-uniq roi.bed gencode.exons.bed > answer.bed

If you need to, use awk to pre-process the exons file based on your criteria:

$ awk '{if (($3-$2) > 10) { print $0 } else { print $1"\t"$2"\t"($2+10)"\t"$4}}' gencode.exons.bed > expanded.exons.bed

Then map on that result.

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.7 years ago by Alex Reynolds 36k

0

Entering edit mode

Improving this with a pipe from curl to gunzip (without writing gtf.gz):

curl -s "ftp://ftp.sanger.ac.uk/pub/gencode/release_21/gencode.21.annotation.gtf.gz" |
   gunzip -c |
   gtf2bed - |
   <your code>

:-)

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.7 years ago by PoGibas 5.1k

0

Entering edit mode

Thanks for the feedback. Sometimes it is useful to keep the annotations file, but keep the compressed version to save disk space, and extracting it only as needed to do analyses. Particularly, network access to download a large file can be a costly part of analyses, in terms of time, especially repeating it unnecessarily. In any case, there are lots of ways to use wget or curl to follow either approach.

ADD REPLY • link 10.7 years ago by Alex Reynolds 36k

Ram · Answer 2 · 2014-10-26

4

Entering edit mode

10.7 years ago

Manvendra Singh ★ 2.2k

You can download the bed files of GTF from UCSC carrying gene names in separate column.

Once you got the bed file (or download GTF and convert it into bed format).

then you should extend your bed file with 10 basepairs by some awk command

awk '{ print $1,$2-10,$3+10}' OFS="\t" your_file.bed > new_file.bed

(if you have strand info then should do it taking strands care of)

then intersect with downloaded file

bedtools intersect -a new_file.bed -b downloaded_file.bed -f 1 > results.bed

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.7 years ago by Manvendra Singh ★ 2.2k

0

Entering edit mode

Thank you very much!!

ADD REPLY • link 10.7 years ago by stevenlang123 ▴ 210

0

Entering edit mode

If you separated gtf and bed files for each strand (+ and -) is there a way to merge the resulting files together? Also what is there are multiple features that match your new_file? I am interested in also getting 'exon, intron, 3'UTR, 5'UTR' features too.

ADD REPLY • link 7.0 years ago by CrisMar ▴ 80

score 0 · Answer 3 · 2017-07-19

0

Entering edit mode

8.0 years ago

windsur ▴ 20

Is it possible do the same thing from a bam file? I generated the bed file using bamtobed but I do not know what option should I choice to get the name of the gene and the exon. thanks!

ADD COMMENT • link 8.0 years ago by windsur ▴ 20