Question

Quick way to annotate a target exome BED file with gene name information

0

Entering edit mode

5.1 years ago

lakhujanivijay 5.9k

I have a whole exome target BED file which looks something like this

1   879287  879533
1   880073  880180
1   880436  880526
1   880897  881033
1   881552  881666

I want to annotate this file with gene name information something like this:

1   879287  879533  gene A
1   880073  880180  gene B
1   880436  880526  gene C
1   880897  881033  gene D
1   881552  881666  gene E

I could see several posts on biostars which I found quite close to a real solution. Can someone point to a working one?

What I tried is downloading a BED format file from UCSC table browser and intersecting that with my BED file using bedtools intersect however several regions are missing.

exon bed annotate gene • 3.6k views

ADD COMMENT • link 5.1 years ago by lakhujanivijay 5.9k

0

Entering edit mode

What I tried is downloading a BED format file from UCSC table browser and intersecting that with my BED file using bedtools intersect however several regions are missing.

how do you know it ? show us the mismatching lines please . Check the files are sorted and the chromsomes names are the same ('1'!='chr1')

ADD REPLY • link 5.1 years ago by Pierre Lindenbaum 165k

0

Entering edit mode

What I did was

Downloaded GTF from Ensembl FTP (Homo_sapiens.GRCh37.87.gtf)

Then I made a genes.bed file from it

awk -F "\t" '$3=="gene" {print $1 "\t"  $4 "\t" $5 "\t" $9 }' Homo_sapiens.GRCh37.87.gtf | sed 's/;/\t/g' | awk -F "\t" '{print $1 "\t"  $2 "\t" $3 "\t" $6}' | sed 's/gene_name "//g' | sed 's/"//g' | grep -v GL  > genes.bed

Now I am intersecting this file with my target_exome.bed file and there are around 1200 regions which do not overlap to any gene

bedtools intersect -a Exome_Target_hg19.bed -b genes.bed -C  | awk -F "\t" '$4==0{print}' | head

1   1262290 1262412 0
1   1262620 1263143 0
1   32042749    32043059    0
1   32044804    32044868    0
1   32048749    32048842    0
1   32049061    32049176    0
1   32050278    32050402    0
1   32050486    32050637    0
1   32050751    32050941    0
1   32051040    32051086    0

ADD REPLY • link 5.1 years ago by lakhujanivijay 5.9k

0

Entering edit mode

{print $1 "\t"  $4 "\t" $5 "\t" $9 }'

should be

{print $1 "\t"  (int($4)-1) "\t" $5 "\t" $9 }'

ADD REPLY • link 5.1 years ago by Pierre Lindenbaum 165k

0

Entering edit mode

Thanks Pierre Lindenbaum for the edit; however, the problem remains as it is

ADD REPLY • link 5.1 years ago by lakhujanivijay 5.9k

0

Entering edit mode

lakhujanivijay, BEDTools will do this via any GTF from GENCODE, Ensembl, NCBI, etc. You could also obtain annotation from UCSC and use that. BEDTools has many different functions and parameters - you'll get the right combination eventually.

ADD REPLY • link 4.8 years ago by Kevin Blighe 89k

0

Entering edit mode

Is this BED file based on GENCODE/Ensembl? This could explain why certain genes are missing since GENCODE/Ensembl is more comprehensive than RefSeq.

ADD REPLY • link 5.1 years ago by ATpoint 87k

0

Entering edit mode

https://github.com/imgag/ngs-bits/blob/master/doc/tools/BedAnnotateGenes.md - however it takes some time to install the database.

Many R packages do gene-based annotation.

ADD REPLY • link 5.1 years ago by German.M.Demidov ★ 3.0k

0

Entering edit mode

Are you looking for gene symbols (MTOR), or Ensembl identifiers (ENSG*)?

We should be able to help you using the Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables) or Data Integrator (http://genome.ucsc.edu/cgi-bin/hgIntegrator). There are also computational solutions using the mysql server or premade gtf files (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/) depending on what output you are looking for.

If you email us your question with your attached file to genome@soe.ucsc.edu we can take a look. Or if the file is too large, then the 1200 regions that don't match, as well as an example of your desired output.

ADD REPLY • link 5.1 years ago by Luis Nassar ▴ 670