4.8 years ago
lakhujanivijay
I have a whole-exome target BED file which looks something like this:
1 879287 879533
1 880073 880180
1 880436 880526
1 880897 881033
1 881552 881666
I want to annotate this file with gene name information something like this:
1 879287 879533 gene A
1 880073 880180 gene B
1 880436 880526 gene C
1 880897 881033 gene D
1 881552 881666 gene E
I found several posts on Biostars that come quite close to a real solution. Can someone point me to a working one?
What I tried was downloading a BED file from the UCSC Table Browser and intersecting it with my BED file using bedtools intersect; however, several regions are missing.
How do you know? Please show us the mismatching lines. Check that the files are sorted and that the chromosome names are the same ('1' != 'chr1').
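A quick way to test the chromosome-naming point is to compare the first column of both files. The following is only a minimal sketch: the function names are hypothetical, and the annotation line is a made-up example, not real data.

```python
def chrom_names(lines):
    """Collect the chromosome names (first column) from BED/GTF-like lines."""
    names = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments and BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        names.add(line.split()[0])
    return names


def normalize(name):
    """Strip a leading 'chr' so that '1' and 'chr1' compare equal."""
    return name[3:] if name.startswith("chr") else name


def naming_mismatch(targets, annotation):
    """True if the two inputs share chromosomes only after removing 'chr'."""
    a, b = chrom_names(targets), chrom_names(annotation)
    shared_raw = a & b
    shared_norm = {normalize(n) for n in a} & {normalize(n) for n in b}
    return not shared_raw and bool(shared_norm)


# Target lines taken from the question; the annotation line is invented.
targets = ["1\t879287\t879533", "1\t880073\t880180"]
annotation = ["chr1\t879000\t880200\tgeneA"]
print(naming_mismatch(targets, annotation))
```

If this reports a mismatch, an intersection would find no overlaps at all, so it is worth ruling out before looking at individual missing regions.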
What I did was: I downloaded the GTF from the Ensembl FTP (Homo_sapiens.GRCh37.87.gtf) and made a genes.bed file from it. Now I am intersecting this file with my target_exome.bed file, and there are around 1200 regions which do not overlap any gene.
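The steps described here (GTF to gene-level BED, then overlap with the targets) can be sketched in plain Python. This is an illustration under stated assumptions, not the poster's actual commands and not bedtools itself: the function names are hypothetical, the GTF line is invented, and it assumes the GTF has "gene" feature lines with a gene_name attribute. Note that GTF coordinates are 1-based and inclusive while BED is 0-based and half-open, so the start is shifted by one.

```python
import re


def gtf_genes_to_bed(gtf_lines):
    """Turn GTF 'gene' lines into (chrom, start, end, name) BED-style tuples."""
    genes = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = re.search(r'gene_name "([^"]+)"', fields[8])
        name = match.group(1) if match else "."
        # GTF is 1-based inclusive; BED is 0-based half-open.
        genes.append((fields[0], int(fields[3]) - 1, int(fields[4]), name))
    return genes


def annotate(targets, genes):
    """Append overlapping gene names to each target; '.' marks no overlap."""
    annotated = []
    for chrom, start, end in targets:
        hits = sorted({
            name for g_chrom, g_start, g_end, name in genes
            if g_chrom == chrom and g_start < end and start < g_end
        })
        annotated.append((chrom, start, end, ",".join(hits) if hits else "."))
    return annotated


# Invented GTF line for illustration only.
gtf = ['1\thypothetical\tgene\t879001\t880200\t.\t+\t.\tgene_id "G1"; gene_name "geneA";']
# Target intervals taken from the question.
targets = [("1", 879287, 879533), ("1", 881552, 881666)]

genes = gtf_genes_to_bed(gtf)
for row in annotate(targets, genes):
    print("\t".join(map(str, row)))
```

Targets printed with "." are the ones that overlap no gene, which makes it easy to pull out the unmatched regions and inspect them. The linear scan over all genes is slow for a whole exome; it is meant to show the logic, and a dedicated interval tool would be the practical choice.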
Thanks Pierre Lindenbaum for the edit; however, the problem remains as it is
lakhujanivijay, BEDTools will do this via any GTF from GENCODE, Ensembl, NCBI, etc. You could also obtain annotation from UCSC and use that. BEDTools has many different functions and parameters - you'll get the right combination eventually.
Is this BED file based on GENCODE/Ensembl? This could explain why certain genes are missing since GENCODE/Ensembl is more comprehensive than RefSeq.
https://github.com/imgag/ngs-bits/blob/master/doc/tools/BedAnnotateGenes.md - however it takes some time to install the database.
Many R packages do gene-based annotation.
Are you looking for gene symbols (MTOR), or Ensembl identifiers (ENSG*)?
We should be able to help you using the Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables) or Data Integrator (http://genome.ucsc.edu/cgi-bin/hgIntegrator). There are also computational solutions using the MySQL server or premade GTF files (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/), depending on what output you are looking for.
If you email your question with your file attached to genome@soe.ucsc.edu, we can take a look. If the file is too large, send just the 1200 regions that don't match, along with an example of your desired output.