Question

How can find a list of genes are neighbour?

1

Entering edit mode

9.7 years ago

m.koohi.m ▴ 120

Suppose I have Protein1, Protein2, Protein3 that are expressed by Gene1, Gene2, Gene3. I want to know Gene1, Gene2, Gene3 in the genome sequence are neighbor or not?

Genome: ========Gene1=====Gene2=======Gene3======

proteins genes sequence • 5.3k views

ADD COMMENT • link updated 9.7 years ago by Giovanni M Dall'Olio 28k • written 9.7 years ago by m.koohi.m ▴ 120

1

Entering edit mode

what do you mean with "neighbour"?

ADD REPLY • link 9.7 years ago by Martombo ★ 3.2k

0

Entering edit mode

Suppose I have Protein1, Protein2, Protein3 that are expressed by Gene1, Gene2, Gene3. I want to know Gene1, Gene2, Gene3 in the genome sequence are neighbor or not?

Genome: ========Gene1=====Gene2=======Gene3======

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.7 years ago by m.koohi.m ▴ 120

4

Entering edit mode

9.7 years ago

Giovanni M Dall'Olio 28k

You can do it in R:

> biocLite('Homo.sapiens')
> library(Homo.sapiens)
> human.genes = genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
> data.frame(
   gene = human.genes$gene_id, 
   neigh.upstream = human.genes$gene_id[precede(human.genes, human.genes)], 
   neigh.downstream=human.genes$gene_id[follow(human.genes, human.genes)]
   )
       gene neigh.d neigh2.u
1         1  79673 116412
2        10  63898      9
3       100  10955  79015
4      1000  83539   1825
5     10000   9859    159
6 100008586 645037  26749

# Let's verify the first gene:
> subset(human.genes, gene_id %in% c(1, 79673, 162968))
GRanges object with 3 ranges and 1 metadata column:
         seqnames               ranges strand |     gene_id
            <Rle>            <IRanges>  <Rle> | <character>
       1    chr19 [58858172, 58874214]      - |           1
  162968    chr19 [58865723, 58874214]      - |      162968
   79673    chr19 [58637619, 58666477]      - |       79673

This will give you a dataframe with the upstream and downstream neighbor of each human gene. You can simply intersect with your list to identify if two genes are neighbors.

Note that the follow() and precede() functions ignore overlapping genes. If you want to reduce the problem, you can use the resize() function to create two new GRanges objects, one with only the start/end of the gene, and calculate the neighbors from there:

> human.genes.start = resize(human.genes, 1, fix='start')
> human.genes.end = resize(human.genes, 1, fix='end')
> data.frame(
   gene = human.genes$gene_id, 
   neigh1 = human.genes$gene_id[precede(human.genes.start, human.genes)], 
   neigh2=human.genes$gene_id[follow(human.genes.end, human.genes)]
   )

Have a look at the nearest(), distanceToNearest() functions as well.

EDIT: I have been thinking about the problem of overlapping genes during these days. I think the easiest way to handle them is to reduce() the gene coordinates first, merging all overlapping genes in one unit:

> human.genes.reduced = reduce(human.genes, with.revmap=T)
GRanges object with 20884 ranges and 1 metadata column:
                seqnames           ranges strand   |         revmap
                   <Rle>        <IRanges>  <Rle>   |  <IntegerList>
      [1]           chr1 [ 11874,  14409]      +   |            581
      [2]           chr1 [ 69091,  70008]      +   |          19514
      [3]           chr1 [762971, 794826]      +   |          17162
      [4]           chr1 [860530, 879961]      +   |           4936
      [5]           chr1 [895967, 901099]      +   |          10174
      ...            ...              ...    ... ...            ...
  [20880] chrUn_gl000213 [108007, 139339]      -   |            652
  [20881] chrUn_gl000218 [ 38786,  97454]      -   |      546,11344
  [20882] chrUn_gl000219 [ 53809,  99642]      -   |           8941
  [20883] chrUn_gl000220 [ 97129, 126696]      +   |           1400
  [20884] chrUn_gl000228 [ 79448, 124171]      +   | 18730,1868,642
  -------

The with.revmap option allows you to keep the original gene ids. In the example, you can see that the genes 18730,1868,642 are overlapping, and therefore have been merged to a single unit.

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 9.7 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

Hi. I am a Ph.D. student studying biomedicine. I do not know how to identify the nearby (neighboring) genes of one gene of interest. I just have a little background of R. With the above code, if I want to know the nearby genes of a specific gene (e.g. BAIAP2-DT, Gene ID: 440465), can you please show me how I should use your R codes? Thanks in advance.

ADD REPLY • link 6.5 years ago by haitang.yang • 0

1

Entering edit mode

9.7 years ago

Alternative ▴ 290

This depends on you definition of neighbors. From a bed file with gene coordinates, you can use "Bedtools closest" or "Bedtools window" in order to look around each feature and detect neighboring features. Many other tools allow you to do such things like "BEDOPS".

ADD COMMENT • link 9.7 years ago by Alternative ▴ 290

score 2 · Accepted Answer · 2015-12-08

2

Entering edit mode

9.7 years ago

yuhao819 ▴ 20

I suggest you to sort your gff file according to scaffold (chromesome) firstly, and coordinate secondly. they sit at the first column and fourth column, repectively.

REF: http://asia.ensembl.org/info/website/upload/gff.html?redirect=no

ADD COMMENT • link 9.7 years ago by yuhao819 ▴ 20