Suppose I have Protein1, Protein2, Protein3 that are expressed by Gene1, Gene2, Gene3. I want to know Gene1, Gene2, Gene3 in the genome sequence are neighbor or not?
Genome: ========Gene1=====Gene2=======Gene3======
Suppose I have Protein1, Protein2, Protein3 that are expressed by Gene1, Gene2, Gene3. I want to know Gene1, Gene2, Gene3 in the genome sequence are neighbor or not?
Genome: ========Gene1=====Gene2=======Gene3======
I suggest you to sort your gff file according to scaffold (chromesome) firstly, and coordinate secondly. they sit at the first column and fourth column, repectively.
REF: http://asia.ensembl.org/info/website/upload/gff.html?redirect=no
You can do it in R:
> biocLite('Homo.sapiens')
> library(Homo.sapiens)
> human.genes = genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
> data.frame(
gene = human.genes$gene_id,
neigh.upstream = human.genes$gene_id[precede(human.genes, human.genes)],
neigh.downstream=human.genes$gene_id[follow(human.genes, human.genes)]
)
gene neigh.d neigh2.u
1 1 79673 116412
2 10 63898 9
3 100 10955 79015
4 1000 83539 1825
5 10000 9859 159
6 100008586 645037 26749
# Let's verify the first gene:
> subset(human.genes, gene_id %in% c(1, 79673, 162968))
GRanges object with 3 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
1 chr19 [58858172, 58874214] - | 1
162968 chr19 [58865723, 58874214] - | 162968
79673 chr19 [58637619, 58666477] - | 79673
This will give you a dataframe with the upstream and downstream neighbor of each human gene. You can simply intersect with your list to identify if two genes are neighbors.
Note that the follow()
and precede()
functions ignore overlapping genes. If you want to reduce the problem, you can use the resize()
function to create two new GRanges objects, one with only the start/end of the gene, and calculate the neighbors from there:
> human.genes.start = resize(human.genes, 1, fix='start')
> human.genes.end = resize(human.genes, 1, fix='end')
> data.frame(
gene = human.genes$gene_id,
neigh1 = human.genes$gene_id[precede(human.genes.start, human.genes)],
neigh2=human.genes$gene_id[follow(human.genes.end, human.genes)]
)
Have a look at the nearest()
, distanceToNearest()
functions as well.
EDIT: I have been thinking about the problem of overlapping genes during these days. I think the easiest way to handle them is to reduce()
the gene coordinates first, merging all overlapping genes in one unit:
> human.genes.reduced = reduce(human.genes, with.revmap=T)
GRanges object with 20884 ranges and 1 metadata column:
seqnames ranges strand | revmap
<Rle> <IRanges> <Rle> | <IntegerList>
[1] chr1 [ 11874, 14409] + | 581
[2] chr1 [ 69091, 70008] + | 19514
[3] chr1 [762971, 794826] + | 17162
[4] chr1 [860530, 879961] + | 4936
[5] chr1 [895967, 901099] + | 10174
... ... ... ... ... ...
[20880] chrUn_gl000213 [108007, 139339] - | 652
[20881] chrUn_gl000218 [ 38786, 97454] - | 546,11344
[20882] chrUn_gl000219 [ 53809, 99642] - | 8941
[20883] chrUn_gl000220 [ 97129, 126696] + | 1400
[20884] chrUn_gl000228 [ 79448, 124171] + | 18730,1868,642
-------
The with.revmap
option allows you to keep the original gene ids. In the example, you can see that the genes 18730,1868,642 are overlapping, and therefore have been merged to a single unit.
Hi. I am a Ph.D. student studying biomedicine. I do not know how to identify the nearby (neighboring) genes of one gene of interest. I just have a little background of R. With the above code, if I want to know the nearby genes of a specific gene (e.g. BAIAP2-DT, Gene ID: 440465), can you please show me how I should use your R codes? Thanks in advance.
This depends on you definition of neighbors. From a bed file with gene coordinates, you can use "Bedtools closest" or "Bedtools window" in order to look around each feature and detect neighboring features. Many other tools allow you to do such things like "BEDOPS".
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
what do you mean with "neighbour"?
Suppose I have Protein1, Protein2, Protein3 that are expressed by Gene1, Gene2, Gene3. I want to know Gene1, Gene2, Gene3 in the genome sequence are neighbor or not?