Question

Script for Gene Clustering.

8.2 years ago

talalamin • 0

I have multiple genes and their coordinates. I want to calculate the distance between genes and cluster those genes whose distance is below a threshold value. I am currently doing it manually. Is there any program or script to do that, other than bedtools? Thanks

gene cluster distance merge • 2.7k views

ADD COMMENT • link updated 8.2 years ago by Ben ▴ 60 • written 8.2 years ago by talalamin • 0

Please elaborate on your question and on what kind of data you have, and try to include a snippet of the data. If you want to cluster your genes with an unsupervised hierarchical clustering method, you can calculate the pairwise distances and plot a dendrogram; you can do that in R. Look for methods like complete linkage or ward.D2, and try to understand how the hclust function works if the intention is to cluster all of them. When you say you want to cluster only those genes below a specific threshold, you have to clarify on what basis you are calculating the threshold and why you want to do that. Your clustering will obviously change when you remove observations, but is that the correct way to do it? Be more descriptive and then we can help you more. Thanks

ADD REPLY • link 8.2 years ago by ivivek_ngs ★ 5.2k

It is unclear what you are asking. It seems as though you want to group genes based on their genomic location, as in a gene cluster? What is the purpose of this approach?

What is the distance between genes on different chromosomes? Would it be better to use genetic distance in cM?

ADD REPLY • link 8.2 years ago by Michael 56k

Yes, this is exactly what I want. I want to define a syntenic region, and later apply the same approach to enhancers, so that I can compare it with other genomic locations.

ADD REPLY • link 8.2 years ago by talalamin • 0

Thank you guys for your quick reply. Here is my problem in detail.

I have 50+ genes and their coordinates (location on the chromosome), like

32889611-32974403
33134735-33219529
33304325-33389119

I want to find the distance (in terms of location on the chromosome) between these genes. If the distance is less than 20000, I want to cluster those genes.

For example

I calculated the distance (difference) between Gene2 and Gene1 (33134735 - 32974403), which is 160332. I want to calculate this distance between all genes. In the next step I want to take only those genes (with their original coordinates) whose distance is less than 20000 and put them together. For example, if the distance between two or more genes is within range, then place them like this: 32889611-32974403---33134735-33219529--33304325-33389119.
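The procedure described above (distance from each gene to the previous one, then grouping when that distance is under a cut-off) could be sketched in R roughly as follows. This is only a minimal sketch: the function name `cluster_by_gap` and its arguments are hypothetical, and it assumes all genes lie on the same chromosome.

```r
# Hypothetical helper: group genes on ONE chromosome whose gap to the
# preceding genes is below `max_gap` (name and default are illustrative).
cluster_by_gap <- function(start, end, max_gap = 20000) {
  ord <- order(start, end)      # sort genes by start coordinate
  start <- start[ord]
  end <- end[ord]
  # Furthest end seen so far, so nested or overlapping genes are handled
  reach <- cummax(end)
  # Gap between each gene's start and the furthest end of all earlier genes
  gap <- c(Inf, start[-1] - reach[-length(reach)])
  # A new cluster begins wherever the gap is not below the threshold
  cluster <- cumsum(gap >= max_gap)
  data.frame(start = start, end = end, gap = gap, cluster = cluster)
}

# The three example genes from the post
genes <- data.frame(
  start = c(32889611, 33134735, 33304325),
  end   = c(32974403, 33219529, 33389119)
)
res <- cluster_by_gap(genes$start, genes$end)
print(res)
```

For the three example genes the gaps are 160332 and 84796, so with a cut-off of 20000 each gene ends up in its own cluster; a larger `max_gap` (say 200000) would put all three into one.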

I am able to do it manually, but have had no luck doing it automatically except with bedtools. Thanks for your help.

ADD REPLY • link 8.2 years ago by talalamin • 0

That would simply be the minimum of the start/end - start/end differences; I think you can even do this in Excel. What kind of script would you need?

ADD REPLY • link 8.2 years ago by Michael 56k

Yes, I am already doing it manually in Excel, but I want to do it with the help of Perl or R. I just want to re-confirm my results generated with bedtools.

ADD REPLY • link 8.2 years ago by talalamin • 0

Maybe you want to cluster multiple genes by expression level, not physical distance.

ADD REPLY • link 8.2 years ago by Ben ▴ 60

Gene clusters could be defined by physical or genetic distance, among other things. Whether gene clusters are relevant or even exist can be debated (see e.g. http://dev.biologists.org/content/134/14/2549 ). To answer this, one could start by determining whether genes are located within a certain distance of each other; however, it might be better to compare multiple organisms rather than look at a single one, and then ask whether the gene organization is conserved between them.

ADD REPLY • link 8.2 years ago by Michael 56k

score 1 · Answer 1 · 2017-05-25

This is possibly easiest to do in R, along these lines (untested):

# Read the coordinates into R; the call depends on your format (e.g. export the 2 columns to CSV from Excel)
gene.coords = read.csv(...)
# Make sure start < stop in every row
gene.coords = t(apply(gene.coords, 1, sort))
# Matrix of minimum distances between gene boundaries (start-start, end-end, start-end, end-start)
my.gene.dist = apply(gene.coords, 1, function(x) {
  apply(gene.coords, 1, function(y) min(abs(c(x - y, x[1] - y[2], x[2] - y[1]))))
})
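To get from the distance matrix to actual groups, one option is single-linkage hierarchical clustering cut at the threshold. The following is a sketch, not part of the original answer: it hard-codes the three example genes from the thread instead of reading a file, assumes they are on one chromosome, and the `threshold` variable is illustrative. Note that `cutree(h = ...)` keeps merges at heights up to and including `h`, and that the boundary-based distance is not zero for overlapping genes unless their boundaries coincide.

```r
# Example coordinates from the thread (one chromosome assumed)
gene.coords <- matrix(
  c(32889611, 32974403,
    33134735, 33219529,
    33304325, 33389119),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("Gene1", "Gene2", "Gene3"), c("start", "end"))
)

# Pairwise minimum distance between gene boundaries, as in the answer above
my.gene.dist <- apply(gene.coords, 1, function(x) {
  apply(gene.coords, 1, function(y) {
    min(abs(c(x - y, x[1] - y[2], x[2] - y[1])))
  })
})

threshold <- 20000  # the asker's cut-off; adjust as needed

# Single linkage joins two groups when ANY pair of their members is close,
# which chains neighbouring genes together along the chromosome
tree <- hclust(as.dist(my.gene.dist), method = "single")
clusters <- cutree(tree, h = threshold)
print(clusters)
```

With these three genes the smallest pairwise distance is 84796, so at a threshold of 20000 every gene stays in its own cluster; cutting at 100000 instead would join Gene2 and Gene3.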