I have multiple genes and their coordinates. I want to calculate the distance between genes and cluster those genes whose distance is below a threshold value. I am trying to do it manually. Is there any program or script to do that other than bedtools? Thanks
Please elaborate on your question and on what kind of data you have. Try to post a snippet of the data. If you want to cluster your genes with unsupervised hierarchical clustering, you can calculate the pairwise distances and plot a dendrogram. You can do that in R. Look for linkage methods like complete linkage or ward.D2. Try to understand how the hclust function works if the intention is to cluster all of them. When you say you want to cluster only those genes below a specific threshold, you have to clarify on what basis you are calculating your threshold and why you want to do that. Your clustering will obviously change when you remove observations, but is that the correct way to do it? Be more descriptive and then we can help you more. Thanks
It is unclear what you are asking. It seems as though you want to group genes based on their genomic location, like in a gene cluster?
What is the purpose of this approach?
What is the distance of genes on different chromosomes? Would it be better to use genetic distance in cM?
Yes, this is exactly what I want. I want to make a syntenic region, and later on I want to apply it to enhancers so I can compare it with other genomic locations.
Thank you guys for your quick reply. Here is my problem in detail.
I have 50+ genes and their coordinates (location on the chromosome), like
32889611-32974403
33134735-33219529
33304325-33389119
I want to find the distance (in terms of location on the chromosome) between these genes, and if the distance is less than 20000, I want to cluster those genes.
For example
I calculated the distance (difference) between Gene2 and Gene1 (33134735 - 32974403), which is 160332. I want to calculate the distance between all genes in the same way. In the next step I want to take only those genes (with their original coordinates) whose difference is less than 20000 and put them together. For example, if the distance between two or more genes is within range, place them like this: 32889611-32974403---33134735-33219529--33304325-33389119.
I am able to do it manually, but I have had no luck doing it automatically except with bedtools.
Thanks for your help.
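As a rough illustration of the calculation described above, here is a minimal R sketch using only the three coordinates from the post. It assumes all three genes sit on the same chromosome and do not contain one another; the names `genes`, `gap`, `threshold` and `within_threshold` are made up for this example.

```r
# The three example genes from the post (assumed to be on one chromosome)
genes <- data.frame(
  start = c(32889611, 33134735, 33304325),
  end   = c(32974403, 33219529, 33389119)
)
genes <- genes[order(genes$start), ]

# Gap between neighbours: start of the next gene minus end of the previous one.
# A value of zero or below would mean the two genes touch or overlap.
gap <- genes$start[-1] - genes$end[-nrow(genes)]
print(gap)

# Which neighbouring pairs are closer than the threshold from the post
threshold <- 20000
within_threshold <- gap < threshold
print(within_threshold)
```

With these particular coordinates both gaps come out larger than 20000, so none of the three example genes would be joined at that threshold.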
Gene clusters could be defined on physical or genetic distance, among other things. Whether gene clusters are relevant or even exist can be discussed (see e.g. http://dev.biologists.org/content/134/14/2549 ). To answer this, one could start by determining whether genes are located within a certain distance of each other. However, it might be better to compare multiple organisms rather than look at a single organism, and then check whether the gene organization is conserved between them.
Possibly easiest to do in R, along these lines (untested):
# Read the coordinates into R; the call depends on your format (e.g. export the 2 columns to CSV from Excel)
gene.coords <- read.csv(...)
gene.coords <- t(apply(gene.coords, 1, sort)) # make sure start < stop in every row
# Pairwise distance matrix: the smallest absolute difference among start-start, end-end, start-end and end-start.
# Note that overlapping genes still get a non-zero distance with this definition.
my.gene.dist <- apply(gene.coords, 1, function(x) {
  apply(gene.coords, 1, function(y) min(abs(c(x - y, x[1] - y[2], x[2] - y[1]))))
})
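One possible way to go from a distance matrix to groups is single-linkage hierarchical clustering cut at the threshold, sketched below. This is an assumption about how the matrix might be used, not something confirmed in the thread. The coordinates are toy values, `interval_gap` is a hypothetical helper that returns 0 for overlapping genes (unlike the formula above), and the cut height of 50 only suits the toy data.

```r
# Toy coordinates (hypothetical, not from the thread): one gene per row
coords <- matrix(
  c( 100,  200,
     210,  300,
    5000, 6000,
    6010, 7000),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("start", "end"))
)

# Gap between two intervals: 0 if they overlap, otherwise the distance
# between their nearest ends
interval_gap <- function(a, b) max(0, max(a[1], b[1]) - min(a[2], b[2]))

# Fill the pairwise gap matrix
n <- nrow(coords)
d <- matrix(0, n, n, dimnames = list(rownames(coords), rownames(coords)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- interval_gap(coords[i, ], coords[j, ])
  }
}

# Single linkage merges two groups as soon as any pair of members is close,
# so cutting the tree at the threshold chains neighbouring genes together
tree <- hclust(as.dist(d), method = "single")
clusters <- cutree(tree, h = 50)
print(clusters)
```

Complete linkage or ward.D2 would answer a different question, since they also penalise the spread of a whole group rather than just the gap between neighbours. Note that `cutree` keeps merges at a height equal to `h`, so a strict "less than" threshold needs a slightly smaller cut.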
That would simply be the minimum difference of start/end - start/end. I think you can even do this in Excel. What kind of script would you need?
Yes, I am already doing it manually in Excel, but I want to do it with the help of Perl or R. I just want to re-confirm the results I generated with bedtools.
Maybe you want to cluster multiple genes by expression level, not by physical distance.
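For an independent cross-check in R, a sort-and-sweep pass over the genes is one option. The sketch below is only an assumption about what such a script could look like: the function name `cluster_by_gap`, the column names `chrom`, `start`, `end` and the toy rows are all hypothetical, and the grouping rule (strictly less than `max_gap`, never across chromosomes) may not match the bedtools settings exactly, so compare the two definitions before comparing results.

```r
# Assign a cluster id to each gene: genes on the same chromosome are chained
# together while the gap to the current cluster is below max_gap
cluster_by_gap <- function(genes, max_gap) {
  genes <- genes[order(genes$chrom, genes$start), ]
  id <- integer(nrow(genes))
  current <- 0L
  reach <- -Inf        # furthest end coordinate seen in the current cluster
  prev_chrom <- NA
  for (i in seq_len(nrow(genes))) {
    new_cluster <- is.na(prev_chrom) ||
      genes$chrom[i] != prev_chrom ||
      genes$start[i] - reach >= max_gap
    if (new_cluster) {
      current <- current + 1L
      reach <- genes$end[i]
    } else {
      reach <- max(reach, genes$end[i])
    }
    id[i] <- current
    prev_chrom <- genes$chrom[i]
  }
  genes$cluster <- id
  genes
}

# Toy input (hypothetical values), deliberately unsorted
genes <- data.frame(
  chrom = c("chr1", "chr1", "chr1", "chr2"),
  start = c(1000, 40000, 30000, 1500),
  end   = c(25000, 60000, 35000, 9000),
  stringsAsFactors = FALSE
)
result <- cluster_by_gap(genes, max_gap = 20000)
print(result)
```

Tracking the furthest end seen so far, rather than just the previous gene's end, keeps a long gene that contains shorter ones from splitting a cluster.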