Hi,
Is there a place where I can retrieve the distance between two genes on the same chromosome? I have a list of 100 genes, so it would be nice to retrieve this information from a database.
You could just obtain the coordinates of those genes and then do a simple arithmetic operation: max(start_gene1, start_gene2) - min(end_gene1, end_gene2), assuming start is the lowest coordinate regardless of strand orientation (end would be the real start of a gene located on the minus strand). If the genes overlap, you will get a negative number.
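A minimal sketch of that arithmetic as a shell function, assuming each gene is given as a start and an end with start <= end and that both genes are on the same chromosome. The function name gene_distance and the coordinates are made up for illustration.

```shell
#!/bin/sh
# gene_distance START1 END1 START2 END2
# Hypothetical helper: prints max(start1, start2) - min(end1, end2).
# Assumes start <= end for both genes (strand is ignored) and that
# both genes sit on the same chromosome.
gene_distance() {
    awk -v s1="$1" -v e1="$2" -v s2="$3" -v e2="$4" 'BEGIN {
        # "+ 0" forces numeric comparison of the coordinates.
        max_start = (s1 + 0 > s2 + 0) ? s1 + 0 : s2 + 0
        min_end   = (e1 + 0 < e2 + 0) ? e1 + 0 : e2 + 0
        print max_start - min_end
    }'
}

# Made-up coordinates: two separated genes, then two overlapping genes.
gene_distance 1000 2000 5000 7000   # prints 3000
gene_distance 1000 3000 2500 4000   # prints -500 (the genes overlap)
```

The order of the two genes does not matter, since the max/min pair picks the right ends either way.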
You are in luck: I just wrote a short script to do that for olfactory receptor genes in the mouse genome. I got some fused genes, since they are close together and have similar sequences.
1. grep -f listOlfrGene.txt
   where listOlfrGene.txt contains Ensembl transcript IDs gathered from BioMart (based on a GO term search for olfactory receptor function).
2. bedtools sort -i bedFile > olfr_genes_sorted.bed
   (http://bedtools.readthedocs.org/en/latest/index.html)
3. bedtools closest -s -d -io -N -a olfr_genes_sorted.bed -b olfr_genes_sorted.bed > output.bed
   This gets me a new BED file in the format: gene #1 bed data | closest gene #2 bed data | distance between #1 and #2. Here the closest gene has to be distinct, on the same strand of the same chromosome, and not overlapping (the -s -d -io -N options; read the manual).
4. awk '{print $NF,"\t",$1,"\t",$4,"\t",$10}' output.bed > closestOlfrGenes.txt
   to get the data in the distance | chromosome | geneID #1 | geneID #2 format (which I find more convenient).
5. sort -n closestOlfrGenes.txt | awk '$1 > 0 {print $0}' > sortedClosestOlfrGenes.txt
   gets me the values sorted by distance. I use the awk part to get rid of a couple of values that were at -10 for some reason.

You have a sample from each file here: http://pastebin.com/dMh7MQUU. Note that the end result contains paired lines in this format: distanceX gene1 gene2 \n distanceX gene2 gene1 \n
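If the paired lines get in the way, they could be collapsed with a small awk filter. This is only a sketch, assuming the distance | chromosome | geneID #1 | geneID #2 layout described above. The function name dedupe_pairs and the sample rows are invented for illustration and are not part of the original script.

```shell
#!/bin/sh
# dedupe_pairs [FILE...]
# Hypothetical helper: keeps the first line seen for each unordered
# pair of gene IDs (columns 3 and 4), dropping the reciprocal line.
dedupe_pairs() {
    awk '{
        # Build an order-independent key; appending "" forces a
        # string comparison of the two IDs.
        key = (($3 "") < ($4 "")) ? ($3 SUBSEP $4) : ($4 SUBSEP $3)
        if (!(key in seen)) {
            seen[key] = 1
            print
        }
    }' "$@"
}

# Made-up rows in the distance | chromosome | gene1 | gene2 layout.
printf '%s\n' \
    '1200 chr7 geneA geneB' \
    '1200 chr7 geneB geneA' \
    '3400 chr9 geneC geneD' | dedupe_pairs
```

With the three made-up rows, only the first and third lines are printed. A gene whose closest neighbour does not point back at it appears only once anyway, so the filter leaves it untouched.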
For visualization, with the results here (http://imgur.com/caNxDew):
library(dplyr)
# Columns: distance (bp), chromosome, gene ID, ID of the closest gene
closestOlfr <- read.csv(file = "sortedClosestOlfrGenes.txt", sep = "", header = FALSE, na.strings = ".", col.names = c("dist", "chr", "gene", "closest"))
closestOlfr$dist <- closestOlfr$dist / 1000  # convert to kb
# Histogram of the distances up to 100 kb
h <- hist(closestOlfr$dist[closestOlfr$dist <= 100], breaks = 100, col = "red", xlab = "Distance to closest olfactory gene (kb)", main = "Relative proximity of olfactory genes (cut-off at 100kb)")
dist_wanted <- 20  # threshold in kb
# Count the genes whose closest neighbour lies within the threshold
print(c("For this threshold (kb):", dist_wanted, "here is the number of close genes", sum(closestOlfr$dist <= dist_wanted)))
I conclude that a maximum intron size of 25000 for RNA-Seq alignment is still too high.
Hello, you can download the genomic coordinates of your genes (e.g. from BioMart), sort the list according to chromosomal location and then measure the distances via scripting.
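As a rough sketch of that approach, assuming the coordinates were exported to a tab-separated file with no header line and the columns gene ID, chromosome, start, end (start <= end). The function name adjacent_gene_distances and the sample rows are made up for illustration.

```shell
#!/bin/sh
# adjacent_gene_distances [FILE...]
# Hypothetical helper: sorts genes by chromosome and start, then prints
# gene1, gene2, chromosome and distance for each pair of neighbouring
# genes on the same chromosome. Overlaps give a negative distance.
adjacent_gene_distances() {
    # Sort by chromosome (column 2), then numerically by start (column 3).
    sort -t "$(printf '\t')" -k2,2 -k3,3n "$@" |
    awk -F '\t' -v OFS='\t' '
        NR > 1 && $2 == prev_chr {
            # After sorting, the current start is the larger start.
            min_end = (prev_end + 0 < $4 + 0) ? prev_end + 0 : $4 + 0
            print prev_gene, $1, $2, $3 - min_end
        }
        { prev_chr = $2; prev_gene = $1; prev_end = $4 }
    '
}

# Made-up coordinates, deliberately unsorted.
printf 'geneB\tchr1\t5000\t7000\ngeneA\tchr1\t1000\t2000\ngeneC\tchr2\t300\t900\n' |
    adjacent_gene_distances
```

For the made-up rows this prints a single line, geneA, geneB, chr1 and 3000 separated by tabs, because geneC is alone on its chromosome. It only covers neighbouring genes; arbitrary pairs from the list would need the max/min arithmetic applied pair by pair.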
Care to give some examples?