I have the UCSC RefSeq track with txstart sites for all the genes. I want to find the distance between consecutive pairs of genes in order to determine which consecutive genes lie within a particular range of each other. How can I do this using Python or R ?
You could probably do this relatively easily in R with GenomicRanges. Read in the RefSeq track, convert the transcripts to GRanges and then apply a function that computes the distance between a range and the output from nearest().
Edit: Here's an example:
library("GenomicRanges")
refGene <- read.delim("~/Downloads/refGene.txt", header=T)
gr <- GRanges(seqnames=Rle(refGene$chrom),
ranges=IRanges(start=refGene$txStart, end=refGene$txEnd, names=refGene$name),
strand=Rle(strand(refGene$strand)))
neighbors <- nearest(gr) #This can return NA
REMOVE <- whichis.na(neighbors))
neighbors <- neighbors[-REMOVE]
neighbor <- gr[neighbors]
gr <- gr[-REMOVE]
distances <- distance(gr, neighbor)
I just tried this on my laptop and it seems to work fine. If this isn't exactly what you want, you should be able to easily modify it.