How To Evaluate And Extract The General Clustering Of Genes?

14.0 years ago

Chuangye • 0

The following miRNAs are located in intergenic regions, and the length of chrX is 165556469 bp. How can I evaluate and extract the general clustering of miRNA genes and assess the statistical significance of that clustering? Could someone share their method? Thanks in advance!

chrX    100318893    100318993    mmu-mir-672    -
chrX    101547000    101547088    mmu-mir-384    -
chrX    101581800    101581898    mmu-mir-325    -
chrX    135657311    135657376    mmu-mir-3475    +
chrX    139544157    139544267    mmu-mir-680-2    +
chrX    18303252    18303347    mmu-mir-221    -
chrX    18303851    18303930    mmu-mir-222    -
chrX    48986319    48986394    mmu-mir-363    -
chrX    48986464    48986555    mmu-mir-92a-2    -
chrX    48986609    48986693    mmu-mir-19b-2    -
chrX    48986739    48986819    mmu-mir-20b    -
chrX    48986957    48987040    mmu-mir-18b    -
chrX    48987129    48987194    mmu-mir-106a    -
chrX    49292623    49292705    mmu-mir-450b    -
chrX    49292780    49292871    mmu-mir-450a-1    -
chrX    49292925    49292994    mmu-mir-450a-2    -
chrX    49294029    49294114    mmu-mir-542    -
chrX    49297881    49297980    mmu-mir-351    -
chrX    49298610    49298681    mmu-mir-503    -
chrX    49298881    49298976    mmu-mir-322    -
chrX    63037421    63037483    mmu-mir-743a    -
chrX    63037920    63037997    mmu-mir-743b    -
chrX    63041037    63041102    mmu-mir-742    -
chrX    63041422    63041498    mmu-mir-883a    -
chrX    63050554    63050632    mmu-mir-883b    -
chrX    63053259    63053326    mmu-mir-471    -
chrX    63057469    63057540    mmu-mir-741    -
chrX    63059887    63059962    mmu-mir-463    -
chrX    63061194    63061272    mmu-mir-880    -
chrX    63062172    63062250    mmu-mir-878    -
chrX    63062608    63062686    mmu-mir-881    -
chrX    63071092    63071169    mmu-mir-871    -
chrX    63074615    63074690    mmu-mir-470    -
chrX    63086619    63086700    mmu-mir-465c-1    -
chrX    63089866    63089945    mmu-mir-465b-1    -
chrX    63093181    63093262    mmu-mir-465c-2    -
chrX    63096428    63096507    mmu-mir-465b-2    -
chrX    63099716    63099790    mmu-mir-465a    -
chrX    63213842    63213925    mmu-mir-1274a    +
chrX    6394641    6394733    mmu-mir-500    -
chrX    6398201    6398310    mmu-mir-501    -
chrX    6398940    6399005    mmu-mir-362    -
chrX    6404947    6405015    mmu-mir-188    -
chrX    6405360    6405456    mmu-mir-532    -
chrX    85012192    85012272    mmu-mir-1906-2    -
chrX    99775639    99775715    mmu-mir-421    -
chrX    99775778    99775873    mmu-mir-374    -
chrX    99775803    99775852    mmu-mir-374c    +

mirna clustering • 3.0k views

ADD COMMENT • link updated 9.6 years ago by Biostar 20 • written 14.0 years ago by Chuangye • 0

"Clustering" has several meanings - it's not clear what you want to do from this question. Are you interested in chromosomal positions? Could you be more specific?

ADD REPLY • link 14.0 years ago by Neilfws 49k

Sorry, but this doesn't make much more sense to me than before... You will need some experimental data on gene expression then, e.g. microarray data; are you wondering where to find it? Is that what you are asking? Or do you refer to the genomic position of the miRNAs, in a cluster? Or do you want to do a database search about these features?

ADD REPLY • link 14.0 years ago by Michael 55k

Thank you for your good question. Clustered miRNAs have similar gene expression patterns and are transcribed together as a polycistron. But I don't have the promoter data. So if consecutive miRNAs are <3000 nt apart, how could I get the cluster information?

ADD REPLY • link 14.0 years ago by Chuangye • 0

I agree. I cannot make sense of this question as it is written. Pose a clear question and you will most likely receive helpful responses.

ADD REPLY • link 14.0 years ago by Larry_Parnell 16k

Dear Chuangye, please reformulate your question. I'll close down this question as being too generic.

ADD REPLY • link 14.0 years ago by Istvan Albert 102k

I think they are asking how to group genes with starts that lie in a 3000 bp window?

ADD REPLY • link 14.0 years ago by Neilfws 49k

neilfws has expressed my question in simple words. Thank you very much! Now I also realize it is not a wise question.

ADD REPLY • link 14.0 years ago by Chuangye • 0

OK, re-opening this now that we understand the question.

ADD REPLY • link 14.0 years ago by Neilfws 49k

Ram · Answer 1 · 2010-12-01

So, if we understand the question correctly, you want to group the genes according to whether their starts are within 3000 bp of one another.

To approach the problem, we first need to define an algorithm. It goes something like this:

Read the data file
Split the genes into (+) and (-) strand
Order each set of genes by start position
Define start = start position of the first gene and cluster = 0
For each gene: is its start >= start + 3000?
- if yes (it begins a new cluster), start = start(current gene), increment cluster by 1, cluster(current gene) = cluster
- if no (it's in the same cluster), cluster(current gene) = cluster
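
To make the rule concrete, here is a minimal sketch of the steps above applied to the five lowest-coordinate (-) strand starts from the question. Python is used purely for illustration, and the function name assign_clusters and its window parameter are made up for this example. Note that the window is measured from the first gene of the current cluster, not from the previous gene.

```python
def assign_clusters(starts, window=3000):
    """Label each start (in sorted order) with a cluster number.

    A new cluster opens when a start lies at least `window` bp past the
    start of the first gene in the current cluster.
    """
    labels = []
    cluster = 0
    anchor = None  # start of the first gene in the current cluster
    for s in sorted(starts):
        if anchor is None:
            anchor = s
        elif s >= anchor + window:
            cluster += 1
            anchor = s
        labels.append(cluster)
    return labels


# starts of mmu-mir-500, -501, -362, -188 and -532 from the question
minus_starts = [6394641, 6398201, 6398940, 6404947, 6405360]
print(assign_clusters(minus_starts))  # [0, 1, 1, 2, 2]
```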

Next, you need to implement that in the language of your choice. Here's a solution in R. It assumes that the data in your question are saved in a file named mirna.txt.

# read data file (whitespace-separated, no header row)
genes           <- read.table("mirna.txt", header = FALSE)
colnames(genes) <- c("chr", "start", "end", "name", "strand")

# split into (+) and (-), sort by start
genes.plus  <- subset(genes, strand == "+")
genes.minus <- subset(genes, strand == "-")
genes.plus  <- genes.plus[order(genes.plus$start), ]
genes.minus <- genes.minus[order(genes.minus$start), ]

# function to cluster genes; expects data already sorted by start
findClusters <- function(data, window = 3000) {
  # initialise inside the function so that each gene set is measured
  # from its own first gene rather than from a global variable
  start <- data$start[1]
  clust <- 0
  data$clust <- rep(NA, nrow(data))
  for(i in seq_len(nrow(data))) {
    # a gene at least `window` bp past the first gene of the
    # current cluster opens a new cluster
    if(data$start[i] >= start + window) {
      clust <- clust + 1
      start <- data$start[i]
    }
    data$clust[i] <- clust
  }
  data
}

# run on each gene set
findClusters(genes.plus)
findClusters(genes.minus)

If you run that, you should see that the (+) genes do not fall into clusters (none are within 3000 bp of each other). However, the first few lines of the (-) genes look like this:

   chr    start     end        name            strand clust
40 chrX   6394641   6394733    mmu-mir-500      -     0
41 chrX   6398201   6398310    mmu-mir-501      -     1
42 chrX   6398940   6399005    mmu-mir-362      -     1
43 chrX   6404947   6405015    mmu-mir-188      -     2
44 chrX   6405360   6405456    mmu-mir-532      -     2
6  chrX  18303252  18303347    mmu-mir-221      -     3
7  chrX  18303851  18303930    mmu-mir-222      -     3
8  chrX  48986319  48986394    mmu-mir-363      -     4
9  chrX  48986464  48986555  mmu-mir-92a-2      -     4
10 chrX  48986609  48986693  mmu-mir-19b-2      -     4
11 chrX  48986739  48986819    mmu-mir-20b      -     4
12 chrX  48986957  48987040    mmu-mir-18b      -     4
13 chrX  48987129  48987194   mmu-mir-106a      -     4

Altogether, the (-) genes fall into 24 clusters (numbered 0-23, including those with only one gene).
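
The grouping itself says nothing about the statistical significance the question asked for. One possible way to get a rough handle on it, offered here only as a sketch and not as an established method, is a Monte Carlo comparison: scatter the same number of starts uniformly at random along chrX and ask how often the random placement yields as few clusters as observed. The uniform null model is an assumption (it ignores that the miRNAs are restricted to intergenic regions), and the names count_clusters and empirical_p_value are hypothetical. The sketch uses Python's standard library, the chrX length from the question, and the 44 (-) strand genes with their 24 clusters.

```python
import random

CHR_LENGTH = 165_556_469  # chrX length given in the question
WINDOW = 3000             # same window as the grouping rule above


def count_clusters(starts, window=WINDOW):
    """Count clusters with the same rule as the grouping: a new cluster
    opens when a start is >= window bp past the current cluster's first start."""
    ordered = sorted(starts)
    if not ordered:
        return 0
    n_clusters = 1
    anchor = ordered[0]
    for s in ordered[1:]:
        if s >= anchor + window:
            n_clusters += 1
            anchor = s
    return n_clusters


def empirical_p_value(observed_clusters, n_genes, n_sim=1000, seed=0):
    """Fraction of random placements with at most `observed_clusters`
    clusters (fewer clusters means tighter grouping)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(n_sim):
        simulated = [rng.randrange(CHR_LENGTH) for _ in range(n_genes)]
        if count_clusters(simulated) <= observed_clusters:
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_sim + 1)


p = empirical_p_value(observed_clusters=24, n_genes=44)
print(p)
```

With this correction the estimate cannot drop below 1/(n_sim + 1), so a smaller p-value needs more simulations. A more realistic null would sample positions only from intergenic intervals, which requires an annotation that is not part of this thread.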