Hi all, I have ~900K SNPs and I want to assign them to genes based on their positions. Here is a very simple example of what I am doing.
The first data set contains SNPs and their positions on each chromosome (here I am showing only chromosome 1, but I have 17 chromosomes in total):
data.snps
chrom pos
1 5
1 11
1 19
1 39
1 74
1 77
1 90
The second file contains genes, with start and stop positions on each chromosome (again, for simplicity I am showing only one chromosome):
data.genes
chrom gene_id start stop
1 g1 2 8
1 g2 14 20
1 g3 29 35
1 g4 37 46
1 g5 50 63
1 g6 70 75
1 g7 87 93
I want to assign each SNP to a gene on the same chromosome if its position falls within that gene's range. I am doing it this way:
data_snps$found <- ifelse(
  sapply(data_snps$pos, function(p) any(data_genes$start <= p & data_genes$stop >= p)),
  data_genes$gene_id, NA)
which gives me the output that I want
output
chrom pos assigned
1 5 1
1 11 NA
1 19 3
1 39 4
1 74 5
1 77 NA
1 90 7
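For reference, here is a minimal sketch of the same lookup in base R, assuming the two tables are held in data frames named data_snps and data_genes as in the code above. One thing to note: in the `ifelse()` call, `data_genes$gene_id` is picked by the SNP's row index rather than by the row of the matching gene, so this sketch looks up the matching gene explicitly. `find_gene` is a hypothetical helper name, and the result column is called `assigned` to match the output table.

```r
# Toy copies of the two example tables (non-overlapping genes).
data_snps <- data.frame(chrom = 1, pos = c(5, 11, 19, 39, 74, 77, 90))
data_genes <- data.frame(
  chrom   = 1,
  gene_id = paste0("g", 1:7),
  start   = c(2, 14, 29, 37, 50, 70, 87),
  stop    = c(8, 20, 35, 46, 63, 75, 93),
  stringsAsFactors = FALSE
)

# find_gene() is a hypothetical helper: it returns the ID of the first gene on
# chromosome `chr` whose [start, stop] range contains position `p`, or NA.
find_gene <- function(chr, p, genes) {
  hit <- genes$gene_id[genes$chrom == chr & genes$start <= p & genes$stop >= p]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# Apply the helper to every SNP, matching on chromosome as well as position.
data_snps$assigned <- mapply(find_gene, data_snps$chrom, data_snps$pos,
                             MoreArgs = list(genes = data_genes))
print(data_snps)
```

With these toy tables, position 19 lies inside g2 (14 to 20) and position 74 inside g6 (70 to 75), so those are the IDs this sketch reports.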
However, there is a problem that I am not sure how to solve. Some of these genes overlap with each other, so a SNP can fall within the range of two or more genes at the same time. For example, in the table below g1 and g2 overlap, and so do g2 and g5, so the first SNP falls within the range of both g1 and g2.
data.genes.with.overlap
chrom gene_id start stop
1 g1 2 8
1 g2 4 20
1 g3 29 35
1 g4 37 46
1 g5 18 63
1 g6 70 75
1 g7 87 93
I want an output that assigns each SNP to every overlapping gene whose range it falls within.
This is my desired output, for example:
output.desired
chrom pos assigned.A assigned.B assigned.C ... (one column per gene, if a SNP falls within the range of multiple genes)
1 5 1 2
1 11 NA NA
1 19 3 5
1 39 4 4
1 74 5 5
1 77 NA NA
1 90 7 7
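A minimal base R sketch of the kind of multi-gene assignment described above, assuming data frames named data_snps and data_genes that hold the overlapping gene table. The wide column names (assigned.A, assigned.B, ...) follow the desired header; the padding logic is an illustration, not a definitive implementation. It reports the IDs of every gene whose range contains the SNP, so for example position 11 is reported as g2 because it lies within 4 to 20.

```r
# Toy copies of the SNP table and the overlapping gene table.
data_snps <- data.frame(chrom = 1, pos = c(5, 11, 19, 39, 74, 77, 90))
data_genes <- data.frame(
  chrom   = 1,
  gene_id = paste0("g", 1:7),
  start   = c(2, 4, 29, 37, 18, 70, 87),
  stop    = c(8, 20, 35, 46, 63, 75, 93),
  stringsAsFactors = FALSE
)

# For each SNP, collect the IDs of ALL genes on the same chromosome
# whose [start, stop] range contains the SNP position.
hits <- lapply(seq_len(nrow(data_snps)), function(i) {
  data_genes$gene_id[data_genes$chrom == data_snps$chrom[i] &
                     data_genes$start <= data_snps$pos[i] &
                     data_genes$stop  >= data_snps$pos[i]]
})

# Pad every hit list with NA to the same width, then bind into a matrix
# with one row per SNP and one column per possible overlapping gene.
max_hits <- max(lengths(hits), 1)
wide <- do.call(rbind, lapply(hits, function(h) {
  c(h, rep(NA_character_, max_hits - length(h)))
}))
# Assumes at most 26 overlapping genes per SNP (LETTERS has 26 entries).
colnames(wide) <- paste0("assigned.", LETTERS[seq_len(max_hits)])

result <- cbind(data_snps, as.data.frame(wide, stringsAsFactors = FALSE))
print(result)
```

A row-by-row loop like this is fine for a toy example but may be slow for ~900K SNPs; a dedicated interval-overlap routine such as `foverlaps()` in data.table or `findOverlaps()` in GenomicRanges would probably scale better.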
I sincerely appreciate your help. Thanks