how to assign SNPs to each gene with overlapping intervals in r?
1
2
Entering edit mode
5.6 years ago
Ana ▴ 200

Hi all, I have ~900K SNPs and I want to assign them to genes based on their positions. here is a very simple example of what I am doing:

the first data set contains SNPs and their positions on each chromosome (here I am showing only chromosome 1, but I have 17 chromosomes in total)

  data.snps
    chrom   pos
    1   5
    1   11
    1   19
    1   39
    1   74
    1   77
    1   90

The second file contain genes, with start and stop positions on each chromosome (again just for simplicity I am showing only one chromosome)

data.genes
chrom   gene_id start   stop
1   g1  2   8
1   g2  14  20
1   g3  29  35
1   g4  37  46
1   g5  50  63
1   g6  70  75
1   g7  87  93

I want to assign each SNP on each chromosome to any of these genes on that chromosome, if it falls within the range of any of genes. I am doing this way:

data_snps$found <- ifelse(sapply(data_snps$pos, function(p) 
                 any(data_genes$start <= p & data_genes$stop >= p)),data_genes$gene_id, NA)

which gives me the output that I want

output
 chrom pos assigned
     1   5     1
     1  11    NA
     1  19     3
     1  39     4
     1  74     5
     1  77    NA
     1  90     7

However there is a problem that I am not sure how to solve it. Some of these genes overlap with each other and SNPs can fall within the range of 2 or more genes at the same time, for example in this example below g1 and g2 overlap or g2 and g5 overlap, the first SNP can fall within the range of both g1 and g2.

data.genes.with.overlap

chrom   gene_id start   stop
1   g1  2   8
1   g2  4   20
1   g3  29  35
1   g4  37  46
1   g5  18  63
1   g6  70  75
1   g7  87  93

I want want an output to assign these SNPs to each of these overlapping genes

this is my desired output for example: output.desired

chrom pos assigned.A.  assigned.B.    assigned.c  ...(if there are multiple genes that a sop falls within their range)
         1   5     1           2
         1  11    NA       NA
         1  19     3          5
         1  39     4.         4
         1  74     5          5
         1  77    NA       NA
         1  90     7          7

I sincerely appreciate you help. Thanks

r • 1.2k views
ADD COMMENT
3
Entering edit mode
5.6 years ago

It sounds like you want to assign genes (or gene IDs) to SNPs. This can be done with a mapping tool, and doing this on the command line may get you there faster:

$ bedmap --echo --echo-map-id --skip-unmapped --delim='\t' snps.bed genes.bed > A.bed
$ bedops --not-element-of 1 snps.bed A.bed | awk -vOFS="\t" '{ print $0, "NA" }' > B.bed
$ bedops --everything A.bed B.bed > answer.bed

You'll first need to put SNPs and genes into sorted and UCSC BED format to use this approach, but R's write.table() call will get you to tabular output and sort-bed will sort that file.

The file answer.bed can be brought back into R via read.table().

ADD COMMENT

Login before adding your answer.

Traffic: 1869 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6