Question

Mapping a chromosome position with Gene ID

2

Entering edit mode

8.3 years ago

Paul ▴ 80

I am trying to map chromosome positions with all the gene IDs available.

The position file1 I have looks like this

#CHROM      POSID       REF ALT
NC_008596.1 41669   .   CG  C
NC_008596.1 78925   .   GC  G
NC_008596.1 101262  .   GC  G
NC_008596.1 115332  .   GC  G
NC_008596.1 224262  .   A   AC

and the genome position file2 looks like the following

    ChrM        Start   Stop Strand GeneID
chr NC_008596.1 499     1692    +   4535615
chr NC_008596.1 1721    2614    +   4536178
chr NC_008596.1 2624    3778    +   4532650
chr NC_008596.1 3775    4359    +   4537198
chr NC_008596.1 4591    6618    +   4531510
chr NC_008596.1 6648    9176    +   4535988
chr NC_008596.1 9229    10011   +   4533293
chr NC_008596.1 10184   10276   +   4535541
chr NC_008596.1 50411   11211   +   4531499

Now I need to map the positions in file1 (POS ID) with the GeneID given in file2.

The POS ID in file 1 is the number which lies between start and stop codon number. For e.g.41669 lies between

chr NC_008596.1 50411   11211   +   4531499

and the output should be

chr NC_008596.1 50411   11211   +   4531499  41669

the last one is POSID

Please let me know how can I do it? I can work in R, so any solution in R or an excel trick would be helpful.

SNP next-gen sequencing R excel • 4.4k views

ADD COMMENT • link 8.3 years ago by Paul ▴ 80

0

Entering edit mode

did you look at merge in R? https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html

ADD REPLY • link 8.3 years ago by Santosh Anand 5.8k

0

Entering edit mode

to use merge, I think we need to have common column in both the files. Here, POS ID in file1 is the number which lies between Start(499) and Stop(1692) in file 2.

ADD REPLY • link 8.3 years ago by Paul ▴ 80

0

Entering edit mode

So the POS ID of file 1 should lie in between the POS of file 2? What is the criteria of merge?

ADD REPLY • link 8.3 years ago by Santosh Anand 5.8k

0

Entering edit mode

I just updated with an example

ADD REPLY • link 8.3 years ago by Paul ▴ 80

0

Entering edit mode

have a look to reduce in R?

data<-Reduce(function(...) merge(..., all=TRUE, by=c("POSID", "GeneID")), list(file1, file2))

ADD REPLY • link 8.3 years ago by Lila M ★ 1.3k

zx8754 · Answer 1 · 2017-03-29

4

Entering edit mode

8.3 years ago

Santosh Anand 5.8k

If you are working on range-intersection, genomic ranges are very cool and fast, although it requires a little bit of learning: See below posts:

http://stackoverflow.com/questions/11892241/merge-by-range-in-r-applying-loops http://stackoverflow.com/questions/24480031/roll-join-with-start-end-window

If efficiency is not your issue ( => files are not big), then you can run a for-loop for each POSID in file1 to see if if it falls in the same range as file2.

ADD COMMENT • link updated 8.3 years ago by zx8754 12k • written 8.3 years ago by Santosh Anand 5.8k

0

Entering edit mode

Thanks Santosh, It worked

ADD REPLY • link 8.3 years ago by Paul ▴ 80

score 1 · Answer 2 · 2017-04-20

1

Entering edit mode

8.3 years ago

Pierre Lindenbaum 166k

convert your files to BED format using awk and the use bedtools intersect to compute the intersection

ADD COMMENT • link 8.3 years ago by Pierre Lindenbaum 166k

score 1 · Answer 3 · 2017-04-20

Save some time and use BEDOPS.

First, clean up and sort your text files:

$ awk '(NR>1){print $1"\t"$2"\t"($2+1)"\t"$2}' file1.txt | sort-bed - > file1.bed
$ tail -n+2 file2.txt | sort-bed - > file2.bed

Then map the IDs from the first BED file file1.bed to the intervals in the second BED file file2.bed:

$ bedmap --echo --echo-map-id-uniq --delim '\t' file2.bed file1.bed > answer.bed

This should give you the exact format you want.