Mapping a chromosome position with Gene ID
3
2
Entering edit mode
7.7 years ago
Paul ▴ 80

I am trying to map chromosome positions with all the gene IDs available.

The position file1 I have looks like this

#CHROM      POSID       REF ALT
NC_008596.1 41669   .   CG  C
NC_008596.1 78925   .   GC  G
NC_008596.1 101262  .   GC  G
NC_008596.1 115332  .   GC  G
NC_008596.1 224262  .   A   AC

and the genome position file2 looks like the following

    ChrM        Start   Stop Strand GeneID
chr NC_008596.1 499     1692    +   4535615
chr NC_008596.1 1721    2614    +   4536178
chr NC_008596.1 2624    3778    +   4532650
chr NC_008596.1 3775    4359    +   4537198
chr NC_008596.1 4591    6618    +   4531510
chr NC_008596.1 6648    9176    +   4535988
chr NC_008596.1 9229    10011   +   4533293
chr NC_008596.1 10184   10276   +   4535541
chr NC_008596.1 50411   11211   +   4531499

Now I need to map the positions in file1 (POS ID) with the GeneID given in file2.

The POS ID in file 1 is the number which lies between start and stop codon number. For e.g.41669 lies between

chr NC_008596.1 50411   11211   +   4531499

and the output should be

chr NC_008596.1 50411   11211   +   4531499  41669

the last one is POSID

Please let me know how can I do it? I can work in R, so any solution in R or an excel trick would be helpful.

SNP next-gen sequencing R excel • 3.9k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

to use merge, I think we need to have common column in both the files. Here, POS ID in file1 is the number which lies between Start(499) and Stop(1692) in file 2.

ADD REPLY
0
Entering edit mode

So the POS ID of file 1 should lie in between the POS of file 2? What is the criteria of merge?

ADD REPLY
0
Entering edit mode

I just updated with an example

ADD REPLY
0
Entering edit mode

have a look to reduce in R?

data<-Reduce(function(...) merge(..., all=TRUE, by=c("POSID", "GeneID")), list(file1, file2))

ADD REPLY
4
Entering edit mode
7.7 years ago

If you are working on range-intersection, genomic ranges are very cool and fast, although it requires a little bit of learning: See below posts:

http://stackoverflow.com/questions/11892241/merge-by-range-in-r-applying-loops http://stackoverflow.com/questions/24480031/roll-join-with-start-end-window

If efficiency is not your issue ( => files are not big), then you can run a for-loop for each POSID in file1 to see if if it falls in the same range as file2.

ADD COMMENT
0
Entering edit mode

Thanks Santosh, It worked

ADD REPLY
1
Entering edit mode
7.6 years ago

convert your files to BED format using awk and the use bedtools intersect to compute the intersection

ADD COMMENT
1
Entering edit mode
7.6 years ago

Save some time and use BEDOPS.

First, clean up and sort your text files:

$ awk '(NR>1){print $1"\t"$2"\t"($2+1)"\t"$2}' file1.txt | sort-bed - > file1.bed
$ tail -n+2 file2.txt | sort-bed - > file2.bed

Then map the IDs from the first BED file file1.bed to the intervals in the second BED file file2.bed:

$ bedmap --echo --echo-map-id-uniq --delim '\t' file2.bed file1.bed > answer.bed

This should give you the exact format you want.

ADD COMMENT

Login before adding your answer.

Traffic: 1717 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6