Question

Code for looking for overlaps

10.3 years ago

rrsowmya ▴ 20

I have two files, A and B. I want to find the rows that overlap between the two files and write only those rows from file A to a separate file. Could anyone help me with some R code to do this?

I cannot manipulate my data in Excel since it is too large, and I'm still new to R.

Thank you in anticipation for your help.

R • 4.1k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by rrsowmya ▴ 20

the answer is bedtools: http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html

ADD REPLY • link 10.3 years ago by Pierre Lindenbaum 166k

What do you mean by "overlap"? Could you post a couple of example rows from file A and file B and show what you want to happen?

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Sean Davis 27k

Okay, here's the question again:

file A:

A    gene1       33
B    gene2       34
C    gene3       89
D    gene1       09
E    gene3       33
F    gene1       86

File B

A
C
F
T
P
G

I would like rows A, C and F (the IDs that overlap between the two files) written to a separate file.

New file:

A    gene1       33
C    gene3       89
F    gene1       86

Hope this makes better sense.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by rrsowmya ▴ 20

Post this as a comment on your original question and see the answer by Ido Tamir: A: Code for looking for overlaps.

ADD REPLY • link 10.3 years ago by PoGibas 5.1k

Now it's clear.

The merge function will do this for you.

result <- merge(x = file1, y = file2, by.x = "colname", by.y = "colname")

Here, "colname" is the name of the column on which the tables are to be merged.

For more help: https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html
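
As a minimal sketch of how this could look with the example rows from the question: the data are read from inline text rather than from files, and the column name V1 is simply what read.table assigns when there is no header, so adjust it to your real column names.

```r
# Example rows from the question, read from inline text instead of files.
# colClasses = "character" keeps values such as "09" exactly as written.
file1 <- read.table(text = "A gene1 33
B gene2 34
C gene3 89
D gene1 09
E gene3 33
F gene1 86", colClasses = "character")
file2 <- read.table(text = "A
C
F
T
P
G", colClasses = "character")

# Without a header, read.table names the columns V1, V2, ...
# merge() does an inner join by default, so only IDs present in both remain.
result <- merge(x = file1, y = file2, by.x = "V1", by.y = "V1")
print(result)
```

Note that merge sorts the result by the key column by default, so the row order may differ from the order in file A.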

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Deepak Tanwar ★ 4.2k

Using %in% may be faster than merge.

file1[file1[,1] %in% file2[,1],]
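
A rough sketch of this approach on the example rows from the question, including writing the result out. The column names (id, gene, value) and the output path are assumptions made for illustration; with real files you would load file1 and file2 with read.table first.

```r
# Example data from the question, built in memory (column names are hypothetical).
file1 <- data.frame(
  id    = c("A", "B", "C", "D", "E", "F"),
  gene  = c("gene1", "gene2", "gene3", "gene1", "gene3", "gene1"),
  value = c("33", "34", "89", "09", "33", "86"),
  stringsAsFactors = FALSE
)
file2 <- data.frame(id = c("A", "C", "F", "T", "P", "G"),
                    stringsAsFactors = FALSE)

# Logical vector: TRUE where the first column of file1 also occurs in file2.
keep <- file1[, 1] %in% file2[, 1]
overlap <- file1[keep, ]

# Write the matching rows to a tab-separated file (the path is a placeholder).
out_path <- file.path(tempdir(), "overlap.txt")
write.table(overlap, out_path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

Unlike merge, this keeps the rows in their original file A order and never duplicates a row when an ID appears more than once in file B.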

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Sean Davis 27k

If you're really doing it based on the letters and they are unique, you could use the command line:

grep -f fileB fileA

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Madelaine Gogol 5.3k

Just be aware that this will not work in most (more complex) cases. For example, if 'A' is in fileB, you will also get the lines for genes 'AA', 'AB', 'CAG' and anything else containing an A. Adding more regex information to your 'fileB' can help. For example:

A
B
C

Should be:

^A\t
^B\t
^C\t

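This will return only gene 'A', since it specifies that the letter 'A' must occur right at the beginning of the line (^) and must be immediately followed by a tab (\t). Note that whether \t is read as a tab depends on the grep implementation: GNU grep's default syntax does not treat it as one, so the pattern file may need a literal tab character instead.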
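
The difference between loose and anchored matching can also be sketched in R. The lines and IDs below are made up for illustration, and the IDs are assumed to contain no regex metacharacters.

```r
# Made-up tab-separated lines; "AA" and "CAG" contain the letters A and C.
lines <- c("A\tgene1\t33", "AA\tgene2\t34", "CAG\tgene3\t89", "C\tgene3\t12")
ids <- c("A", "C")

# Loose match: the ID may occur anywhere in the line.
loose <- lines[grepl(paste(ids, collapse = "|"), lines)]

# Anchored match: the ID must start the line and be followed by a tab.
# In an R string, "\t" is already a real tab character.
pattern <- paste0("^(", paste(ids, collapse = "|"), ")\t")
strict <- lines[grepl(pattern, lines)]

print(loose)
print(strict)
```

Here the loose pattern keeps all four lines, while the anchored one keeps only the lines whose first field is exactly A or C.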

ADD REPLY • link 10.1 years ago by Eric Normandeau 11k

Ram · Answer 1 · 2015-04-27

As far as I know, the merge and intersect functions work on columns.

Suppose you have two objects, obj1 and obj2.

If you only want to extract the rows they have in common, you could use the following function:

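rows_in_common <- function(x, y) {
  # Collapse each row into one string key so that whole rows can be compared.
  # A tab separator avoids false matches such as ("ab", "c") vs ("a", "bc").
  a <- apply(x, 1, paste, collapse = "\t")
  b <- apply(y, 1, paste, collapse = "\t")
  # Keep the rows of x whose key also occurs among the rows of y.
  x[a %in% b, , drop = FALSE]
}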

You can then obtain the result with:

result <- rows_in_common(x = obj1, y = obj2)

Also, this question is not really about bioinformatics; it is a programming question. In future, please post programming questions to: http://stackoverflow.com/

Ram · Answer 2 · 2015-04-27

If you are planning to analyze genomic ranges in R (great choice!), GenomicRanges from Bioconductor is all you need. This introduction by Dave Tang is a very good start.

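# First, you'll need to install it via BiocManager.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("GenomicRanges")
library("GenomicRanges")

# Read in your data
data <- read.table("test.bed", header = FALSE)
# Format it (assumes columns 1-3 are chromosome, start, end;
# BED starts are 0-based, GRanges starts are 1-based)
bed <- GRanges(seqnames = data$V1, ranges = IRanges(start = data$V2 + 1, end = data$V3))

# Read in and format second file ("test2.bed" is a placeholder name)
data2 <- read.table("test2.bed", header = FALSE)
bed2 <- GRanges(seqnames = data2$V1, ranges = IRanges(start = data2$V2 + 1, end = data2$V3))

# And intersect
intersect(bed, bed2)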

There is more documentation here.

Ram · Answer 3 · 2015-04-27

10.3 years ago

Ido Tamir 5.2k

This is called a join, and the function that does it in R is called merge.

merge two files

I would suggest you start by a) learning to define your questions more precisely and b) looking at e.g. http://swirlstats.com/ or other online resources for R.

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Ido Tamir 5.2k

Ohhh, now I get it ... This was about merge and not intersect :-|

ADD REPLY • link 10.3 years ago by PoGibas 5.1k

Maybe it was, I don't know. That's why I wrote that they should be more precise in their question.

ADD REPLY • link 10.3 years ago by Ido Tamir 5.2k