Question

how to remove matching Gene ID from data frame in R

10.0 years ago

M K ▴ 660

Hi All,

I have a file (data.frame) with a column called Gene ID, and I have positive pairs in another file (data.frame). I want to remove (exclude) the positive pairs that match those in the original file, so that only the negative pairs are left. I have about 3000 obs. Is there any way to remove the positive pairs?

R • 13k views

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by M K ▴ 660

Maybe this logic might help:

1. Subtract (set-minus operation) the IDs of frame2 from the IDs of frame1 and store the result in a vector.
2. Extract all rows from frame1 whose IDs are in the vector created above.

Not sure of the exact functions in R, but this should be possible (from my limited R experience).
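The set-minus logic above maps onto base R's setdiff() followed by an %in% filter. A minimal sketch, assuming two toy data frames named frame1 and frame2 that each have a geneid column (the names and IDs are made up for illustration, not taken from the question):

```r
# Toy data; frame1, frame2 and the geneid column are illustrative assumptions.
frame1 <- data.frame(geneid = c("G1", "G2", "G3", "G4"),
                     value  = c(10, 20, 30, 40),
                     stringsAsFactors = FALSE)
frame2 <- data.frame(geneid = c("G2", "G4"), stringsAsFactors = FALSE)

# Step 1: IDs present in frame1 but not in frame2 (set-minus).
keep_ids <- setdiff(frame1$geneid, frame2$geneid)

# Step 2: keep the rows of frame1 whose ID is in that vector.
negatives <- frame1[frame1$geneid %in% keep_ids, ]
```

Here negatives holds the rows for G1 and G3 only.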

ADD REPLY • link 10.0 years ago by Ram 44k

Ram · Answer 1 · 2014-11-09

10.0 years ago

Brice Sarver ★ 3.8k

Load the two data frames as two objects. We'll call them a and b. Let's assume that the gene IDs are stored in a column called geneid and the data frames can be subset using standard $ indexing.

The function you want is match(). Specifically,

matches <- match(a$geneid, b$geneid)

will return, for each element of a$geneid, the position of its first match in b$geneid (NA where there is no match). You can then (negatively) subset b on these indices.
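To make the return value concrete, here is a small sketch with toy data. It assumes a holds the IDs to exclude and b is the table to filter; the IDs and the score column are invented for the example:

```r
# Toy data; the IDs and the score column are illustrative assumptions.
a <- data.frame(geneid = c("G2", "G9", "G4"), stringsAsFactors = FALSE)
b <- data.frame(geneid = c("G1", "G2", "G3", "G4"),
                score  = c(0.1, 0.2, 0.3, 0.4),
                stringsAsFactors = FALSE)

# One value per element of a$geneid: a position in b$geneid, or NA.
matches <- match(a$geneid, b$geneid)
matches                           # 2 NA 4 ("G9" is not in b)

# Positions that actually exist in b.
hits <- matches[!is.na(matches)]  # 2 4
```

Note that the result always has the same length as a$geneid, whether or not every ID was found.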

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Brice Sarver ★ 3.8k

Thanks Brice Sarver,

I got the indices, so how can I subset the data based on indices?

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by M K ▴ 660

The same way you normally would. Since the indices correspond to rows, you can use [] notation.

dataframe[-matches, ]

This subsets a data frame object (here, 'dataframe') on the indices. The '-' in front indicates that these rows should be excluded. Don't forget the comma; a data frame is two-dimensional.
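Two edge cases are worth guarding against with negative indexing: NA entries in the index vector (from IDs with no match), which should be removed before negating, and an empty index vector, for which dataframe[-matches, ] returns zero rows rather than all of them. A hedged sketch with toy data; drop_rows() is a hypothetical helper written for this example, not a base R function:

```r
# Toy data; column names and values are illustrative assumptions.
dataframe <- data.frame(geneid = c("G1", "G2", "G3", "G4"),
                        score  = c(0.1, 0.2, 0.3, 0.4),
                        stringsAsFactors = FALSE)

# drop_rows() is a hypothetical helper, not part of base R.
drop_rows <- function(df, idx) {
  idx <- unique(idx[!is.na(idx)])    # discard NA and repeated indices
  if (length(idx) == 0) return(df)   # nothing to drop: keep every row
  df[-idx, , drop = FALSE]
}

kept_some <- drop_rows(dataframe, c(2L, NA, 4L))  # drops rows 2 and 4
kept_all  <- drop_rows(dataframe, integer(0))     # keeps all four rows
```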

Edit: clarity.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Brice Sarver ★ 3.8k

I did that, but the numbers do not add up. Here is my work:

1. The first data frame (b) contains 22 columns and 22279 rows, where the first column is the Gene ID.

2. The second data frame (a), which I need to exclude from (b), contains 5 columns and 3019 rows, with rows like this:

ENSG00000223669    ENSG00000101188    -    +    antisense;protein_coding

I kept only the first column of this data frame, which is the Gene ID, and removed the rest. I then used the match function, followed by this command to get the negative pairs: Neg_pairs1 <- b[-matches, ]

I got 3019 matched rows in total, and when I ran Neg_pairs1 <- b[-matches, ] it gave me 19613 rows. Adding 19613 to 3019 gives 22632, not 22279, which means there is a difference of 353 obs.

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.0 years ago by M K ▴ 660

This could be explained by each record in the pairs frame having a unique (ID1, ID2) combination but not necessarily a unique ID1. Repeated ID1 values all point to the same row, so the number of rows actually removed (22279 - 19613 = 2666) is smaller than the 3019 matches.

Try adding count(distinct matches) to 19613 and check whether the result is 22279, in which case count(distinct matches) is less than 3019 (which is count(matches)). [Sorry for the SQL terms]
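In R terms, count(matches) would be length(matches) and count(distinct matches) would be length(unique(matches)). A small sketch of the check on toy data, assuming the IDs to exclude contain a repeated ID (the vectors here are made up, not the poster's data):

```r
# Toy data; "G2" appears twice, as it would if it took part in two pairs.
a_ids <- c("G2", "G2", "G4")
b <- data.frame(geneid = c("G1", "G2", "G3", "G4"), stringsAsFactors = FALSE)

matches    <- match(a_ids, b$geneid)   # 2 2 4
n_matches  <- length(matches)          # count(matches): 3
n_distinct <- length(unique(matches))  # count(distinct matches): 2

# A repeated negative index drops its row only once.
neg <- b[-matches, , drop = FALSE]

# Rows kept plus distinct rows removed should give the original row count.
adds_up <- nrow(neg) + n_distinct == nrow(b)
```

If adds_up is TRUE while nrow(neg) + n_matches overshoots, repeated IDs are the likely cause of the gap.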

ADD REPLY • link 10.0 years ago by Ram 44k

This could be the result of the order of the arguments in your match() call. It can be a little tricky: match(x, table) returns one value per element of x, each a position in table. Check out help(match) to make sure that the way you are indexing your data frame is correct. You can also use the %in% operator:

a$geneid %in% b$geneid

to return logicals that may make more sense and then get a subset of indices by wrapping this in which().

Be aware that match() returns only the position of the first match. If there are duplicates, this will affect your outcome.
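A sketch of the %in% route and of how duplicates change the picture, using toy data. It assumes b is the frame to filter and a holds the IDs to exclude, so the test is written as b$geneid %in% a$geneid to get one logical per row of b:

```r
# Toy data; "G2" is duplicated in both frames on purpose.
a <- data.frame(geneid = c("G2", "G4", "G2"), stringsAsFactors = FALSE)
b <- data.frame(geneid = c("G1", "G2", "G2", "G3", "G4"),
                stringsAsFactors = FALSE)

in_a      <- b$geneid %in% a$geneid    # one TRUE/FALSE per row of b
hit_rows  <- which(in_a)               # rows of b found in a: 2 3 5
neg_pairs <- b[!in_a, , drop = FALSE]  # rows of b whose ID is not in a

# match() reports only the first "G2" in b (row 2), so row 3 is missed.
first_hits <- unique(match(a$geneid, b$geneid))  # 2 5
```

Subsetting with the logical vector avoids the negative-index step entirely and removes every row of b that carries an excluded ID, not just the first one.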

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Brice Sarver ★ 3.8k