How to remove matching gene IDs from a data frame in R
10.0 years ago
M K ▴ 660

Hi All,

I have a file (data.frame) with a column called Gene ID, and I have positive pairs in another file (data.frame). I want to remove (exclude) the rows of the original file that match the positive pairs, so that only the negative pairs are left. I have about 3000 obs. Is there any way to remove the positive pairs?

Maybe this logic might help:

  1. Subtract (set-minus operation) the IDs of frame2 from the IDs of frame1 and store the result in a vector.
  2. Extract all rows from frame1 whose IDs match the vector created above.

I'm not sure of the exact functions in R, but this should be possible (from my limited R experience).
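For what it's worth, the two steps above could be sketched in base R roughly as follows. The names frame1, frame2 and ID, and the toy values, are placeholders rather than anything from the original data; setdiff() does the set-minus and %in% does the extraction.

```r
# Toy data; frame1, frame2 and the ID column are hypothetical stand-ins.
frame1 <- data.frame(ID = c("g1", "g2", "g3", "g4"), value = 1:4,
                     stringsAsFactors = FALSE)
frame2 <- data.frame(ID = c("g2", "g4"), stringsAsFactors = FALSE)

# Step 1: IDs that are in frame1 but not in frame2 (set difference).
keep_ids <- setdiff(frame1$ID, frame2$ID)

# Step 2: all rows of frame1 whose ID is in that vector.
negatives <- frame1[frame1$ID %in% keep_ids, ]
print(negatives)
```

Note that setdiff() returns each ID only once, but the %in% step still keeps every row of frame1 carrying one of those IDs.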

10.0 years ago
Brice Sarver ★ 3.8k

Load the two data frames as two objects. We'll call them a and b. Let's assume that the gene IDs are stored in a column called geneid and the data frames can be subset using standard $ indexing.

The function you want is match(). Specifically,

matches <- match(a$geneid, b$geneid)

will return, for each element of a$geneid, the position of its first match in b$geneid (NA where there is no match). You can then (negatively) subset b on these indices.
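To make the return value concrete, here is a small sketch on made-up data. The data frames a and b and the column geneid follow the naming assumed above; the IDs and the score column are invented for illustration.

```r
# Hypothetical positive set (a) and full set (b).
a <- data.frame(geneid = c("g2", "g5", "g4"), stringsAsFactors = FALSE)
b <- data.frame(geneid = c("g1", "g2", "g3", "g4"),
                score  = c(0.1, 0.2, 0.3, 0.4),
                stringsAsFactors = FALSE)

# One entry per element of a$geneid: the row of b holding its first match.
matches <- match(a$geneid, b$geneid)
print(matches)  # 2 NA 4: "g5" does not occur in b, so its entry is NA
```

The result has the length of a$geneid, and its values are row numbers of b, which is why the subsetting is done on b.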

Thanks Brice Sarver,

I got the indices, so how can I subset the data based on indices?

The same way you normally would. Since the indices correspond to rows, you can use [] notation.

dataframe[-matches, ]

This subsets a data frame object (here, 'dataframe') on the indices. The '-' in front indicates that these rows should be excluded. Don't forget the comma; a data frame is two-dimensional.
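A slightly more defensive version of the same idea might look like the sketch below. It assumes the a/b/geneid naming from the answer and uses invented toy values. The extra steps are there because match() can return NA, which R does not accept among negative subscripts, and because subsetting with an empty negative index returns zero rows instead of the whole data frame.

```r
# Hypothetical data, same naming as in the answer above.
a <- data.frame(geneid = c("g2", "g5", "g4"), stringsAsFactors = FALSE)
b <- data.frame(geneid = c("g1", "g2", "g3", "g4"),
                score  = c(0.1, 0.2, 0.3, 0.4),
                stringsAsFactors = FALSE)

matches <- match(a$geneid, b$geneid)

# Drop NA entries (IDs in a that are absent from b) and repeated indices.
hits <- unique(matches[!is.na(matches)])

# Guard against the empty case: b[-integer(0), ] would return no rows.
remaining <- if (length(hits) > 0) b[-hits, ] else b
print(remaining)
```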

I did that, but the number of rows removed plus the number of rows left does not add up to the total. Here is my work:

1. The first data frame (b) contains 22 columns and 22279 rows, where the first column is the Gene ID.

2. The second data frame (a), which I need to exclude from (b), contains 5 columns and 3019 rows, like this:

ENSG00000223669    ENSG00000101188    -    +    antisense;protein_coding

I kept the first column of this data frame, which is the Gene ID, and removed the rest. Then I used the match function, followed by this command to get the negative pairs: Neg_pairs1 <- b[-matches, ]

The total number of matched rows was 3019, but running Neg_pairs1 <- b[-matches, ] gave me 19613 rows. Adding 19613 to 3019 gives 22632, not 22279, which means there is a difference of 353 obs.

This could be explained by each record in frame b having a unique (ID1, ID2) pair but not necessarily a unique ID1. So, when you deleted all rows where ID1 had one of a set of values, it deleted more than expected.

Try adding count(distinct matches) to 19613 and check whether the result equals 22279. If it does, count(distinct matches) is less than 3019, which is count(matches). [Sorry for the SQL terms]
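In R terms, count(matches) would be length(matches) and count(distinct matches) would be length(unique(matches)). The sketch below runs that check on invented IDs where one ID is listed twice in the positive set; a_ids and b_ids are hypothetical names. If the same thing is happening in the real data, the distinct count there would be 22279 - 19613 = 2666, but that is only a guess until it is checked.

```r
# Hypothetical IDs: "g2" appears twice in the positive set.
a_ids <- c("g2", "g4", "g2", "g3")
b_ids <- c("g1", "g2", "g3", "g4", "g5")

matches <- match(a_ids, b_ids)

n_matches   <- length(matches)          # SQL: count(matches)
n_distinct  <- length(unique(matches))  # SQL: count(distinct matches)
n_remaining <- length(b_ids[-matches])  # elements left after exclusion

# The repeated index removes an element only once, so the distinct count,
# not the raw count, is what adds up to the original length.
print(c(matches = n_matches, distinct = n_distinct, remaining = n_remaining))
stopifnot(n_remaining + n_distinct == length(b_ids))
```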

This could be the result of the order of the arguments to your match() call. It can be a little tricky. Check out help(match) to make sure that the way you are parsing your data frame is correct. You can also use the %in% operator:

a$geneid %in% b$geneid

to return logicals that may make more sense and then get a subset of indices by wrapping this in which().

Be aware that match() returns the position of the first match only. If there are duplicates, this will affect your outcome.
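As a rough illustration of the difference, the sketch below uses made-up data in which one gene ID occurs in two rows of b. The partner column is invented. Note the direction: to get one logical per row of b, the test is written as b$geneid %in% a$geneid, the reverse of the line above.

```r
# Hypothetical data: "g2" occurs in two rows of b.
a <- data.frame(geneid = c("g2", "g4"), stringsAsFactors = FALSE)
b <- data.frame(geneid  = c("g1", "g2", "g2", "g3", "g4"),
                partner = c("p1", "p2", "p3", "p4", "p5"),
                stringsAsFactors = FALSE)

# One TRUE/FALSE per row of b: is this row's gene ID in the positive set?
is_positive <- b$geneid %in% a$geneid
neg_pairs   <- b[!is_positive, ]

# For comparison: match() only finds the first "g2", so one copy survives.
first_only <- b[-match(a$geneid, b$geneid), ]

print(neg_pairs)
print(first_only)

# With the logical approach the counts always add up to nrow(b).
stopifnot(sum(is_positive) + nrow(neg_pairs) == nrow(b))
```

Logical indexing also avoids the NA and empty-index issues, since no row numbers are involved.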
