How to find duplicate values in two columns and move them to a third?
1
0
Entering edit mode
9.1 years ago
dally ▴ 210

I have two columns of data containing merged peak chr start, start, end, gene name data that looks like the following:

Essentially what I did was copy and pasted the values I obtained using a script that finds peaks that are similar between different chip-seq experiments. Column 1 contains peaks in protein 1, 2, and 3 and Column 2 contains peaks in protein 1, 2, 3, and 4.

What I am trying to do now is find all the duplicated values in Columns A that are also found in Column B ... not just by row but if there is a similar value in B2 and it is found in A4506 then I want it to find that value also and then spit all the duplicated values into Column C.

While this dataset isn't essentially "huge" ... only about 4.5k rows in either column, it's too much to go through by hand and I can't find a good answer on any internet topic. I've tried varieties of VLOOKUP, IFCOUNT and other functions but they only search the row 1 in column A to the row 1 in column B or simply do not work on the Mac version of Excel (for instance I found a add-in that could supposedly do this but the add-in was not able to be used by Mac, I got an error indicating it was not properly formatted for mac).

Can anyone help or point me to a correct source?

excel duplicates • 5.2k views
ADD COMMENT
1
Entering edit mode

Hi, do you want any kind of solutions (with scripts) or only Excel solution?

ADD REPLY
0
Entering edit mode

Any solution will work. I am a wet lab research assistant with only minor knowledge of programming languages so if you could just try to give as much as a detailed overview as you can be bothered of how your script works and what data would go where that would be appreciated.

EDIT: I totally missed the excel answer you found. I will attempt this, but it seems to based on a windows os from screenshots. Still I should be able to reproduce it. I will edit with any results.

EDIT: Excel link did not work.

ADD REPLY
1
Entering edit mode
VLOOKUP works. Make sure you lock the lookup table (with the dollar signs or F4). This way you look for exact same coordinates, if you want to find overlapping peaks that might not have the exact same coordinates use bedops or intersectBed
ADD REPLY
1
Entering edit mode
9.1 years ago
michael.ante ★ 3.9k

Hi daily

The data looks easily transformable to bed-format. Afterwards, you can use intersectBed to find overlapping features.

Cheers,

Michael

ADD COMMENT
0
Entering edit mode

Transforming using bedtools would make use of the cat function?

ADD REPLY
2
Entering edit mode

not tested but something like this would work to create a bed file:

cat myFile.txt | tr ',' '\t' > myOutput.bed

since you have two cols you could try

cat myFile.txt | awk '{print $1}' | tr ',' '\t' > myOutput.bed
cat myFile.txt | awk '{print $2}' | tr ',' '\t' >> myOutput.bed
ADD REPLY
0
Entering edit mode

Was able to make this work, but it seems to have returned most of the coordinates which I find odd.

I used:

intersectBed -a fourproteins.bed -b threeproteins.bed > mergedproteins.bed

Do you know if intersectBed looks for only exact coordinates or for coordinates between a set distance of the coordinate on file?

ADD REPLY
0
Entering edit mode

it depends how you set it up, the documentation shows that you can consider even only one nt overlap.

ADD REPLY
0
Entering edit mode

I wonder: if my main purpose is to find what peaks in proteins 1, 2, 3 overlap but do not contain protein 4 would I use the -v function of intersectBed?

ADD REPLY

Login before adding your answer.

Traffic: 1646 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6