Question

Unique list based on multiple columns

0

7.7 years ago

Sam ▴ 150

Hi

How can I obtain unique reads based on two different columns? Thanks.

input:
    A, 1
    A, 2
    A, 2
    B, 1
    B, 2
    B, 1
    C, 1
    C, 3
    C, 3
output:
    A, 1
    A, 2
    B, 2
    B, 1
    C, 1
    C, 3

awk sort • 3.7k views

ADD COMMENT • link updated 7.7 years ago by aka001 ▴ 190 • written 7.7 years ago by Sam ▴ 150

score 2 · Answer 1 · 2017-11-01

2

7.7 years ago

JC 13k

Use sort and uniq commands:

sort myfile | uniq > output
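
A minimal sketch of what this does on the example input from the question. The temporary file and the variable names (`input`, `result`) are placeholders chosen for illustration, not part of the original answer.

```shell
# Recreate the example input from the question in a temporary file.
input=$(mktemp)
printf '%s\n' 'A, 1' 'A, 2' 'A, 2' 'B, 1' 'B, 2' 'B, 1' 'C, 1' 'C, 3' 'C, 3' > "$input"

# sort places identical lines next to each other;
# uniq then collapses each run of identical lines to a single line.
result=$(sort "$input" | uniq)
printf '%s\n' "$result"

rm -f "$input"
```

Note that this compares whole lines and returns them in sorted order, not in the order they appear in the input.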

ADD COMMENT • link 7.7 years ago by JC 13k

0

Thanks for your code, but I want to obtain unique reads according to two different columns in my input file. Please check my example.

ADD REPLY • link 7.7 years ago by Sam ▴ 150

1

The code above generates the results you asked for. If that example data does not represent the real data, then you need to provide an appropriate example.

ADD REPLY • link 7.7 years ago by GenoMax 152k

0

If the columns are not at the beginning of the table, you can extract the columns using cut:

cut -f2,4 myfile | sort | uniq > output
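
One thing to watch: cut splits on tabs by default, so a comma-separated file needs an explicit delimiter. The sketch below assumes a comma-delimited file with at least four columns; the sample rows and the names `input` and `pairs` are made up for illustration.

```shell
# Hypothetical comma-separated input with four columns.
input=$(mktemp)
printf '%s\n' 'id1,geneA,x,T1' 'id2,geneA,y,T1' 'id3,geneB,z,T2' > "$input"

# -d',' sets the field delimiter; -f2,4 keeps only columns 2 and 4.
# sort | uniq then removes repeated column pairs.
pairs=$(cut -d',' -f2,4 "$input" | sort | uniq)
printf '%s\n' "$pairs"

rm -f "$input"
```

This keeps only the two extracted columns, so the rest of each line is lost in the output.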

ADD REPLY • link 7.7 years ago by JC 13k

score 0 · Answer 2 · 2017-11-01

0

7.7 years ago

Alex Reynolds 36k

Here is a way that gets around some issues with other approaches:

$ awk '!a[$0]++' input.txt > output.txt

Here's what output would look like, from your example:

$ cat output.txt
A, 1
A, 2
B, 1
B, 2
C, 1
C, 3

If your input looks like something else, then this approach would need modifications.
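
For readers unfamiliar with the idiom, here is a spelled-out sketch of the same idea: an associative array remembers every line already seen, and a line is printed only the first time it appears. The array name `seen` and the temporary file are illustrative choices.

```shell
# Recreate the example input from the question.
input=$(mktemp)
printf '%s\n' 'A, 1' 'A, 2' 'A, 2' 'B, 1' 'B, 2' 'B, 1' 'C, 1' 'C, 3' 'C, 3' > "$input"

# Long form of the one-liner: print a line only on its first occurrence.
first_seen=$(awk '{
    if (!($0 in seen)) {   # line not encountered before
        print              # keep the first occurrence
    }
    seen[$0] = 1           # remember the line for later comparisons
}' "$input")
printf '%s\n' "$first_seen"

rm -f "$input"
```

Unlike sort-based approaches, the input does not need to be sorted and the original line order is preserved.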

ADD COMMENT • link 7.7 years ago by Alex Reynolds 36k

0

My input format is the same as below, and I need to obtain unique reads according to the 2nd and 4th columns:

MIRT000415  , hsa-let-7a-5p,    Homosapiens,    CDK6,   1021,   Homosapiens,    Luciferase reporter assay

ADD REPLY • link 7.7 years ago by Sam ▴ 150

1

In that case, use the following modification:

$ awk -v FS=',' '!a[$2,$4]++' input.txt > output.txt

This will report the first line seen for the combination of the 2nd and 4th columns. Second and subsequent "hits" are not reported.
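
A note on how the two columns are combined into one array key: if the fields are simply concatenated, two different pairs can produce the same key (for example "ab" + "c" and "a" + "bc"). Separating them with a comma inside the brackets makes awk join them with its SUBSEP character, which avoids that. The sketch below uses made-up rows to show the difference; the variable names are illustrative.

```shell
# Two rows whose 2nd and 4th columns differ but concatenate to the same string.
input=$(mktemp)
printf '%s\n' 'r1,ab,x,c' 'r2,a,x,bc' > "$input"

# Plain concatenation: both rows get the key "abc", so the second is dropped.
concat=$(awk -F',' '!a[$2 $4]++' "$input")

# Comma-separated subscripts: keys are joined with SUBSEP and stay distinct.
subsep=$(awk -F',' '!a[$2,$4]++' "$input")

printf 'concatenated key:\n%s\n' "$concat"
printf 'SUBSEP key:\n%s\n' "$subsep"

rm -f "$input"
```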

If you want to instead use sort, you will need to use some additional options:

$ sort -u -k2,2 -k4,4 -t, input.txt > output.txt

Without reading the man pages, I'm unsure if sort is stable, so you might get a different answer on repeated runs.

In addition to flexibility on the keys used for filtering, the awk approach runs much faster on very large input (at the expense of memory), so if you're working with whole-genome scale input, then you may want to use awk, instead of sort | uniq or sort -u -based approaches.
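
A small sketch contrasting the two approaches on made-up rows. The awk filter keeps the first line for each key in input order, while sort -u returns one line per key in key order; which of several duplicate lines sort keeps may depend on the sort implementation, so the sketch does not rely on it.

```shell
# Hypothetical rows: r1 and r3 share the same 2nd and 4th columns.
input=$(mktemp)
printf '%s\n' 'r1,geneB,x,T1' 'r2,geneA,x,T1' 'r3,geneB,y,T1' > "$input"

# awk: first occurrence per (column 2, column 4) pair, input order preserved.
awk_out=$(awk -F',' '!a[$2,$4]++' "$input")

# sort -u: one line per key, ordered by column 2 and then column 4.
sort_out=$(sort -t, -u -k2,2 -k4,4 "$input")

printf 'awk:\n%s\n' "$awk_out"
printf 'sort -u:\n%s\n' "$sort_out"

rm -f "$input"
```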

ADD REPLY • link 7.7 years ago by Alex Reynolds 36k

0

Hi Alex, could you please help me with this post? compare two text file

ADD REPLY • link 7.7 years ago by Sam ▴ 150

0

The answer here should work, I think: C: compare two text file

ADD REPLY • link 7.7 years ago by Alex Reynolds 36k

0

No, unfortunately. I've already tested them.

ADD REPLY • link 7.7 years ago by Sam ▴ 150

0

It would perhaps be easier to help if you posted your two files somewhere public (pastebin, Dropbox, etc.) and explained more explicitly what your filters are.

ADD REPLY • link 7.7 years ago by Alex Reynolds 36k

0

Can I have your email address?

ADD REPLY • link 7.7 years ago by Sam ▴ 150

0

You could just use pastebin: https://pastebin.com/

ADD REPLY • link 7.7 years ago by Alex Reynolds 36k

0

Please check this link: https://mega.nz/fm/V6413RBB

ADD REPLY • link 7.7 years ago by Sam ▴ 150

0

I’m sorry but I will not be signing up for an account with that site. Just use pastebin or publish to a public folder in Dropbox or similar, if you want to.

ADD REPLY • link 7.7 years ago by Alex Reynolds 36k

0

Problem finally solved; it was due to a blank line in the text1 file. Thanks for your time.
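
For anyone hitting the same issue, one possible way to drop blank lines before comparing files is sketched below. It assumes the blank (or whitespace-only) lines are the only problem; the file contents and the name `cleaned` are made up for illustration.

```shell
# Hypothetical file containing an empty line and a whitespace-only line.
text1=$(mktemp)
printf '%s\n' 'A, 1' '' 'B, 2' '   ' 'C, 3' > "$text1"

# NF is the number of fields on a line; it is 0 for empty or
# whitespace-only lines, so the pattern NF prints only non-blank lines.
cleaned=$(awk 'NF' "$text1")
printf '%s\n' "$cleaned"

rm -f "$text1"
```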

ADD REPLY • link 7.7 years ago by Sam ▴ 150

0

Then accept answer(s) that worked (use the green check mark against the answer) to provide closure to this thread.

ADD REPLY • link 7.7 years ago by GenoMax 152k

score 0 · Answer 3 · 2017-11-02

Based on the example in one of your comments, you can do it with this:

awk -F',' '!seen[$2,$4]++' your_file.txt

You might have problems later on when there are multiple lines with the same 2nd and 4th columns but different values in some other columns. However, as you didn't mention it, the awk command above will work fine.
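
To check whether that situation occurs in your data, a sketch like the following lists each combination of the 2nd and 4th columns that appears on more than one line, together with its line count. The sample rows and the names `count` and `dups` are illustrative assumptions.

```shell
# Hypothetical rows: the first two share columns 2 and 4 but differ in column 5.
input=$(mktemp)
printf '%s\n' 'id1,geneA,x,T1,assayP' 'id2,geneA,x,T1,assayQ' 'id3,geneB,x,T2,assayP' > "$input"

# Count lines per (column 2, column 4) pair, then report pairs seen more than once.
dups=$(awk -F',' '
    { count[$2 FS $4]++ }
    END { for (k in count) if (count[k] > 1) print k "\t" count[k] }
' "$input")
printf '%s\n' "$dups"

rm -f "$input"
```

If this prints nothing, every pair occurs once and the deduplication above discards no extra information.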