Unique list based on multiple columns
7.1 years ago
Sam ▴ 150

Hi

How can I obtain unique reads based on two different columns? Thanks.

input:
    A, 1
    A, 2
    A, 2
    B, 1
    B, 2
    B, 1
    C, 1
    C, 3
    C, 3
output:
    A, 1
    A, 2
    B, 2
    B, 1
    C, 1
    C, 3
awk sort • 3.1k views
7.1 years ago
JC 13k

Use sort and uniq commands:

sort myfile | uniq > output

Thanks for your code, but I want to obtain unique reads according to two different columns in my input file. Please check my example.


The code above generates results you ask for. If that example data does not represent real data then you need to provide an appropriate example.


If the columns are not at the beginning of the table, you can extract them using cut:

cut -f2,4 myfile | sort | uniq > output
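
Note that cut splits on tabs by default, so for comma-separated input a delimiter would have to be given. A minimal sketch, assuming a made-up file `myfile.csv`; only the two selected columns survive in the output:

```shell
# Hypothetical comma-separated sample; columns 2 and 4 are the keys
cat > myfile.csv <<'EOF'
a,K1,b,V1
c,K1,d,V1
e,K2,f,V1
EOF

# -d',' sets the field delimiter; sort -u is equivalent to sort | uniq
cut -d',' -f2,4 myfile.csv | sort -u > pairs.txt
```

If the full lines are needed rather than just the key columns, this approach would not be enough on its own.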
7.1 years ago

Here is a way that gets around some issues with other approaches, such as needing to sort the input first:

$ awk '!a[$0]++' input.txt > output.txt

Here's what output would look like, from your example:

$ cat output.txt
A, 1
A, 2
B, 1
B, 2
C, 1
C, 3

If your input looks like something else, then this approach would need modifications.


My input format is the same as below, and I need to obtain unique reads according to the 2nd and 4th columns:

MIRT000415  , hsa-let-7a-5p,    Homosapiens,    CDK6,   1021,   Homosapiens,    Luciferase reporter assay

In that case, use the following modification:

$ awk -v FS=',' '!a[$2,$4]++' input.txt > output.txt

This will report the first line seen for the combination of the 2nd and 4th columns. Second and subsequent "hits" are not reported.
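
The example line has uneven padding around the commas, so the same pair could be treated as different keys if the spacing varies between lines. A minimal sketch that trims the key fields first, assuming made-up rows in `input.txt` (the IDs and values are for illustration only):

```shell
# Hypothetical sample data with uneven padding around the commas
cat > input.txt <<'EOF'
MIRT000001, hsa-let-7a-5p,    Homosapiens,    CDK6,   1021
MIRT000002,hsa-let-7a-5p , Homosapiens, CDK6 , 1021
MIRT000003, hsa-let-7a-5p, Homosapiens, MYC, 4609
EOF

# Copy fields 2 and 4, strip leading/trailing blanks from the copies,
# then print only the first line seen for each (field 2, field 4) pair.
# The original line is printed unchanged.
awk -F',' '
{
    k2 = $2; k4 = $4
    gsub(/^[ \t]+|[ \t]+$/, "", k2)
    gsub(/^[ \t]+|[ \t]+$/, "", k4)
}
!seen[k2, k4]++
' input.txt > output.txt
```

With this sample, the second row should be dropped because its trimmed key matches the first row.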

If you want to instead use sort, you will need to use some additional options:

$ sort -u -k2,2 -k4,4 -t, input.txt > output.txt

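I haven't checked the man pages on whether sort is stable, so which of several lines sharing a key gets kept may differ from the awk result. The output will also be ordered by the key columns rather than by input order.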

In addition to flexibility on the keys used for filtering, the awk approach runs much faster on very large input (at the expense of memory), so if you're working with whole-genome scale input, then you may want to use awk, instead of sort | uniq or sort -u -based approaches.
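
To compare what the two approaches produce, one could run both on a small file. This is only a sketch, assuming a made-up `demo.csv`; the file names are hypothetical:

```shell
# Hypothetical sample: two lines share the key (zeta, G2)
cat > demo.csv <<'EOF'
id3,zeta,x,G2
id1,alpha,x,G1
id2,zeta,y,G2
EOF

# awk keeps input order and reports the first line seen per key
awk -F',' '!seen[$2,$4]++' demo.csv > by_awk.txt

# sort -u emits one line per key, ordered by the key columns
LC_ALL=C sort -t, -u -k2,2 -k4,4 demo.csv > by_sort.txt
```

Both outputs should contain one line per key, but `by_awk.txt` starts with the `zeta` line (first in the input) while `by_sort.txt` starts with the `alpha` line.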


Hi Alex, could you please help me with this post? compare two text file


The answer here should work, I think: C: compare two text file


No, unfortunately. I've already tested them.


It would perhaps be easier to help if you posted your two files somewhere public (pastebin, Dropbox, etc.) and explained more explicitly what your filters are.


Can I have your email address?


You could just use pastebin: https://pastebin.com/


Please check this link: https://mega.nz/fm/V6413RBB


I’m sorry but I will not be signing up for an account with that site. Just use pastebin or publish to a public folder in Dropbox or similar, if you want to.


Problem finally solved. It was due to a blank line in the text1 file. Thanks for your time.


Then accept answer(s) that worked (use the green check mark against the answer) to provide closure to this thread.

7.1 years ago
aka001 ▴ 190

Based on the example in one of your comments, you can do it with this:

awk -F',' '!seen[$2,$4]++' your_file.txt

You might have problems later on when there are multiple lines with the same 2nd and 4th columns but different values in some other columns, since only the first such line is kept. However, as you didn't mention it, the awk command above will work fine.
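
To check whether that situation occurs in your data, a two-pass awk can list every line whose key appears more than once. This is a sketch only, assuming a made-up `sample.csv`; the rows and file names are hypothetical:

```shell
# Hypothetical sample: r1 and r2 share columns 2 and 4 but differ in column 5
cat > sample.csv <<'EOF'
r1,miR-A,x,GENE1,assay1
r2,miR-A,x,GENE1,assay2
r3,miR-B,x,GENE2,assay1
EOF

# Pass 1 (NR == FNR) counts lines per (column 2, column 4) key.
# Pass 2 prints every line whose key was counted more than once.
awk -F',' 'NR == FNR { n[$2,$4]++; next } n[$2,$4] > 1' sample.csv sample.csv > repeated_keys.txt
```

The file is read twice, so this should work with plain files but not with piped input. Exact duplicate lines are also listed, since the check only looks at the key.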
