Question

Print rows only if number matches

1

7.6 years ago

waqasnayab ▴ 250

Hi,

Dear Community,

I have a column like this:

D309
E308
G296
T297A
P415T
P415T
V457I
V457
A214G
A214
T418
I419V
P259
P259L
L191
A190
R478
R478H

. .. ...

In other words, this is column 19 of a very large file. I want only those lines whose number matches the number on the adjacent line; that is, the output should look like this:

P415T
P415T
V457I
V457
A214G
A214
T418
I419V
P259
P259L
R478
R478H

I tried this command:

cut -f19 mycolumnfile.txt | uniq -d

I got this output:

P415T

This is because uniq compares whole lines. I want the rows in which only the number matches.

Thanks,

Waqas.

SNP next-gen sequencing • 1.4k views

ADD COMMENT • link updated 7.6 years ago by Pierre Lindenbaum 164k • written 7.6 years ago by waqasnayab ▴ 250

0

"I want only those lines in which the number matches only with the next line"

This is not clear.

ADD REPLY • link 7.6 years ago by Pierre Lindenbaum 164k

0

For example, if my input is like this:

P415T
P415T
V457I
V457
A214G
A214
T418
I419V

Whatever character comes first, whether P (in the first two lines) or V (in lines three and four), and so on, I want to print the rows in which the number is repeated: 415 is repeated in the first two lines, and 457 in lines three and four. So the output should be like this:

P415T
P415T
V457I
V457
A214G
A214

ADD REPLY • link 7.6 years ago by waqasnayab ▴ 250

0

7.6 years ago

Pierre Lindenbaum 164k

The idea is to use awk to insert a new normalized column into your two files:

$ echo -e "a\tP415T\tb\na\tP415X\tb\na\tP415Y\tb" |\
awk -F '\t' '{key=$2; gsub(/[A-Z]$/,"",key); printf("%s\t%s\n",key,$0);}' |\
sort -t$'\t' -k1,1

P415    a   P415T   b
P415    a   P415X   b
P415    a   P415Y   b

Then sort both files on this column and use join to join them.
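
For comparison, the normalization step can be sketched in Python. Note that the awk key above keeps the leading letter (P415), whereas the question asks to match on the number alone (415). The helper names `letter_number_key` and `number_key` are made up for this illustration and are not part of the answer.

```python
import re

def letter_number_key(mutation):
    """Mimic the awk gsub above: drop one trailing uppercase letter."""
    return re.sub(r"[A-Z]$", "", mutation)

def number_key(mutation):
    """Return the first run of digits as an int, or None if there is none."""
    match = re.search(r"\d+", mutation)
    return int(match.group()) if match else None

# Print both keys side by side for a few IDs taken from the question.
for mutation in ["P415T", "V457I", "V457", "T418", "I419V"]:
    print(mutation, letter_number_key(mutation), number_key(mutation))
```

With the first key, V457I and V457 still share a value, but two IDs with the same number and different leading letters would not; the second key ignores the letters entirely.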

ADD COMMENT • link 7.6 years ago by Pierre Lindenbaum 164k

score 4 · Accepted Answer · 2017-04-21

4

7.6 years ago

guillaume.rbt ★ 1.0k

Hi,

I would do it in Python, with your list of IDs in the file "list" (beware, not carefully tested):

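import re

# Python 3; assumes every line contains at least one number
with open("./list", "r") as f1:
    first = True      # True until the first line of the current run is printed
    last_int = None   # number found on the previous line
    last_line = ""
    for line in f1:
        line = line.rstrip("\n")  # avoid printing a blank line after each row
        # first run of digits on the line, e.g. 415 in "P415T"
        current_int = int(re.findall(r"\d+", line)[0])
        if last_int == current_int:
            if first:
                # start of a run: also print the line before it
                print(last_line)
                first = False
            print(line)
        else:
            first = True
        last_int = current_int
        last_line = line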

ADD COMMENT • link 7.6 years ago by guillaume.rbt ★ 1.0k

0

I checked the result manually against your Python solution, and it works perfectly.

What if I have a multi-column file in which this column is column 19, and I need to do the same task? How do I specify the column number so that the filtering is based on column 19?

Thanks,

Waqas.

ADD REPLY • link 7.6 years ago by waqasnayab ▴ 250

0

Given a tab-delimited file "table" with your list of IDs in column 19:

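import re

# Python 3; assumes column 19 of every line contains at least one number
with open("./table", "r") as f1:
    first = True      # True until the first line of the current run is printed
    last_int = None   # number found in column 19 of the previous line
    last_line = ""
    for line in f1:
        line = line.rstrip("\n")  # avoid printing a blank line after each row
        # column 19 is index 18 after splitting on tabs
        current_int = int(re.findall(r"\d+", line.split("\t")[18])[0])
        if last_int == current_int:
            if first:
                # start of a run: also print the line before it
                print(last_line)
                first = False
            print(line)
        else:
            first = True
        last_int = current_int
        last_line = line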
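
The column number can also be made a parameter instead of being hard-coded. The following is a minimal sketch that uses `itertools.groupby` to collect runs of consecutive lines sharing the same number. The function name `adjacent_matches`, its 1-based `column` argument, and the two-column sample rows are assumptions for illustration, not part of the script above. Lines whose column holds no digits are treated as never matching.

```python
import re
from itertools import groupby

def adjacent_matches(lines, column=19, sep="\t"):
    """Yield lines whose number in `column` (1-based) equals a neighbouring line's."""
    def key(line):
        fields = line.rstrip("\n").split(sep)
        if len(fields) < column:
            return None
        match = re.search(r"\d+", fields[column - 1])
        return int(match.group()) if match else None

    for number, run in groupby(lines, key=key):
        run = list(run)
        # Keep only runs of two or more consecutive lines with a real number.
        if number is not None and len(run) > 1:
            yield from run

# Hypothetical two-column sample; the real table would use column=19.
sample = ["x\tE308\n", "x\tP415T\n", "x\tP415T\n", "x\tV457I\n",
          "x\tV457\n", "x\tT418\n", "x\tI419V\n"]
for line in adjacent_matches(sample, column=2):
    print(line, end="")
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, so a very large table is read one line at a time.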

ADD REPLY • link 7.6 years ago by guillaume.rbt ★ 1.0k

0

Yes, it works fine. If I make any changes to the script, I will get back to you.

ADD REPLY • link 7.6 years ago by waqasnayab ▴ 250