Question

Compare two columns of same file as paired rows

0

Entering edit mode

7.2 years ago

waqasnayab ▴ 250

Hi,

Dear Community,

I have a file like:

CHR POS AACHANGE    EFFECT
chr1    11779914    R88W    _
chr1    11779915    R88Q    _
chr1    19694500    R478    H
chr1    19694501    R478H   H
chr1    23875429    Q63     R
chr1    23875430    Q63R    R
chr1    84574386    R10     L
chr1    84574387    R10L    L
chr1    89186388    E551G   R
chr1    89186389    E551K   R
chr19   57506385    A97E    E
chr19   57506386    A97     E

As you can see, each row is paired on the basis of its chromosomal position. First and second entry (11779914 and 11779915), third and fourth entry (19694500 and 19694501) and so on...,,,,!!!! I want to compare column three and four. If any of the entry in column four matches with any of the entry in column in three, than remove both paired lines. So, I want output something like this:

    CHR POS AACHANGE    EFFECT
    chr1    11779914    R88W    _
    chr1    11779915    R88Q    _
    chr1    89186388    E551G   R
    chr1    89186389    E551K   R

Since entry three and four of column four has H that matched with the column three (entry four of column three - R478H), both lines are removed (entry three and four).

Any command line solution, or python etc is appreciated. Mostly comparing columns of two different files are available but of the same file paired rows, I failed to find my solution.

Thanks,

Waqas.

genome snp next-gen • 3.5k views

ADD COMMENT • link updated 7.2 years ago by Kevin Blighe 88k • written 7.2 years ago by waqasnayab ▴ 250

score 3 · Accepted Answer · 2017-09-07

I assume that the match between column 4 and 3 is only a partial (and not exact) match?, i.e., you are searching for, in column 3, the amino acid residue defined in column 4?

This looks like a reasonably easy job in R Programming Language but here are 2 BASH commands that will also work. It is made a bit more difficult by the fact that the matching pairs are not always in the same order (e.g., sometimes the match between 3 and 4 is in the second pair, whilst in others it is the first in the pair).

awk 'index($3, $4) {print}' test.tsv | cut -f3 | sed s'/.$//' > matches.list

This first records the ones that do match and saves the value in column 3 (less the final matching amino acid residue).

awk '!index($3, $4) { print }' test.tsv | grep -v -f matches.list

This then searches for the ones that do not match, and then also eliminates any value recorded in matches.list

The assumptions that I make: - your data is tab-delimited and is contained in test.tsv - the matching amino acid residue is always on the right-hand-side of the amino-acid change defined in column 3

Kevin