Question

Identify a character and keep the corresponding characters from the bottom two rows

4.7 years ago

waqaskhokhar999 ▴ 160

I have a large tab-delimited file and a part of it is like:

25      M   X   A   A   X   S
25_a    M   K   A   A   R   S
25_b    M   A   A   A   V   S
31      M   A   A   A   V   S
31_a    M   A   A   A   V   S
31_b    M   A   A   A   V   S

I am trying to process three rows at a time: the first row contains a reference sequence (the actual sequence), whereas the next two rows are its variants. I am trying to do two things:

First, in the reference row (25), I want to find every occurrence of a character (X) and keep only the characters at the same positions in the two rows below it (25_a, 25_b), to get something like this:

25      M   X   A   A   X   S
25_a        K           R   
25_b        A           V

Second, if there is no X in the reference row (31), I want to remove its two variant rows (31_a, 31_b), to get something like this:

31      M   A   A   A   V   S

The final output should look like this:

25      M   X   A   A   X   S
25_a        K           R   
25_b        A           V   
31      M   A   A   A   V   S

I have tried sed, which let me remove the data after the X character within the same row, but I am struggling to get the desired output. I have also posted the question here, but it was closed because I was not able to explain it well. Any help will be highly appreciated.

RNA-Seq • 872 views

updated 4.7 years ago by JC 13k • written 4.7 years ago by waqaskhokhar999 ▴ 160

I assume you don't know a programming language?

A python3 solution would be something like this (not tested):

with open("input_file.txt") as f:
    for row in f:
        # rows come in groups of three: a reference row, then its two variants
        row_a, row_b = next(f), next(f)
        print(row.rstrip("\n"))
        # reference characters without the leading ID (the file is tab-delimited)
        ref = row.rstrip("\n").split("\t")[1:]
        if "X" not in ref:
            continue  # no X in the reference: drop both variant rows
        for variant in (row_a, row_b):
            name, *chars = variant.rstrip("\n").split("\t")
            # keep a variant character only where the reference has an X
            kept = [v if c == "X" else "" for c, v in zip(ref, chars)]
            print(name, *kept, sep="\t")

Edit: semantics

4.7 years ago by cschu181 ★ 2.8k

score 1 · Answer 1 · 2020-04-01

Unfortunately, the command line is not always the answer. Matrix processing like this is hard in sed; use a Python or Perl script instead. Here is how I would do it in Perl:

#!/usr/bin/perl

use strict;
use warnings;

my %mm =();
while (<>) {
    chomp;
    my ($id, @val) = split (/\s+/, $_);
    if ($id =~ /^\d+$/) {
        $mm{$id} = "";
        my $keep = 0;
        for (my $i=0; $i<=$#val; $i++) {
            if ($val[$i] eq "X") {
                $mm{$id} .= "1,";
                $keep++;
            }
            else {
                $mm{$id} .= "0,";
            }
        }
        $mm{$id} = "skip" unless ($keep > 0);
        $mm{$id} =~ s/,$//;
        print join "\t", $id, @val;
        print "\n";
    }
    else {
        my $pid = $id;
        $pid =~ s/_\w+//;
        if (defined $mm{$pid}) {
            next if ($mm{$pid} eq "skip");
            print "$id";
            my @pos = split(/,/, $mm{$pid});
            for (my $i=0; $i<=$#val; $i++) {
                if ($pos[$i] == 1) {
                    print "\t$val[$i]";
                }
                else {
                    print "\t";
                }
            }
            print "\n";
        }
    }
}

Using it:

$ perl parse.pl < data.txt
25      M       X       A       A       X       S
25_a            K                       R
25_b            A                       V
31      M       A       A       A       V       S
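
For readers who prefer Python, the same idea (remember where the reference row has an X, then mask the variant rows with it) could be sketched roughly as below. This is not a translation of the Perl script line for line: the function name `filter_variants` and the `marker` parameter are made up for illustration, fields are assumed to be separated by single tabs, and a row is treated as a reference when its ID contains no underscore.

```python
def filter_variants(lines, marker="X"):
    """Yield reference rows unchanged and variant rows masked to the marker columns."""
    masks = {}  # reference ID -> list of booleans, True where the reference has the marker
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        row_id, *values = line.split("\t")
        if "_" not in row_id:
            # reference row: record the marker positions and emit the row as-is
            masks[row_id] = [v == marker for v in values]
            yield line
            continue
        # variant row: look up the mask of its reference (the ID before the underscore)
        mask = masks.get(row_id.split("_", 1)[0])
        if mask is None or not any(mask):
            continue  # unknown reference, or a reference without the marker
        kept = [v if keep else "" for v, keep in zip(values, mask)]
        yield "\t".join([row_id, *kept])


# Small demonstration on the rows from the question
sample = [
    "25\tM\tX\tA\tA\tX\tS\n",
    "25_a\tM\tK\tA\tA\tR\tS\n",
    "25_b\tM\tA\tA\tA\tV\tS\n",
    "31\tM\tA\tA\tA\tV\tS\n",
    "31_a\tM\tA\tA\tA\tV\tS\n",
    "31_b\tM\tA\tA\tA\tV\tS\n",
]
for out in filter_variants(sample):
    print(out)
```

To run it on a real file, the open file object can be passed in place of `sample`, since the function only needs an iterable of lines. Because the mask is looked up by ID, this version does not require exactly two variant rows per reference, only that each reference row appears before its variants.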