Identify a character and keep the corresponding characters from the bottom two rows
4.8 years ago

I have a large tab-delimited file and a part of it is like:

25      M   X   A   A   X   S
25_a    M   K   A   A   R   S
25_b    M   A   A   A   V   S
31      M   A   A   A   V   S
31_a    M   A   A   A   V   S
31_b    M   A   A   A   V   S

I am working with three rows at a time: the first row contains a reference sequence (the actual sequence), and the next two rows are its variants. I am trying to do two things:

First, in the reference row (25), I want to find every X and keep only the characters at those positions in the two rows below it (25_a, 25_b), to get something like this:

25      M   X   A   A   X   S
25_a        K           R   
25_b        A           V

Second, if there is no X in the reference row (31), I want to remove its two variant rows (31_a, 31_b), to get something like this:

31      M   A   A   A   V   S

The final output should look like this:

25      M   X   A   A   X   S
25_a        K           R   
25_b        A           V   
31      M   A   A   A   V   S

I tried the sed command, which let me remove the data after an X within the same row, but I am struggling to get the desired output. I also posted the question here, but it was closed because I was not able to explain it well. Any help will be highly appreciated.

RNA-Seq • 888 views

I assume you don't know a programming language?

A python3 solution would be something like this (not tested):

with open("input_file.txt") as f:
    for row in f:
        # Rows come in groups of three: the reference, then its two variants.
        # next(f) raises StopIteration if the row count is not a multiple of three.
        row_a, row_b = next(f), next(f)
        print(row.rstrip("\n"))
        ref = row.rstrip("\n").split("\t")  # the file is tab-delimited
        if "X" not in ref[1:]:
            continue  # no X in the reference: drop both variant rows
        for variant in (row_a, row_b):
            cells = variant.rstrip("\n").split("\t")
            # keep the ID (column 0) and every cell under an X; blank the rest
            kept = [v if i == 0 or c == "X" else ""
                    for i, (c, v) in enumerate(zip(ref, cells))]
            print("\t".join(kept))

Edit: semantics

4.8 years ago
JC 13k

Unfortunately, the command line is not always the answer. Matrix processing like this is hard in sed; use a Python or Perl script instead. Here is how I would do it in Perl:

#!/usr/bin/perl

use strict;
use warnings;

my %mm =();
while (<>) {
    chomp;
    my ($id, @val) = split (/\s+/, $_);
    if ($id =~ /^\d+$/) {
        $mm{$id} = "";
        my $keep = 0;
        for (my $i=0; $i<=$#val; $i++) {
            if ($val[$i] eq "X") {
                $mm{$id} .= "1,";
                $keep++;
            }
            else {
                $mm{$id} .= "0,";
            }
        }
        $mm{$id} = "skip" unless ($keep > 0);
        $mm{$id} =~ s/,$//;
        print join "\t", $id, @val;
        print "\n";
    }
    else {
        my $pid = $id;
        $pid =~ s/_\w+//;
        if (defined $mm{$pid}) {
            next if ($mm{$pid} eq "skip");
            print "$id";
            my @pos = split(/,/, $mm{$pid});
            for (my $i=0; $i<=$#val; $i++) {
                if ($pos[$i] == 1) {
                    print "\t$val[$i]";
                }
                else {
                    print "\t";
                }
            }
            print "\n";
        }
    }
}

Using it:

$ perl parse.pl < data.txt
25      M       X       A       A       X       S
25_a            K                       R
25_b            A                       V
31      M       A       A       A       V       S
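
For readers who would rather stay in Python, the same idea (remember which columns of each reference row hold an X, then mask the variant rows whose ID starts with that reference ID) can be sketched roughly as follows. This is only a sketch under a few assumptions: the function name mask_variants is made up for illustration, reference IDs are taken to be purely numeric and variant IDs to look like 25_a, and the input is assumed to have no empty cells, since rows are split on any whitespace as in the Perl script.

```python
import re


def mask_variants(lines):
    """Yield reference rows unchanged and variant rows masked to the X columns.

    Hypothetical helper: mirrors the hash-of-masks idea of the Perl script above.
    """
    masks = {}  # reference ID -> list of booleans, or None if the reference has no X
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        row_id, *cells = line.split()
        if re.fullmatch(r"\d+", row_id):
            # Reference row: record where the X characters are, print it as is.
            mask = [cell == "X" for cell in cells]
            masks[row_id] = mask if any(mask) else None
            yield "\t".join([row_id, *cells])
        else:
            # Variant row: look up the mask of its parent (text before the first "_").
            parent = row_id.split("_", 1)[0]
            mask = masks.get(parent)
            if mask is None:
                continue  # parent has no X (or was never seen): drop the row
            kept = [cell if keep else "" for cell, keep in zip(cells, mask)]
            yield "\t".join([row_id, *kept])


# Small in-memory demo using the rows from the question.
sample = [
    "25\tM\tX\tA\tA\tX\tS",
    "25_a\tM\tK\tA\tA\tR\tS",
    "25_b\tM\tA\tA\tA\tV\tS",
    "31\tM\tA\tA\tA\tV\tS",
    "31_a\tM\tA\tA\tA\tV\tS",
    "31_b\tM\tA\tA\tA\tV\tS",
]
for out_line in mask_variants(sample):
    print(out_line)
```

To run it on a real file, the sample list could be replaced by an open file handle, for example mask_variants(open("data.txt")). Because the masks are looked up by ID rather than by position, the variant rows do not have to come in groups of exactly two, but each reference row must appear before its variants.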