Transform duplicate rows into columns.
2
0
Entering edit mode
7.0 years ago
unique379 ▴ 120

Hi all, I would like to make duplicate row lines as columns. So it would be easier to extract repetitive columns as comma separated format.

These are transcript isoforms that means one gene has multiple transcript, which I want in one columns by comma separated.

Input:

1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000034868; Clk1;   protein_coding; Clk1-201;   protein_coding;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000129577; Clk1;   protein_coding; Clk1-204;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000135380; Clk1;   protein_coding; Clk1-206;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000148330; Clk1;   protein_coding; Clk1-210;   nonsense_mediated_decay;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000151338; Clk1;   protein_coding; Clk1-211;   nonsense_mediated_decay;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000156931; Clk1;   protein_coding; Clk1-212;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000186552; Clk1;   protein_coding; Clk1-213;   retained_intron;
6_51483010_51483011 ENSMUSG00000029836; ENSMUSG00000029836; Cbx3;   protein_coding; protein_coding; protein_coding;
6_51483010_51483011 ENSMUSG00000029836; ENSMUST00000114446; Cbx3;   protein_coding; protein_coding; protein_coding;
7_3717137_3717138   7_3717137_3717138   ENSMUST00000078451; ENSMUST00000129493; protein_coding; Pirb-201;   protein_coding;
7_3717137_3717138   ENSMUSG00000058818; ENSMUST00000129493; Pirb;   protein_coding; Pirb-201;   retained_intron;

Desired output:

1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000034868;ENSMUST00000129577;ENSMUST00000135380;ENSMUST00000148330;ENSMUST00000151338;ENSMUST00000156931;ENSMUST00000186552;   Clk1;   protein_coding; Clk1-201;Clk1-204;Clk1-206;Clk1-210;Clk1-211;Clk1-212;Clk1-213; 
6_51483010_51483011 ENSMUSG00000029836; ENSMUSG00000029836;ENSMUST00000114446;  Cbx3;   protein_coding; protein_coding;protein_coding;  protein_coding;

7_3717137_3717138   7_3717137_3717138   ENSMUST00000078451;ENSMUST00000129493;  ENSMUST00000129493; protein_coding; Pirb-201;Pirb-201;  protein_coding;retained_intron;

I tried but this transform but i dont know how to find duplicates for each genes its different for each one basically i need repetitive isoform in one line:

awk 'ORS=NR%7?" ":"\n"'  test.txt

Also:

cut -f 1 test.txt | sort | uniq > gene
for f in `cat gene`; do  out=`cat test.txt | grep "$f" | tr '\n' '\t'`; echo $out; done > out.txt

but output is not something i desired. Please help

annotations RNA-Seq bash R linux • 1.7k views
ADD COMMENT
0
Entering edit mode

I just don't understand why you want to convert file into that format. working on duplicated rows is way much easier than this conversion

ADD REPLY
3
Entering edit mode
7.0 years ago
second_exon ▴ 210

A quick python3 solution:

make sure your input file is tab-separated

out = {}
with open("input.txt") as f:
    for lines in f:
        lines = lines.strip().split("\t")
        if lines[0] not in out:
            out[lines[0]] = [set(), set(), set(), set(), set(), set()]
        for i, vals in enumerate(lines[1:]):
            out[lines[0]][i].add(vals)

with open("outfile.txt", "w") as o:
    for kys, vls in out.items():
        o.write(kys+"\t")
        for vl in vls:
            o.write("|".join(vl)+"\t")
        o.write("\n")
ADD COMMENT
0
Entering edit mode

Excellent Thank you.

ADD REPLY
0
Entering edit mode
7.0 years ago

output (there was an extract space in last two rows. To remove it sed was used. If data is uniform, you can use datamash direct):

$ sed 's/\s\+/ /g' test.txt | datamash -t " " -sg 1 unique 2-6 | sed 's/;,/;/g'

1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000034868;ENSMUST00000129577;ENSMUST00000135380;ENSMUST00000148330;ENSMUST00000151338;ENSMUST00000156931;ENSMUST00000186552; Clk1; protein_coding; Clk1-201;Clk1-204;Clk1-206;Clk1-210;Clk1-211;Clk1-212;Clk1-213;
6_51483010_51483011 ENSMUSG00000029836; ENSMUSG00000029836;ENSMUST00000114446; Cbx3; protein_coding; protein_coding;
7_3717137_3717138 7_3717137_3717138,ENSMUSG00000058818; ENSMUST00000078451;ENSMUST00000129493; ENSMUST00000129493;Pirb; protein_coding; Pirb-201;

input (copy/pasted from OP):

$ cat test.txt 
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000034868; Clk1;   protein_coding; Clk1-201;   protein_coding;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000129577; Clk1;   protein_coding; Clk1-204;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000135380; Clk1;   protein_coding; Clk1-206;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000148330; Clk1;   protein_coding; Clk1-210;   nonsense_mediated_decay;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000151338; Clk1;   protein_coding; Clk1-211;   nonsense_mediated_decay;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000156931; Clk1;   protein_coding; Clk1-212;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000186552; Clk1;   protein_coding; Clk1-213;   retained_intron;
6_51483010_51483011 ENSMUSG00000029836; ENSMUSG00000029836; Cbx3;   protein_coding; protein_coding; protein_coding;
6_51483010_51483011 ENSMUSG00000029836; ENSMUST00000114446; Cbx3;   protein_coding; protein_coding; protein_coding;
7_3717137_3717138   7_3717137_3717138   ENSMUST00000078451; ENSMUST00000129493; protein_coding; Pirb-201;   protein_coding;
7_3717137_3717138   ENSMUSG00000058818; ENSMUST00000129493; Pirb;   protein_coding; Pirb-201;   retained_intron;
ADD COMMENT
0
Entering edit mode

Thank you for introducing datamash, i never knew before. I have to install and check if its work for me. But thank you.

ADD REPLY
0
Entering edit mode

if you are using any one of the main distros (RHEL, centos, ubuntu, mint), datamash is in repos.

ADD REPLY

Login before adding your answer.

Traffic: 2832 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6