Hi all, I would like to make duplicate row lines as columns. So it would be easier to extract repetitive columns as comma separated format.
These are transcript isoforms that means one gene has multiple transcript, which I want in one columns by comma separated.
Input:
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000034868; Clk1; protein_coding; Clk1-201; protein_coding;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000129577; Clk1; protein_coding; Clk1-204; retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000135380; Clk1; protein_coding; Clk1-206; retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000148330; Clk1; protein_coding; Clk1-210; nonsense_mediated_decay;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000151338; Clk1; protein_coding; Clk1-211; nonsense_mediated_decay;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000156931; Clk1; protein_coding; Clk1-212; retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000186552; Clk1; protein_coding; Clk1-213; retained_intron;
6_51483010_51483011 ENSMUSG00000029836; ENSMUSG00000029836; Cbx3; protein_coding; protein_coding; protein_coding;
6_51483010_51483011 ENSMUSG00000029836; ENSMUST00000114446; Cbx3; protein_coding; protein_coding; protein_coding;
7_3717137_3717138 7_3717137_3717138 ENSMUST00000078451; ENSMUST00000129493; protein_coding; Pirb-201; protein_coding;
7_3717137_3717138 ENSMUSG00000058818; ENSMUST00000129493; Pirb; protein_coding; Pirb-201; retained_intron;
Desired output:
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000034868;ENSMUST00000129577;ENSMUST00000135380;ENSMUST00000148330;ENSMUST00000151338;ENSMUST00000156931;ENSMUST00000186552; Clk1; protein_coding; Clk1-201;Clk1-204;Clk1-206;Clk1-210;Clk1-211;Clk1-212;Clk1-213;
6_51483010_51483011 ENSMUSG00000029836; ENSMUSG00000029836;ENSMUST00000114446; Cbx3; protein_coding; protein_coding;protein_coding; protein_coding;
7_3717137_3717138 7_3717137_3717138 ENSMUST00000078451;ENSMUST00000129493; ENSMUST00000129493; protein_coding; Pirb-201;Pirb-201; protein_coding;retained_intron;
I tried but this transform but i dont know how to find duplicates for each genes its different for each one basically i need repetitive isoform in one line:
awk 'ORS=NR%7?" ":"\n"' test.txt
Also:
cut -f 1 test.txt | sort | uniq > gene
for f in `cat gene`; do out=`cat test.txt | grep "$f" | tr '\n' '\t'`; echo $out; done > out.txt
but output is not something i desired. Please help
I just don't understand why you want to convert file into that format. working on duplicated rows is way much easier than this conversion