text file editing
2
0
Entering edit mode
6.5 years ago
Sam ▴ 150

Dear ALL

I have about 200 tab delimited and in first step I want merge them to have a comprehensive file, then want to summarize file to have another out put, so in summarize file just keep unique ID in 1st ,2nd and 3rd columns and for pos1 and pos2 add the number of number of lines with repetitive ID after dash symbol respectively but for other columns convert tab to / in each line and add the data of lines with repetitive ID after dash symbol respectively. I would greatly appreciate it if you kindly give me some feedback on this.

Thanks

annoPos gene  chr     pos1   pos2  ref    alt   genotype    quality MQ  ref-depth   alt-depth             
UTR3    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   82411   82411   G   T   hom 84.5    60  0   4
UTR3    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   82437   82437   A   C   hom 87.5    60  0   4
exonic  Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   84195   84195   C   T   hom 183 60  0   17
exonic  Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   84209   84209   G   C   hom 182 60  0   17
exonic  Eucgr.A00214.1.v2.0 Chr01   84631   84631   C   T   het 118 60  5   7
intronic    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   84718   84718   C   T   het 80  60  7   5
intronic    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   84754   84754   C   T   het 85  60  7   5
intronic    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   89492   89492   C   T   het 78  60  2   4

convert to this :

annoPos gene                            chr pos1    pos2    ref alt genotype    quality MQ  ref-depth   alt-depth 

 UTR3    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0    Chr01   82411-82437         82411-82437         G/T-A/C         hom-hom     0-0 4-4
 exonic  Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0    Chr01   84195-84209-84631   84195-84209-84631   C/T-G/C-C/T hom-hom-het 0-0-5 17-17-7
intronic Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0    Chr01   84718-84754-89492   84718-84754-89492   C/T-C/T-C/T het-het-het 7-7-2 5-5-4
awk bash • 1.6k views
ADD COMMENT
4
Entering edit mode
6.5 years ago

@OP. Could you please clarify on alleles and quality?

UTR3    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   82411   82411   G   T   hom 84.5    60  0   4
UTR3    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   82437   82437   A   C   hom 87.5    60  0   4

For these two lines, expected resultant alleles are G/T-A/C. Is it G/T-A/C or G/A-T/C?

If the alleles are G/T-A/C (as in OP):

$ awk -v OFS="\t" '{print $1,$2,$3,$4,$5,$6"/"$7,$8,$11,$12}' test.txt | datamash -g1 unique 2 unique 3 collapse 4-9   | awk -v OFS="\t" '{gsub(",","-");gsub("-",",",$2)}1' 

AnnoPos gene    chr pos1    pos2    ref/alt genotype    ref-depth   alt-depth
UTR3    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   82411-82437 82411-82437 G/T-A/C hom-hom 0-0 4-4
exonic  Eucgr.A00214.1.v2.0,Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   84195-84209-84631   84195-84209-84631   C/T-G/C-C/T hom-hom-het 0-0-5   17-17-7
intronic    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   84718-84754-89492   84718-84754-89492   C/T-C/T-C/T het-het-het 7-7-2   5-5-4

ps: screenshot in calc is attached here. table2

If the alleles are G/A-T/C, try this:

$ datamash -g1 unique 2 unique 3 collapse 4-8,11,12 < test.txt  | awk -v OFS="\t" '{gsub(",","-"); gsub("-","/",$6); gsub("-","/",$7);gsub("-",",",$2); print $1,$2,$3,$4,$5,$6"-"$7,$8,$9,$10}'

AnnoPos gene    chr pos1    pos2    ref-alt genotype    ref-depth   alt-depth
UTR3    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   82411-82437 82411-82437 G/A-T/C hom-hom 0-0 4-4
exonic  Eucgr.A00214.1.v2.0,Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   84195-84209-84631   84195-84209-84631   C/G/C-T/C/T hom-hom-het 0-0-5   17-17-7
intronic    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   84718-84754-89492   84718-84754-89492   C/C/C-T/T/T het-het-het 7-7-2   5-5-4

ps: Display is not optimal. Took a screenshot of the tsv in calc.

table1

ps2: Datamash is available in most of the linux repos, conda channels and home brew linux

ADD COMMENT
0
Entering edit mode

alleles are G/T-A/C but for this command "awk -v OFS="\t" '{print $1,$2,$3,$4,$5,$6"/"$7,$8,$11,$12}' test.txt | datamash -g1 unique 2 unique 3 collapse 4-9 | awk -v OFS="\t" '{gsub(",","-");gsub("-",",",$2)}1' "

I got error : datamash: -H or --header-in must be used with named columns

ADD REPLY
0
Entering edit mode

@sam: If your data is same as the text posted in OP, it should not give any problems. Try the code on text furnished in OP. This is because no where in furnished code, datamash uses header information. I re ran the code no issues with furnished data in OP and output. Check your data again. Btw, datamash version is 1.3.

ADD REPLY
3
Entering edit mode
6.5 years ago

using awk+sqlite3 , this is not exactly the desired output but I'm lazy, and you get the idea: using the group_by function and group_concat

rm -f tmp.db && tr "-" "_" < input.tsv | awk '/^annoPos/ {split($0,header);printf("create table T(");for(i=1;i<= NF;i++) printf("%s %s TEXT",(i==1?"":","),header[i]); printf(");\n"); next;} {printf("insert into T(");for(i=1;i<= NF;i++) printf("%s%s",(i==1?"":","),header[i]); printf(") values("); for(i=1;i<= NF;i++) printf("%s\"%s\"",(i==1?"":","),$i); printf(");\n");} END{printf("select annoPos,gene,chr,group_concat(pos1),group_concat(pos2),group_concat(ref),group_concat(alt) from T group by annoPos,gene,chr;\n");}'   | sqlite3  -header -separator $'\t'  tmp.db

annoPos gene    chr group_concat(pos1)  group_concat(pos2)  group_concat(ref)   group_concat(alt)
UTR3    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   82411,82437 82411,82437 G,A T,C
exonic  Eucgr.A00214.1.v2.0 Chr01   84631   84631   C   T
exonic  Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   84195,84209 84195,84209 C,G T,C
intronic    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   84718,84754,89492   84718,84754,89492   C,C,C   T,T,T
ADD COMMENT

Login before adding your answer.

Traffic: 1617 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6