AWK for replacing the content of a column in a csv file with the content of another column from another csv WITHOUT USING THE COLUMN NUMBER
1
0
Entering edit mode
3 months ago

Hello all,

I am at a loss here.

I am making a bash script as generic as possible and I want to replace one column in a csv file (test1.csv) with a column from another csv file (test4b.csv).

I know how to do this using the number of column in awk, however, I am trying to generate a very generic script as these files may present a variable number of fields while the header of each field will stay unvaried. I want to do this using the header of the field.

I have asked on forums and all I got was partial replies (with no explaination) and rude comments telling me to LEARN first and ask question after (?).

Well, I am learning... I am a biologist with no background in coding and I have not found any documentation to help me with this.

Putting together some suggestions from forums, this is what I have so far but it is not work.

awk -v col="index2" -v rc="2xedni" '
BEGIN{
  FS=OFS=","
  }
NR==1{
  for(i=1;i<=NF;i+=1){
    columns[$i]=i
  };
  print
}
NR>1{
  $(columns[col])=rc;
  print
}
NR>1' test1.csv test4b.csv > test5.csv
cat test1.csv
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
1,O_bru,1,A01,UDI_55,GACAGTAA,UDI_55,GTAATCGC,,
2,A_lych,1,B01,UDI_56,CCTTCGCA,UDI_56,AAGTGATG,,
3,A_litu,1,C01,UDI_57,CATGATCG,UDI_57,GGAGAGTA,,
cat test4b.csv 
GCGATTAC
CATCACTT
TACTCTCC

The desired output is

cat test5.csv 
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
1,O_bru,1,A01,UDI_55,GACAGTAA,UDI_55,GCGATTAC,,
2,A_lych,1,B01,UDI_56,CCTTCGCA,UDI_56,CATCACTT,,
3,A_litu,1,C01,UDI_57,CATGATCG,UDI_57,TACTCTCC,,

The output I get with the line of code above

Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
1,O_bru,1,A01,UDI_55,GACAGTAA,UDI_55,1,,
2,A_lych,1,B01,UDI_56,CCTTCGCA,UDI_56,2,,
3,A_litu,1,C01,UDI_57,CATGATCG,UDI_57,3,,
4,A_trag,1,D01,UDI_58,TCCTTGGT,UDI_58,4,,
5,C_trap,1,E01,UDI_59,GTCATCTA,UDI_59,5,,
6,P_pen,1,F01,UDI_60,GAACCTAG,UDI_60,6,,
7,A_aura,1,G01,UDI_61,CAGCAAGG,UDI_61,7,,
8,O_goth,1,H01,UDI_62,CGTTACCA,UDI_62,8,,
9,C_pasi,1,A02,UDI_63,TCCAGCAA,UDI_63,9,,
10,A_hera,1,B02,UDI_64,CAGGAGCC,UDI_64,10,,
11,I_dimi,1,C02,UDI_65,TTACGCAC,UDI_65,11,,
12,I_fusc,1,D02,UDI_66,AGGTTATC,UDI_66,12,,
13,M_tith,1,E02,UDI_67,TCGCCTTG,UDI_67,13,,
14,H_post,1,F02,UDI_68,CCAGAGCT,UDI_68,14,,
15,D_punc,1,G02,UDI_69,TACTTAGC,UDI_69,15,,
16,D_nigr,1,H02,UDI_70,GTCTGATG,UDI_70,16,,
17,E_abbr,1,A03,UDI_71,TCTCGGTC,UDI_71,17,,
18,P_scab,1,B03,UDI_72,AAGACACT,UDI_72,18,,
19,L_rube,1,C03,UDI_73,CTACCAGG,UDI_73,19,,
20,IsMShag_M,1,D03,UDI_74,ACTGTATC,UDI_74,20,,
21,IsMShag_F,1,E03,UDI_75,CTGTGGCG,UDI_75,21,,
22,Laura_Hayes,1,F03,UDI_76,TGTAATCA,UDI_76,22,,
23,Owen_Trimming,1,G03,UDI_77,TTATATCT,UDI_77,23,,
GCGATTAC,,,,,,,GCGATTAC
CATCACTT,,,,,,,CATCACTT
TACTCTCC,,,,,,,TACTCTCC

I am aware that there are multiple issues here. One of them could possibly be the formatting of test4b.csv? Is it a valid csv?

I could really do with some help here. Suggestions/help/examples are welcome.

bash awk csv • 508 views
ADD COMMENT
1
Entering edit mode

Looks like you are trying to reverse complement the index sequences for use with bcl2fastq.

Have you considered using bcl-convert instead. It is smarter than bcl2fastq (i.e. it will automatically do RC part depending on the sequencer being used) and faster. It is also the "new" (not so much anymore) software that Illumina will keep supporting.

ADD REPLY
0
Entering edit mode

Hi GenoMax Yes, I am aware of tools/ways to get the RC done easily, but I am learning bash scripting and this was an easy enough challenge for me to get familiar with awk etc.

Thanks.

ADD REPLY
1
Entering edit mode
3 months ago

using a mix of paste and awk

paste -d,  <(echo && cat test4db.csv ) test1.csv | awk -F, -vC=index2 '(NR==1){OFS=",";j=-1;for(i=1;i<=NF;i++) if($i==C) j=i;print;next;} {$j=$1;print}' | cut -d, -f 2-

Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
1,O_bru,1,A01,UDI_55,GACAGTAA,UDI_55,GCGATTAC,,
2,A_lych,1,B01,UDI_56,CCTTCGCA,UDI_56,CATCACTT,,
3,A_litu,1,C01,UDI_57,CATGATCG,UDI_57,TACTCTCC,,
  • echo && cat test4db.csv : add a fake header to test4db
  • paste -d, : use paste on both 'files' using comma as delimited
  • (NR==1){OFS=",";j=-1;for(i=1;i<=NF;i++) if($i==C) j=i;print;next;} first line: search the index j of the variable 'C'
  • {$j=$1;print} replace the column j with the content of the first column
  • cut -d, -f 2- : remove first column
ADD COMMENT
0
Entering edit mode

Hi Pierre, this worked brilliantly! Thank you so much!

I have a few questions: Are you calling the first column of test4db.csv with i=1 ? So in this case, and other cases (in my script) this would work all the time, because test4db.csv is the output of awk extracting one column from a file so, there will always be just one column in test4db. However, if there were more than one column in test4db.csv, do you think it is feasible?

I would set a variable that captures the column in test4db.csv and use it in the line of code you wrote as an additional input at the beginning. I have not had time to work on it yet or test it, though.

Thanks again!

ADD REPLY

Login before adding your answer.

Traffic: 1842 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6