18 months ago
L_bioinfo
Two files: both tab-delimited
file 1:
gene1 ID2
gene2 ID47
gene3 ID58
file 2
rd1 homolog_to=ID2, TH23
rd2 homolog_to=ID3, ID44, TH33, ID47
Output
rd1 homolog_to=ID2, TH23 gene1
rd2 homolog_to=ID3, ID44, TH33, ID47 gene2
For each line of file 2, I want to check every value in its comma-separated column 2 against column 2 of file 1. If a match is found, I want the corresponding column 1 of file 1 to be reported.
code:
BEGIN { FS=OFS="\t" }
NR==FNR {
a[$2]=$1
next
}
{
var = "N/A_Gene"
split($10,vals,/[=,]/)
for ( i=2; i in values; i++ ) {
val = values[i]
if ( val in a ) {
var = val[$n]
break
}
}
print $0, var
}
My code has some glitches. Please provide pointers on this.
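For reference, here is a minimal corrected sketch of the script above, run on the sample data from the question. It assumes the list of IDs really is in column 2 of file 2 (the posted code reads $10), uses one array name throughout (the posted code splits into vals but loops over values), looks the gene up with a[...] instead of val[$n], and also swallows the spaces after each comma when splitting. The file names and the scratch directory are made up for illustration.

```shell
# Sample inputs from the question (tab-delimited), written to a scratch directory.
tmp=$(mktemp -d)
printf 'gene1\tID2\ngene2\tID47\ngene3\tID58\n' > "$tmp/file1.tsv"
printf 'rd1\thomolog_to=ID2, TH23\nrd2\thomolog_to=ID3, ID44, TH33, ID47\n' > "$tmp/file2.tsv"

result=$(awk '
BEGIN { FS = OFS = "\t" }
NR == FNR {                # first file: map ID -> gene name
    a[$2] = $1
    next
}
{
    var = "N/A_Gene"       # default when no ID matches
    # Split on "=" or "," plus any following spaces; vals[1] is "homolog_to".
    n = split($2, vals, /[=,] */)
    for (i = 2; i <= n; i++) {
        if (vals[i] in a) {
            var = a[vals[i]]   # gene name for the first matching ID
            break
        }
    }
    print $0, var
}' "$tmp/file1.tsv" "$tmp/file2.tsv")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Like the posted code, this reports only the first matching ID per line, and lines with no match get N/A_Gene rather than being dropped.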
Thank you, the code worked perfectly. Could you please explain the part from the printf statement as I'm fairly new to this?
chatGPT:
Sure, here's an explanation of the bash command:
awk -F '\t' '{N=split($2,a,/[=, ]*/);for(i=2;i<=N;i++) printf("%s\t%s\n",a[i],$0);}' input2.tsv
awk: calls the awk utility to process the input file(s).
-F '\t': specifies the tab character as the field delimiter of the input file.
'{N=split($2,a,/[=, ]*/);for(i=2;i<=N;i++) printf("%s\t%s\n",a[i],$0);}': is the awk script, which does the following:
  N=split($2,a,/[=, ]*/): splits the second field of each input line, treating =, comma, and space characters as delimiters, and stores the resulting pieces in the array a. The variable N is set to the number of pieces returned by split.
  for(i=2;i<=N;i++) printf("%s\t%s\n",a[i],$0);: loops through the array a starting at index 2 (index 1 holds the homolog_to label) and prints each element followed by a tab and the entire input line ($0). Each ID therefore gets its own output line, with the ID as the first field.
| sort -t $'\t' -k1,1 | join -t $'\t' -1 1 -2 2 - <(sort -t $'\t' -k2,2 input1.tsv) | cut -f 2-
|: pipes the output of the previous command to the next command.
sort -t $'\t' -k1,1: sorts the output of the previous command on its first field (the ID), using the tab character as the field separator.
join -t $'\t' -1 1 -2 2 - <(sort -t $'\t' -k2,2 input1.tsv): joins the sorted output with input1.tsv, which the process substitution <(...) first sorts on its second field. The -t $'\t' option specifies that the tab character is the field delimiter, -1 1 specifies that the first field of the first input (the sorted output) is the join key, and -2 2 specifies that the second field of the second input (input1.tsv) is the join key. The - argument means that the first input is read from standard input (the output of the previous command).
cut -f 2-: selects the second field onwards from the joined output, discarding the first field (the join key).
Overall, the command takes two tab-separated value (TSV) files as input (input1.tsv and input2.tsv). It splits the second field of input2.tsv into individual IDs and outputs each ID together with the entire line as a separate record. The resulting records are sorted by the first field, joined with input1.tsv by matching that first field against the second field of input1.tsv, and finally printed from the second field onwards.
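To see what each stage produces, the pipeline can be run step by step on the sample data from the question. This is a sketch rather than the exact command from the answer: it uses a tab variable and a temporary sorted file instead of the bash-only `$'\t'` and `<(...)`, writes the separator as `[=, ]+` so it can never match an empty string, and forces `LC_ALL=C` so that sort and join agree on ordering. The scratch directory and variable names are made up for illustration.

```shell
tmp=$(mktemp -d)
tab=$(printf '\t')
printf 'gene1\tID2\ngene2\tID47\ngene3\tID58\n' > "$tmp/input1.tsv"
printf 'rd1\thomolog_to=ID2, TH23\nrd2\thomolog_to=ID3, ID44, TH33, ID47\n' > "$tmp/input2.tsv"

# Stage 1: emit one line per ID, with the ID prepended as the join key.
exploded=$(awk -F '\t' '{
    n = split($2, ids, /[=, ]+/)          # ids[1] is "homolog_to"
    for (i = 2; i <= n; i++) printf("%s\t%s\n", ids[i], $0)
}' "$tmp/input2.tsv")
printf '%s\n' "$exploded"

# Stage 2: sort both inputs on their join field, join them, drop the key.
LC_ALL=C sort -t "$tab" -k2,2 "$tmp/input1.tsv" > "$tmp/input1.sorted"
joined=$(printf '%s\n' "$exploded" |
    LC_ALL=C sort -t "$tab" -k1,1 |
    LC_ALL=C join -t "$tab" -1 1 -2 2 - "$tmp/input1.sorted" |
    cut -f 2-)
printf '%s\n' "$joined"

rm -rf "$tmp"
```

Stage 1 prints six lines here, one per ID. Only ID2 and ID47 have a partner in input1.tsv, so the final output has two lines. Note that join silently drops lines without a match, and a line whose IDs match several genes would appear once per match.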