Question

Loop through comma separated values and find match from another file

2.3 years ago

L_bioinfo • 0

Two files: both tab-delimited

file 1:

gene1    ID2
gene2    ID47
gene3    ID58

file 2:

rd1    homolog_to=ID2, TH23
rd2    homolog_to=ID3, ID44, TH33, ID47

Desired output:

rd1    homolog_to=ID2, TH23    gene1
rd2    homolog_to=ID3, ID44, TH33, ID47     gene2

For each ID in column 2 of file 1, I want to search the comma-separated list in column 2 of file 2. If a match is found, column 1 of file 1 (the gene name) should be appended to the matching line of file 2.

code:

BEGIN { FS=OFS="\t" }
NR==FNR {
    a[$2]=$1
    next
}
{
    var = "N/A_Gene"
    split($10,vals,/[=,]/)
    for ( i=2; i in values; i++ ) {
        val = values[i]
        if ( val in a ) {
            var = val[$n]
            break
        }
    }
    print $0, var
}

My code has some glitches. Please provide pointers on this
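As pointers on the posted script, here is a minimal sketch of a corrected single-pass version. It assumes the comma-separated list is in column 2, as in the sample (the posted code reads `$10`), and the file names `file1.tsv`, `file2.tsv`, and `annotated.tsv` are placeholders. Three things change: the array filled by `split` (`vals`) is the one the loop reads (the post loops over `values`), the gene is looked up as `a[val]` rather than `val[$n]`, and the separator also swallows the space after each comma so the IDs match the keys from file 1.

```shell
# Recreate the sample inputs from the question (tab-delimited).
printf 'gene1\tID2\ngene2\tID47\ngene3\tID58\n' > file1.tsv
printf 'rd1\thomolog_to=ID2, TH23\nrd2\thomolog_to=ID3, ID44, TH33, ID47\n' > file2.tsv

awk '
BEGIN { FS = OFS = "\t" }
NR == FNR {                 # first file: map ID -> gene name
    a[$2] = $1
    next
}
{
    var = "N/A_Gene"                   # default when no ID matches
    n = split($2, vals, /[=,][ ]*/)    # vals[1] is "homolog_to"; IDs start at 2
    for (i = 2; i <= n; i++) {
        if (vals[i] in a) {
            var = a[vals[i]]           # report the gene name, not the ID
            break                      # keep only the first match
        }
    }
    print $0, var
}' file1.tsv file2.tsv > annotated.tsv

cat annotated.tsv
```

Because of the `break`, a line whose list contains several known IDs gets only the first matching gene; drop the `break` and collect the matches if all of them are needed.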

awk • 1.6k views

updated 2.3 years ago by Pierre Lindenbaum • written 2.3 years ago by L_bioinfo

score 2 · Answer 1 · 2023-05-08

2.3 years ago

Pierre Lindenbaum 166k

awk -F '\t' '{N=split($2,a,/[=, ]*/);for(i=2;i<=N;i++) printf("%s\t%s\n",a[i],$0);}' input2.tsv  | sort -t $'\t' -k1,1 | join -t $'\t' -1 1 -2 2 - <(sort -t $'\t' -k2,2 input1.tsv) | cut -f 2-

rd1 homolog_to=ID2, TH23    gene1
rd2 homolog_to=ID3, ID44, TH33, ID47    gene2

Thank you, the code worked perfectly. Could you please explain the part from the printf statement as I'm fairly new to this?

2.3 years ago by L_bioinfo

chatGPT:

Sure, here's an explanation of the bash command:

awk -F '\t' '{N=split($2,a,/[=, ]*/);for(i=2;i<=N;i++) printf("%s\t%s\n",a[i],$0);}' input2.tsv

awk: calls the awk utility to process the input file(s)
-F '\t': specifies the delimiter used in the input file as a tab character
'{N=split($2,a,/[=, ]*/);for(i=2;i<=N;i++) printf("%s\t%s\n",a[i],$0);}': is the awk script that does the following:
- N=split($2,a,/[=, ]*/): splits the second field of each input line on runs of the characters =, comma, or space, and stores the resulting pieces in the a array. N is set to the number of pieces returned by split. For the sample data, a[1] is the label homolog_to and the IDs start at a[2].
- for(i=2;i<=N;i++) printf("%s\t%s\n",a[i],$0);: loops through the a array starting at index 2 (skipping the homolog_to label) and prints each ID followed by a tab and the entire input line ($0), so every ID gets its own line with the ID as a new first column.
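To see what this awk stage produces on its own, the sketch below runs it against the sample file 2 from the question. The output name `exploded.tsv` is a placeholder, and the separator is written as `[=, ]+` rather than the answer's `[=, ]*`; that substitution is an assumption made here so the pattern can never match an empty string, whichever awk implementation is installed.

```shell
# Sample file 2 from the question (tab-delimited).
printf 'rd1\thomolog_to=ID2, TH23\nrd2\thomolog_to=ID3, ID44, TH33, ID47\n' > input2.tsv

# a[1] holds the literal "homolog_to", so the loop starts at index 2.
awk -F '\t' '{
    n = split($2, a, /[=, ]+/)          # pieces: homolog_to, ID..., ID...
    for (i = 2; i <= n; i++)
        printf("%s\t%s\n", a[i], $0)    # ID, tab, then the untouched line
}' input2.tsv > exploded.tsv

cat exploded.tsv
```

Each candidate ID ends up on its own line, so rd1 appears twice and rd2 four times. That one-ID-per-line layout is what lets `sort` and `join` match on the first column afterwards.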

| sort -t $'\t' -k1,1 | join -t $'\t' -1 1 -2 2 - <(sort -t $'\t' -k2,2 input1.tsv) | cut -f 2-

|: pipes the output of the previous command to the next command
sort -t $'\t' -k1,1: sorts the output of the previous command based on the first field (which is assumed to be separated by a tab character)
join -t $'\t' -1 1 -2 2 - <(sort -t $'\t' -k2,2 input1.tsv): joins the sorted output with input1.tsv, matching the first field of the sorted output against the second field of input1.tsv. The -t $'\t' option sets the tab character as the field delimiter, -1 1 selects the first field of the first input as the join key, and -2 2 selects the second field of the second input. The - operand means the first input is read from standard input (the output of the previous command). The second input, <(sort -t $'\t' -k2,2 input1.tsv), is a bash process substitution that supplies input1.tsv sorted on its second field, because join expects both inputs to be sorted on their join fields.
cut -f 2-: selects only the second field and onwards from the joined output, discarding the first field (which is the join key).

Overall, the command takes two tab-separated value (TSV) files as input (input1.tsv and input2.tsv). It splits the second field of input2.tsv on the delimiters and outputs each resulting ID, followed by the entire original line, as a separate record. These records are sorted by the first field (the ID), joined with input1.tsv by matching that first field against the second field of input1.tsv, and finally cut drops the join key so that only the original line plus the gene name remains.
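The same pipeline can be sketched with intermediate files so that each stage can be inspected separately. This is a variant under stated assumptions, not the answer's exact command: the file names `ids_sorted.tsv`, `genes_sorted.tsv`, and `joined.tsv` are placeholders, `tab=$(printf '\t')` stands in for bash's `$'\t'`, temporary files replace the process substitution, the split pattern uses `+` instead of `*`, and `LC_ALL=C` is set so that `sort` and `join` agree on ordering.

```shell
tab=$(printf '\t')   # portable stand-in for bash's $'\t'

# Sample inputs from the question (tab-delimited).
printf 'gene1\tID2\ngene2\tID47\ngene3\tID58\n' > input1.tsv
printf 'rd1\thomolog_to=ID2, TH23\nrd2\thomolog_to=ID3, ID44, TH33, ID47\n' > input2.tsv

# Stage 1: one line per candidate ID, with the ID as a new first column,
# sorted on that column.
awk -F '\t' '{
    n = split($2, a, /[=, ]+/)
    for (i = 2; i <= n; i++) printf("%s\t%s\n", a[i], $0)
}' input2.tsv | LC_ALL=C sort -t "$tab" -k1,1 > ids_sorted.tsv

# Stage 2: input1.tsv sorted on its ID column (field 2).
LC_ALL=C sort -t "$tab" -k2,2 input1.tsv > genes_sorted.tsv

# Stage 3: join field 1 of the first file with field 2 of the second,
# then drop the join key, which join prints as the first column.
LC_ALL=C join -t "$tab" -1 1 -2 2 ids_sorted.tsv genes_sorted.tsv |
    cut -f 2- > joined.tsv

cat joined.tsv
```

Unlike a lookup that stops at the first hit, the join emits one output line per matching ID, so a line of input2.tsv that lists two known IDs would appear twice, and lines with no known ID are dropped.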

2.3 years ago by Pierre Lindenbaum