Question

Merging two files based on matched column and print only the matched lines

1

Entering edit mode

3.2 years ago

munaj86 ▴ 30

Hi,

I have two files and I would like to merge them baed on a matched column which is the first column, but I would like to print lines that matched only in one line.

The first file:

tss log2fc  prim    recu
chr1:11869  0   0   0
chr1:12010  0   0   0
chr1:29570  0.24435624  0.79209 0.940127

The second file:

<PromoterChr>:1000  <PromoterChr>   <PromoterStart> <PromoterEnd>   <GeneID>    <TranscriptID>
chr1:11869  chr1    10869   12869   ENSG00000223972.5   ENST00000456328.2
chr1:12010  chr1    11010   13010   ENSG00000223972.5   ENST00000450305.2
chr1:29570  chr1    28570   30570   ENSG00000227232.5   ENST00000488147.1

The desired output:

<PromoterChr>:1000  <PromoterChr>   <PromoterStart> <PromoterEnd>   <GeneID>    <TranscriptID> tss  log2fc  prim    recu
chr1:11869  chr1    10869   12869   ENSG00000223972.5   ENST00000456328.2  chr1:11869   0   0   0
chr1:12010  chr1    11010   13010   ENSG00000223972.5   ENST00000450305.2 chr1:12010    0   0   0

I've seen different posts where they post join commands that do this but it print everything without discarding the mismatched lines. I want to discard the mismatch line/data and print/merge lines with the matched first column.

Any suggestions?

Thanks,

Linux • 1.4k views

ADD COMMENT • link updated 3.2 years ago by cpad0112 21k • written 3.2 years ago by munaj86 ▴ 30

0

Entering edit mode

posted expected output is incorrect given your joining criteria. There are several tools which do the required job. Refer to tsv-join or csvjoin functions for respective packages.

$ tsv-join -H -f file1.txt -k 1 file2.txt -a "*" -w 1
<PromoterChr>:1000  <PromoterChr>   <PromoterStart> <PromoterEnd>   <GeneID>    <TranscriptID>  tss log2fc  prim    recu
chr1:11869  chr1    10869   12869   ENSG00000223972.5   ENST00000456328.2   chr1:11869  0   0   0
chr1:12010  chr1    11010   13010   ENSG00000223972.5   ENST00000450305.2   chr1:12010  0   0   0
chr1:29570  chr1    28570   30570   ENSG00000227232.5   ENST00000488147.1   chr1:29570  0.24435624  0.79209 0.940127

if you want selected column only, in output, try this::

$ tsv-join -H -f file1.txt -k 1 file2.txt -a log2fc,prim,recu -w 1

ADD REPLY • link 3.2 years ago by cpad0112 21k

0

Entering edit mode

Is this applicable if my files are txt file?

ADD REPLY • link 3.2 years ago by munaj86 ▴ 30

0

Entering edit mode

Yes, if they are exactly in the same format as the data you posted.

ADD REPLY • link 3.2 years ago by cpad0112 21k

score 1 · Answer 1 · 2021-09-27

try csvtk join

$ csvtk join -t b.tsv a.tsv 
<PromoterChr>:1000      <PromoterChr>   <PromoterStart> <PromoterEnd>   <GeneID>        <TranscriptID>  log2fc  prim    recu
chr1:11869      chr1    10869   12869   ENSG00000223972.5       ENST00000456328.2       0       0       0
chr1:12010      chr1    11010   13010   ENSG00000223972.5       ENST00000450305.2       0       0       0
chr1:29570      chr1    28570   30570   ENSG00000227232.5       ENST00000488147.1       0.24435624      0.79209 0.940127

$ csvtk join -t b.tsv a.tsv | csvtk pretty -t
<PromoterChr>:1000   <PromoterChr>   <PromoterStart>   <PromoterEnd>   <GeneID>            <TranscriptID>      log2fc       prim      recu
------------------   -------------   ---------------   -------------   -----------------   -----------------   ----------   -------   --------
chr1:11869           chr1            10869             12869           ENSG00000223972.5   ENST00000456328.2   0            0         0
chr1:12010           chr1            11010             13010           ENSG00000223972.5   ENST00000450305.2   0            0         0
chr1:29570           chr1            28570             30570           ENSG00000227232.5   ENST00000488147.1   0.24435624   0.79209   0.940127