Merging two files based on matched column and print only the matched lines
1
1
Entering edit mode
3.2 years ago
munaj86 ▴ 30

Hi,

I have two files and I would like to merge them baed on a matched column which is the first column, but I would like to print lines that matched only in one line.

The first file:

tss log2fc  prim    recu
chr1:11869  0   0   0
chr1:12010  0   0   0
chr1:29570  0.24435624  0.79209 0.940127

The second file:

<PromoterChr>:1000  <PromoterChr>   <PromoterStart> <PromoterEnd>   <GeneID>    <TranscriptID>
chr1:11869  chr1    10869   12869   ENSG00000223972.5   ENST00000456328.2
chr1:12010  chr1    11010   13010   ENSG00000223972.5   ENST00000450305.2
chr1:29570  chr1    28570   30570   ENSG00000227232.5   ENST00000488147.1

The desired output:

<PromoterChr>:1000  <PromoterChr>   <PromoterStart> <PromoterEnd>   <GeneID>    <TranscriptID> tss  log2fc  prim    recu
chr1:11869  chr1    10869   12869   ENSG00000223972.5   ENST00000456328.2  chr1:11869   0   0   0
chr1:12010  chr1    11010   13010   ENSG00000223972.5   ENST00000450305.2 chr1:12010    0   0   0

I've seen different posts where they post join commands that do this but it print everything without discarding the mismatched lines. I want to discard the mismatch line/data and print/merge lines with the matched first column.

Any suggestions?

Thanks,

Linux • 1.4k views
ADD COMMENT
0
Entering edit mode

posted expected output is incorrect given your joining criteria. There are several tools which do the required job. Refer to tsv-join or csvjoin functions for respective packages.

$ tsv-join -H -f file1.txt -k 1 file2.txt -a "*" -w 1
<PromoterChr>:1000  <PromoterChr>   <PromoterStart> <PromoterEnd>   <GeneID>    <TranscriptID>  tss log2fc  prim    recu
chr1:11869  chr1    10869   12869   ENSG00000223972.5   ENST00000456328.2   chr1:11869  0   0   0
chr1:12010  chr1    11010   13010   ENSG00000223972.5   ENST00000450305.2   chr1:12010  0   0   0
chr1:29570  chr1    28570   30570   ENSG00000227232.5   ENST00000488147.1   chr1:29570  0.24435624  0.79209 0.940127

if you want selected column only, in output, try this::

$ tsv-join -H -f file1.txt -k 1 file2.txt -a log2fc,prim,recu -w 1
ADD REPLY
0
Entering edit mode

Is this applicable if my files are txt file?

ADD REPLY
0
Entering edit mode

Yes, if they are exactly in the same format as the data you posted.

ADD REPLY
1
Entering edit mode
3.2 years ago

try csvtk join

$ csvtk join -t b.tsv a.tsv 
<PromoterChr>:1000      <PromoterChr>   <PromoterStart> <PromoterEnd>   <GeneID>        <TranscriptID>  log2fc  prim    recu
chr1:11869      chr1    10869   12869   ENSG00000223972.5       ENST00000456328.2       0       0       0
chr1:12010      chr1    11010   13010   ENSG00000223972.5       ENST00000450305.2       0       0       0
chr1:29570      chr1    28570   30570   ENSG00000227232.5       ENST00000488147.1       0.24435624      0.79209 0.940127

$ csvtk join -t b.tsv a.tsv | csvtk pretty -t
<PromoterChr>:1000   <PromoterChr>   <PromoterStart>   <PromoterEnd>   <GeneID>            <TranscriptID>      log2fc       prim      recu
------------------   -------------   ---------------   -------------   -----------------   -----------------   ----------   -------   --------
chr1:11869           chr1            10869             12869           ENSG00000223972.5   ENST00000456328.2   0            0         0
chr1:12010           chr1            11010             13010           ENSG00000223972.5   ENST00000450305.2   0            0         0
chr1:29570           chr1            28570             30570           ENSG00000227232.5   ENST00000488147.1   0.24435624   0.79209   0.940127
ADD COMMENT

Login before adding your answer.

Traffic: 2880 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6