how to intersect two files based on three columns
3
0
Entering edit mode
3.7 years ago
Apex92 ▴ 320

I have two files as below. I want to search for every item in every column in the f1 and see it if exists inside column 3 of the f2 file (for each column of f1 separately). If any exists then write the related chr of that item from the chr column of f2 (the first column of f2).

f1:(two columns)

gene_symbol,lncRNA_symbol
kctd,Malat1
Gas5,Snhg6

f2:(three columns)

chr,gene_type,gene_symbol
chr7,snRNA,Gas5
chr1,protein_coding,Malat1
chr2,TEC,Snhg6
chr1,TEC,kctd

So based on the two files mentioned the desired output should be:

f3:

gene_symbol,lncRNA_symbol,gene_chr,lncRNA_chr
kctd,Malat1,chr1,chr1
Gas5,Snhg6,chr7,chr2

I do not have a strong programming background and found this very complex - Even though I have checked multiple sources but have not been able to develop a solution - any help is appreciated. Thanks

python awk bash • 1.4k views
ADD COMMENT
2
Entering edit mode
3.7 years ago
Zhilong Jia ★ 2.2k

awk 'BEGIN{FS=OFS=","}NR==FNR{gene[$3]=$1; next}{ print ($1, $2, gene[$1], gene[$2] ) }' 2.t 1.t

gene_symbol,lncRNA_symbol,chr,
kctd,Malat1,chr1,chr1
Gas5,Snhg6,chr7,chr2

cat 1.t

gene_symbol,lncRNA_symbol
kctd,Malat1
Gas5,Snhg6

cat 2.t

chr,gene_type,gene_symbol
chr7,snRNA,Gas5
chr1,protein_coding,Malat1
chr2,TEC,Snhg6
chr1,TEC,kctd
ADD COMMENT
1
Entering edit mode
3.7 years ago

This tries to deal with cases where information might be missing:

#!/usr/bin/env python

""" so9463765.py -- https://www.biostars.org/p/9463765/ """

import sys
import os
import pandas as pd

f1 = "f1.txt"
f2 = "f2.txt"

if not os.path.exists(f1) or not os.path.exists(f2):
    raise OSError("File(s) not found")

h = [ "gene_symbol", "lncRNA_symbol", "gene_chr", "lncRNA_chr" ]
s = {}
with open(f1, "r") as f1h, open(f2, "r") as f2h:
    f1i = 0
    for line in f1h:
        if f1i > 0:
            (gene_symbol,lncRNA_symbol) = line.rstrip().split(',')
            s[gene_symbol] = { "v" : [gene_symbol,lncRNA_symbol], "t" : "gene_symbol" }
            s[lncRNA_symbol] = { "v" : [gene_symbol,lncRNA_symbol], "t" : "lncRNA_symbol" }
        f1i += 1
    ol = []
    f2i = 0
    for line in f2h:
        if f2i > 0:
            (chrom, gene_type, gene_symbol) = line.rstrip().split(',')
            o = {x: "" for x in h}
            if gene_symbol in s:
                if s[gene_symbol]["t"] == "gene_symbol":
                    o["gene_symbol"] = gene_symbol
                    o["gene_chr"] = chrom
                    o["lncRNA_symbol"] = s[gene_symbol]["v"][1]
                elif s[gene_symbol]["t"] == "lncRNA_symbol":
                    o["gene_symbol"] = s[gene_symbol]["v"][0]
                    o["lncRNA_symbol"] = gene_symbol
                    o["lncRNA_chr"] = chrom
            ol.append(o)
        f2i += 1
    df = pd.DataFrame(ol)
    # collapse by column values
    of = df.groupby(["gene_symbol", "lncRNA_symbol"], as_index=False).agg("".join)
    sys.stdout.write(of.to_csv(index=False))

Output:

% ./so9463765.py
gene_symbol,lncRNA_symbol,gene_chr,lncRNA_chr
Gas5,Snhg6,chr7,chr2
kctd,Malat1,chr1,chr1

A bit convoluted, but hopefully this gets you somewhere.

ADD COMMENT
0
Entering edit mode
3.7 years ago
5heikki 11k

I'm not sure this is parsing the exact fields you wanted (you set those with the -o in join):

join -1 1 -2 3 -t "," -o 1.1,1.2,2.1 <(sort -t "," -k1,1 f1) <(sort -t "," -k3,3 f2)
ADD COMMENT

Login before adding your answer.

Traffic: 2495 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6