Question

How can I find matches between two csv files according to multiple columns using awk/bash?

0

Entering edit mode

3.3 years ago

Gargantu8 • 0

*Reposted question with no images, removed useless information.

I want to compare File 1 and 2 in two ways:

First, match "subject id" in File 2 with "seqnames" in File 1
Then, see if File 2's s.start-s.end range overlaps or falls within File 1's start and end range.

If such matches occur, I would like to print out columns 2-23 from File 1, and then every column but #2 (subject id) from File 2 to the right of that.

File 1:

strand  seqnames      start  end    width  [18moreColumns]
1       Ha412HOChr01  1000   10000  9000    ...
2       Ha412HOChr01  19000  21000  2000    ...
3       Ha412HOChr08  2000   11000  9000    ...
4       Ha412HOChr17  92500  94400  1900    ...

File 2:

query id  subject id    % identity  alignment  mismatch  gap open  s.start  s.end  [3 more columns]
XP_02299  Ha412HOChr01  100         335        0         0         2800     3200   ...
XP_02356  Ha412HOChr01  99.86       189        2         1         18900    19800  ...
XP_04271  Ha412HOChr03  98.21       431        4         0         2100     2432   ...
XP_05788  Ha412HOChr08  93.29       38         0         0         7600     8210   ...
XP_06354  Ha412HOChr08  97.89       98         0         0         32100    34980  ...

Output:

seqnames      start  end    width  [18moreColumns]  query id  % identity  alignment  mismatch  gap open  s.start  s.end  [3 more columns]
Ha412HOChr01  1000   10000  9000    ...             XP_02299  100         335        0         0         2800     3200   ...
Ha412HOChr01  19000  21000  2000    ...             XP_02356  99.86       189        2         1         18900    19800  ...
Ha412HOChr08  2000   11000  9000    ...             XP_05788  93.29       38         0         0         7600     8210   ...

awk genomics bash • 1.9k views

ADD COMMENT • link updated 3.1 years ago by lakhujanivijay 5.9k • written 3.3 years ago by Gargantu8 • 0

0

Entering edit mode

using sqldf in R:

> sqldf('select * from df1 join df2 on df1.seqnames= df2.subject_id where 
+     df2.s_start between df1.start and df1.end OR
+     df2.s_end between df1.start and df1.end')

  strand     seqnames start   end width query_id   subject_id x_identity alignment mismatch gap_open s_start s_end
1      1 Ha412HOChr01  1000 10000  9000 XP_02299 Ha412HOChr01     100.00       335        0        0    2800  3200
2      2 Ha412HOChr01 19000 21000  2000 XP_02356 Ha412HOChr01      99.86       189        2        1   18900 19800
3      3 Ha412HOChr08  2000 11000  9000 XP_05788 Ha412HOChr08      93.29        38        0        0    7600  8210

using fuzzyjoin in R (default inner_join):

> names(df2)[c(2,7,8)]=c("seqnames","start","end")
> genome_join(df1, df2, by=c("seqnames","start","end"))
  strand   seqnames.x start.x end.x width query_id   seqnames.y x_identity alignment mismatch gap_open start.y end.y
1      1 Ha412HOChr01    1000 10000  9000 XP_02299 Ha412HOChr01     100.00       335        0        0    2800  3200
2      2 Ha412HOChr01   19000 21000  2000 XP_02356 Ha412HOChr01      99.86       189        2        1   18900 19800
3      3 Ha412HOChr08    2000 11000  9000 XP_05788 Ha412HOChr08      93.29        38        0        0    7600  8210

Please do not use special characters in column headers (%, space, dot etc).

ADD REPLY • link 3.3 years ago by cpad0112 21k

0

Entering edit mode

This is purely informatics based question, rather bioinformatics!

When in doubt with tabular data, pandas is best used.

ADD REPLY • link 3.1 years ago by lakhujanivijay 5.9k

score 0 · Answer 1 · 2021-07-31

You can use intersectbed for this purpose. It can work on BAM/BED/GFF/VCF formats. However if you don't have these formats as can be seen by file shared by you, then try to make files in a way that you have IDs to be compared in 1st column and start, end in 2nd and 3rd column and rest information in following columns.