Entering edit mode
3.3 years ago
Gargantu8
•
0
*Reposted question with no images, removed useless information.
I want to compare File 1 and 2 in two ways:
- First, match "subject id" in File 2 with "seqnames" in File 1
- Then, see if File 2's s.start-s.end range overlaps or falls within File 1's start and end range.
If such matches occur, I would like to print out columns 2-23 from File 1, and then every column but #2 (subject id) from File 2 to the right of that.
File 1:
strand seqnames start end width [18moreColumns]
1 Ha412HOChr01 1000 10000 9000 ...
2 Ha412HOChr01 19000 21000 2000 ...
3 Ha412HOChr08 2000 11000 9000 ...
4 Ha412HOChr17 92500 94400 1900 ...
File 2:
query id subject id % identity alignment mismatch gap open s.start s.end [3 more columns]
XP_02299 Ha412HOChr01 100 335 0 0 2800 3200 ...
XP_02356 Ha412HOChr01 99.86 189 2 1 18900 19800 ...
XP_04271 Ha412HOChr03 98.21 431 4 0 2100 2432 ...
XP_05788 Ha412HOChr08 93.29 38 0 0 7600 8210 ...
XP_06354 Ha412HOChr08 97.89 98 0 0 32100 34980 ...
Output:
seqnames start end width [18moreColumns] query id % identity alignment mismatch gap open s.start s.end [3 more columns]
Ha412HOChr01 1000 10000 9000 ... XP_02299 100 335 0 0 2800 3200 ...
Ha412HOChr01 19000 21000 2000 ... XP_02356 99.86 189 2 1 18900 19800 ...
Ha412HOChr08 2000 11000 9000 ... XP_05788 93.29 38 0 0 7600 8210 ...
using sqldf in R:
using fuzzyjoin in R (default inner_join):
Please do not use special characters in column headers (%, space, dot etc).
This is purely informatics based question, rather bioinformatics!
When in doubt with tabular data, pandas is best used.