Hi, I have two large files: a list of SNPs (file1) and its annotation file (file2). Please help me write code for the following analysis. I want to find the rows where the first and second columns match in both files, and print the entire matching row from each file (first and second) on one line.
I tried the following command, but it only prints the rows from the second file. I want to print the matching rows from both files (first and second).
awk -F'|' 'NR==FNR{c[$1,$2]++;next};c[$1,$2] > 0' file1.txt file2.txt >out.txt
For example:
File 1:
chr1 9133639 T CMD
chr2 6134363 C FFP
chr4 6344639 A FFP
File 2:
chr1 9133639 T GI_02334
chr2 6134363 C GI_02338
chr4 6344639 A GI_02365
chr1 7133739 A GI_02339
chr2 5134763 C GI_02389
chr4 4344639 T GI_04365
Expected Output:
chr1 9133639 T CMD chr1 9133639 T GI_02334
chr2 6134363 C FFP chr2 6134363 C GI_02338
chr4 6344639 A FFP chr4 6344639 A GI_02365
How large are the files we are talking about? And does this need to be done in the shell?
My solution to this would be to import the files into R and combine the data with a left_join() by chr, position, and nucleotide. Otherwise you might be able to do what you want with the join command, but it might take some extra work, since join requires that you join on a single field and that both files are sorted by the key column.

Thank you for your reply. There are 1500 records in the SNP file (file1) and 593337 records in the annotation file (file2). It does not have to be a shell script. However, a shell or Python/Perl script would be best. Could you please elaborate on how to use left_join() in R?
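
Since a Python script would also be acceptable, here is a minimal sketch of the join, not a definitive implementation. It assumes the columns are separated by whitespace, as in the example above, and that the first two columns (chromosome and position) form the key. The function name join_on_first_two is just an illustrative name. The small SNP file is indexed in memory, and the large annotation file is read line by line, so the output follows the row order of file2.

```python
import io


def join_on_first_two(snp_lines, annot_lines):
    """Yield 'file1 row + file2 row' for rows whose first two columns match."""
    # Index the small SNP file by (chromosome, position).
    snps = {}
    for line in snp_lines:
        fields = line.split()
        if len(fields) >= 2:
            snps.setdefault((fields[0], fields[1]), []).append(" ".join(fields))

    # Stream the large annotation file and look up each row in the index.
    for line in annot_lines:
        fields = line.split()
        if len(fields) >= 2:
            for snp_row in snps.get((fields[0], fields[1]), []):
                yield snp_row + " " + " ".join(fields)


# Example data copied from the question (in-memory stand-ins for the files).
file1 = """chr1 9133639 T CMD
chr2 6134363 C FFP
chr4 6344639 A FFP
"""
file2 = """chr1 9133639 T GI_02334
chr2 6134363 C GI_02338
chr4 6344639 A GI_02365
chr1 7133739 A GI_02339
chr2 5134763 C GI_02389
chr4 4344639 T GI_04365
"""

result = list(join_on_first_two(io.StringIO(file1), io.StringIO(file2)))
for row in result:
    print(row)
```

With real files, open("file1.txt") and open("file2.txt") can be passed in place of the io.StringIO objects. The example data look whitespace-separated, so split() is called without arguments. If the real files are pipe-delimited, as the -F'|' in the awk attempt suggests, split("|") would be needed instead.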