Hello!
I apologize in advance if this question has already been answered.
I have two very large TSV files. Some values in one column of the first file match values in one column of the second file. Based on those matches, I want to build a new TSV file.
1st file
ERZ871266.fasta_contig1 unclassified (taxid 0)
ERZ871266.fasta_contig2 Aeromicrobium choanae (taxid 1736691)
ERZ871266.fasta_contig4 Clostridioides difficile (taxid 1496)
ERZ871266.fasta_contig6 unclassified (taxid 0)
........
2nd file
/home/results/ERZ940766.fasta ERZ871266.fasta_contig1
/home/results/ERZ940766.fasta ERZ871266.fasta_contig2
/home/results/ERZ940766.fasta ERZ871266.fasta_contig3
/home/results/ERZ940766.fasta ERZ871266.fasta_contig4
/home/results/ERZ940766.fasta ERZ871266.fasta_contig5
/home/results/ERZ940766.fasta ERZ871266.fasta_contig6
........
What I want to do is the following:
ERZ871266.fasta_contig1 unclassified (taxid 0) /home/results/ERZ940766.fasta
ERZ871266.fasta_contig2 Aeromicrobium choanae (taxid 1736691) /home/results/ERZ940766.fasta
ERZ871266.fasta_contig4 Clostridioides difficile (taxid 1496) /home/results/ERZ940766.fasta
ERZ871266.fasta_contig6 unclassified (taxid 0) /home/results/ERZ940766.fasta
........
Thank you in advance!
What have you tried? A simple search on Stack Overflow will reveal multiple ways of doing this.
Yes, you are right. I forgot to mention what I did.
I mostly experimented with the join command, but I get an error saying that my files are not sorted, even though I sorted them beforehand with the sort command on the relevant columns. My files also contain no duplicates.
Show us what you did as well as the exact error you face.
join ... <(sort ... file1) <(sort ... file2)
with the appropriate parameters should work.
Initially I tried:
and I got the following error:
Then I tried with the
--nocheck-order
With that I got output, but the file had missing values. For example:
Maybe this is because not all values match?
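A common cause of the "not sorted" complaint is that sort and join run under different collation rules, or that sort is keyed on a different field than the one join uses. Note that --nocheck-order only silences the check; if the input really is out of order for join, matches may still be dropped silently. Below is a minimal sketch, not a definitive fix. It assumes both files are tab-separated, that the contig ID is column 1 of the first file and column 2 of the second, and it uses the hypothetical names file1.tsv and file2.tsv, filled with the sample rows from the question.

```shell
#!/bin/sh
# Sketch only: file names and column positions are assumptions
# based on the sample rows in the question.
set -eu
LC_ALL=C; export LC_ALL      # same byte-wise collation for sort and join
tab=$(printf '\t')           # literal tab, portable across sh and bash

dir=$(mktemp -d)

# Sample rows from the question (file 1: contig ID, classification).
printf '%s\t%s\n' \
  'ERZ871266.fasta_contig1' 'unclassified (taxid 0)' \
  'ERZ871266.fasta_contig2' 'Aeromicrobium choanae (taxid 1736691)' \
  'ERZ871266.fasta_contig4' 'Clostridioides difficile (taxid 1496)' \
  'ERZ871266.fasta_contig6' 'unclassified (taxid 0)' > "$dir/file1.tsv"

# Sample rows from the question (file 2: path, contig ID).
for n in 1 2 3 4 5 6; do
  printf '%s\t%s\n' '/home/results/ERZ940766.fasta' "ERZ871266.fasta_contig$n"
done > "$dir/file2.tsv"

# Sort each file on the exact field that join will use as its key.
sort -t "$tab" -k1,1 "$dir/file1.tsv" > "$dir/file1.sorted"
sort -t "$tab" -k2,2 "$dir/file2.tsv" > "$dir/file2.sorted"

# Join file 1 column 1 with file 2 column 2; print contig, classification, path.
joined=$(join -t "$tab" -1 1 -2 2 -o 1.1,1.2,2.1 \
  "$dir/file1.sorted" "$dir/file2.sorted")

printf '%s\n' "$joined"
rm -rf "$dir"
```

Contigs that appear in only one of the two files (contig3 and contig5 in the sample) are left out by a plain join, so fewer output lines than input lines is expected and does not by itself indicate an error.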
You can use a dplyr join function in R (e.g. left_join()).
Why do you recommend
left_join
when file2 seemingly has more values and, if anything,
inner_join
should be preferred when criteria are unclear?
Edited. It depends on how the OP wants to join the data and which columns to retain.
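The same choice exists with the join command itself. As a rough analogy, a plain join behaves like an inner join, while adding -a 2 also keeps the rows of the second file that have no partner, with -e supplying a placeholder for the missing classification. The sketch below assumes tab-separated input and the hypothetical file names file1.tsv and file2.tsv, filled with the sample rows from the question; the placeholder NA is an arbitrary choice.

```shell
#!/bin/sh
# Sketch only: file names, column positions and the NA placeholder
# are assumptions, not taken from the original thread.
set -eu
LC_ALL=C; export LC_ALL      # same byte-wise collation for sort and join
tab=$(printf '\t')

dir=$(mktemp -d)

# File 1: contig ID, classification (sample rows from the question).
printf '%s\t%s\n' \
  'ERZ871266.fasta_contig1' 'unclassified (taxid 0)' \
  'ERZ871266.fasta_contig2' 'Aeromicrobium choanae (taxid 1736691)' \
  'ERZ871266.fasta_contig4' 'Clostridioides difficile (taxid 1496)' \
  'ERZ871266.fasta_contig6' 'unclassified (taxid 0)' |
  sort -t "$tab" -k1,1 > "$dir/file1.sorted"

# File 2: path, contig ID (sample rows from the question).
for n in 1 2 3 4 5 6; do
  printf '%s\t%s\n' '/home/results/ERZ940766.fasta' "ERZ871266.fasta_contig$n"
done | sort -t "$tab" -k2,2 > "$dir/file2.sorted"

# Inner join: only contigs present in both files.
inner=$(join -t "$tab" -1 1 -2 2 -o 0,1.2,2.1 \
  "$dir/file1.sorted" "$dir/file2.sorted")

# Keep every row of file 2; fill the missing classification with NA.
# In the -o list, 0 stands for the join field itself.
all_file2=$(join -t "$tab" -1 1 -2 2 -a 2 -e NA -o 0,1.2,2.1 \
  "$dir/file1.sorted" "$dir/file2.sorted")

printf 'inner join:\n%s\n\n' "$inner"
printf 'all rows of file 2:\n%s\n' "$all_file2"
rm -rf "$dir"
```

With the sample rows, the inner join gives the four lines shown in the question, while the -a 2 variant also lists contig3 and contig5 with NA in the classification column.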