4.2 years ago
marit.hetland
Hi, I have output from snp-dists (https://github.com/tseemann/snp-dists) in molten format, e.g.:
seq1 seq2 1
seq1 seq3 2
seq2 seq1 1
seq2 seq3 3
seq3 seq1 2
seq3 seq2 3
The third column gives the number of SNPs between the pair of sequences in columns 1 and 2. As you can see, every value appears twice, because the output lists both orderings of each pair (seq1 seq2 and seq2 seq1). How can I remove the duplicated pairs, preferably in R or bash?
Let's do code golf with benchmarks. While we are at it, here is my Python version:
Benchmark: a file with 1 million entries (file size 1.7 MB)
The Python code above took 0.1 seconds and used about 18 MB of RAM.
The awk version took 0.3 seconds and used about 14 MB of RAM.
The first version of the R code took 0.5 seconds and used about 400 MB of RAM.
The simpler R code took 3 seconds and used about 400 MB of RAM.
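For anyone landing here later, this is a minimal sketch of one way to drop the mirrored pairs from the molten table in the question. It is not the code that was benchmarked in this thread, and the function name `dedup_pairs` is made up for illustration. The idea is to sort the two sequence names so that seq1 seq2 and seq2 seq1 map to the same key, then keep only the first row seen for each key. It assumes whitespace-separated columns.

```python
def dedup_pairs(lines):
    """Yield the first row seen for each unordered pair of sequence names.

    Hypothetical helper for illustration; assumes whitespace-separated
    columns: name_a, name_b, distance.
    """
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        # Sorting the two names makes (a, b) and (b, a) share one key.
        key = tuple(sorted(fields[:2]))
        if key in seen:
            continue
        seen.add(key)
        yield line.rstrip("\n")


# Example data copied from the question.
example = [
    "seq1 seq2 1",
    "seq1 seq3 2",
    "seq2 seq1 1",
    "seq2 seq3 3",
    "seq3 seq1 2",
    "seq3 seq2 3",
]

for row in dedup_pairs(example):
    print(row)
```

On the example from the question this keeps seq1 seq2, seq1 seq3 and seq2 seq3. To run it on a real file you could pass an open file handle instead of the list. Note that the set holds one key per unique pair, so memory grows with the number of pairs.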