It's fast compared to a sort-based approach, but one should be careful when building a hash table from a very large dataset: be sure the computer has enough memory to hold the intermediate hash table.
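As a minimal sketch of that hash-based approach (the filenames are only placeholders), awk can drop duplicate lines in a single pass while preserving input order:

# Keep the first occurrence of each whole line; the array "a" is the
# in-memory hash table, so memory use grows with the number of
# distinct lines, not the file size.
awk '!a[$0]++' input.tsv > unique.tsv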
Another option with very large datasets is to sort the input, since it is easy to remove duplicates from a sorted list. For non-BED files, one could set LC_ALL=C and use sort | uniq or, better, sort -u to get unique lines.
Sorting takes time, but it usually needs far less memory. Setting LC_ALL=C treats the input as if it contains only single-byte characters, which speeds up sorting considerably. This will almost always work for genomic data, which rarely contains the two- or four-byte characters found in extended Unicode. Processing multibyte characters requires more resources and is slower; telling the computer to assume single-byte input avoids that overhead.
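For example (filenames are placeholders), the byte-wise locale and sort -u give the sorted, de-duplicated output in one command:

# LC_ALL=C forces byte-wise comparisons, which avoids slower
# locale-aware collation; -u keeps one copy of each repeated line.
LC_ALL=C sort -u input.tsv > unique.tsv

# Equivalent, but a little slower because uniq runs as a second pass:
LC_ALL=C sort input.tsv | uniq > unique.tsv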
If you're sorting BED files (like your sample TSV file, minus the header line), one could use a sort-bed - | uniq approach. The sort-bed tool uses some tricks to be faster than GNU sort at sorting BED files.
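A sketch of that approach, assuming sort-bed from the BEDOPS suite is installed and that the first line of MyData.tsv is a header to skip:

# Drop the header, sort the records in BED order, and collapse
# adjacent duplicate lines.
tail -n +2 MyData.tsv | sort-bed - | uniq > unique.bed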
awk '!a[$1$2$3]++'
is not OK for this data; some special joining symbols are needed. Kevin, it would be great if you could explain this to us.
Hi Vijay, sorry that I did not give any sample data.
If we have data in MyData.tsv, then

awk '!a[$1]++' MyData.tsv

(using column #1 as the key) keeps the first row seen for each distinct value in column 1, while

awk '!a[$1$2$3$4$5]++' MyData.tsv

(using all columns as the key) keeps only the first occurrence of each distinct row. It is mainly useful for very large datasets of any type, when you want to remove duplicate rows.
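On the joining-symbol point: concatenating fields directly (as in $1$2$3) can make different rows collide, because "chr1" + "23" + "4" and "chr1" + "2" + "34" both produce the key "chr1234". One way to avoid that is to put a separator between the fields; a sketch, with the column numbers purely illustrative:

# The comma syntax joins subscripts with awk's built-in SUBSEP
# character ("\034" by default), so distinct field combinations
# map to distinct keys.
awk '!a[$1,$2,$3]++' MyData.tsv

# The same thing written out explicitly:
awk '!a[$1 SUBSEP $2 SUBSEP $3]++' MyData.tsv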
This is a neat trick, thank you!