A very fast and simple solution that doesn't appear to require sorting (given your example data) is to use awk to pre-process each line into a BED file, where pseudo-IDs are made from the third and subsequent fields. We can then use BEDOPS to map the union of the two files against itself wherever there are overlaps. The resulting self-mapped elements can be post-processed back into the desired end format.
For instance, we can make one-base BED files from your inputs:
$ awk '{print $1"\t"$2"\t"($2 + 1)"\t"$3"_"$4"_"$5"_"$6}' setA.txt > setA.bed
$ awk '{print $1"\t"$2"\t"($2 + 1)"\t"$3"_"$4"_"$5"_"$6}' setB.txt > setB.bed
Assuming your inputs are already position-ordered (as in your example), these are sorted elements, so we can union them directly:
$ bedops --everything setA.bed setB.bed > unionAB.bed
Then we map the set union against itself, with a bash process substitution to strip the ID column from reference elements for convenience:
$ bedmap --echo --echo-map-id --faster --delim '\t' <(cut -f1-3 unionAB.bed) unionAB.bed | uniq > unionABSelfMapped.bed
Let's take a quick look at unionABSelfMapped.bed:
$ less unionABSelfMapped.bed
scaffold_1 31 32 1_2_20_0;0_0_0_20
scaffold_1 32 33 19_1_0_0
scaffold_1 261 262 20_0_0_0
We can post-process this self-mapped result into a format that matches your desired result, skipping un-merged elements and turning the ID field back into a series of tab-delimited strings:
$ awk '{ \
fsn = split($4, fs, ";"); \
if (fsn > 1) { \
split(fs[1], fs1, "_"); \
split(fs[2], fs2, "_"); \
printf("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n", $1, $2, fs1[1], fs1[2], fs1[3], fs1[4], fs2[1], fs2[2], fs2[3], fs2[4]); \
} \
}' unionABSelfMapped.bed > answer.txt
The result:
$ less answer.txt
scaffold_1 31 1 2 20 0 0 0 0 20
The reason I suggest this approach is two-fold:
- Walking through files and processing them one line at a time requires very very little memory and is relatively fast, especially at the multi-GB scale!
- Sorting is very expensive at that scale, because unless you have huge amounts of system memory to throw at sorting, you usually have to use temporary files - and disk I/O is very expensive! If your inputs are already ordered, you can pre-process them into sorted BED files very cheaply and start doing very fast and very low-memory BEDOPS operations immediately.
With further work, parts of this approach can be pipelined using standard input and output streams, which would reduce the amount of expensive disk I/O and make this process even faster!
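For instance, a mostly streamed sketch (assuming, as above, that setA.txt and setB.txt are already position-ordered, so the awk output is sorted BED) could skip the per-set BED files and the self-mapped intermediate, leaving only unionAB.bed and answer.txt on disk:
$ bedops --everything \
    <(awk '{print $1"\t"$2"\t"($2 + 1)"\t"$3"_"$4"_"$5"_"$6}' setA.txt) \
    <(awk '{print $1"\t"$2"\t"($2 + 1)"\t"$3"_"$4"_"$5"_"$6}' setB.txt) \
    > unionAB.bed
$ bedmap --echo --echo-map-id --faster --delim '\t' <(cut -f1-3 unionAB.bed) unionAB.bed \
    | uniq \
    | awk '{ fsn = split($4, fs, ";"); if (fsn > 1) { split(fs[1], fs1, "_"); split(fs[2], fs2, "_"); printf("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n", $1, $2, fs1[1], fs1[2], fs1[3], fs1[4], fs2[1], fs2[2], fs2[3], fs2[4]); } }' \
    > answer.txt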
Are both of these files already sorted on the first two columns? How badly do you need a solution in AWK?
Hi, Thanks for your quick reply!
The files are not sorted and there are rows in file1 that are not present in file2 and vice versa. I don't need to use awk, I just couldn't find another solution. If you have one I would love to hear it, thanks!
Thanks!! Do you have any tips to get me started, I've downloaded SQLite but no idea where to begin!
I mean, two 15 GB SQLite tables, each with an index on the first two columns, are going to come to 50+ GB. MySQL/Postgres will be even bigger. Databases are a very simple and clean solution (which could help down the road), but it will be pretty slow too. One sec, I'll write some quick and dirty Python to do this..
That would be amazing thanks so much! I've tried for the last couple of days to write a python script but it was running for 7+ hours!
Here is a python script that doesn't need to sort:
I would be interested to know how it performs compared to the sort-based implementations. Use it like:
You can also specify an --output directory, and a --delimiter if it's not tab.
NOTE: It keeps the rows that are unique to each file - which you said should never happen. So the output of the demo is:
To change that to drop the shorter rows is trivial - but I would be inclined to leave it as it is then test later that all rows have the same number of columns with something like:
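For instance, something along these lines would do it (swap in whatever file your combined output was written to; combined_output.txt here is just a placeholder):
$ awk -F'\t' 'NR == 1 { n = NF } NF != n { print "line " NR " has " NF " columns, expected " n }' combined_output.txt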
Usually shell commands (implemented in C), including AWK, are quicker than scripts written in a scripting language like Python for simple problems. @AW, please paste the time comparison if you're interested.
I definitely will thanks! Your suggestion has been working very well but unfortunately the sort step is taking an extremely long time. I googled this and it seems that I might expect the sort step to run for several hours. Do you have any alternatives for this?
Use the --parallel option in sort.
Ref: http://unix.stackexchange.com/questions/120096/how-to-sort-big-files
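For example (the file name, sort keys, thread count, buffer size, and temp directory below are placeholders to adapt):
$ sort --parallel=8 --buffer-size=4G -T /path/to/tmp -k1,1 -k2,2n setA.txt > setA.sorted.txt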
Yes I saw this but it didn't work for me. And when I ran sort --help there was no option for parallel? Any idea why this might be? Thanks so much for your help!!!!
Update your sort.
Thanks, I'll try that!
Ah, I'm not so sure, Zhilong. It really depends on who writes the script and what the problem is.
In this example, I'm very confident that the Python will be several times faster than the Unix pipeline, because there's no sorting required. The join is done in RAM because it will fit after being split into chromosomes. There is no way to do that by combining different Unix tools, and so even though they are faster at what they do individually - they don't allow you to take shortcuts. Some might argue that they prevent you from even "seeing" the shortcuts, because you are limited by the tools in your toolbox.
ha, I pasted the awk version, too. Here the problem is memory limit (similar to python).
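(The pasted awk version isn't reproduced above, but an in-memory awk join of files like these boils down to something like the sketch below - column layout assumed from the example data, file names taken from the answer above, combined.txt a placeholder - and the whole of the first file has to sit in an associative array, which is exactly where the memory limit bites.)
$ awk 'BEGIN { OFS = "\t" }
       NR == FNR { a[$1 SUBSEP $2] = $3 OFS $4 OFS $5 OFS $6; next }          # first file: keep its four count columns in memory, keyed on scaffold + position
       ($1 SUBSEP $2) in a { print $1, $2, a[$1 SUBSEP $2], $3, $4, $5, $6 }  # second file: emit only positions present in both
      ' setA.txt setB.txt > combined.txt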
Amazing thank you very much for this!! I have started running it now and will let you know how it goes. There are many rows that are unique to each file and I'm not really interested in these so I may try to remove these later. Thanks again!
Oh whoops, I misread your comment about rows in file1 not in file2. I read that they are unsorted but that the first two columns always appear in both. If I'd known, I'd have added an extra line to not write those lines in the output. Soz!
No worries! I'll have a go at modifying the script but might ask for some advice if I get stuck! Thanks!
Ah it's OK, it's only a two second tweak: http://paste.ofcode.org/qHkDxaa2M7KGc4ddZa9u57
This will now only keep the rows where every input file was combined. It will run quicker than before too, since there's less data to write out :P
But you"ll need to keep a look out for how much RAM is being used. It should be around the size of all the chromosome 1s combined. A structured numpy array or something it would be a lot smaller, but because that comes with a lot of nasty edge cases it would take me the whole day to write and debug so this will have to do :P
Wow thanks that's great!
I ran into an error!
Any ideas?
Ah, I'm guessing you're on a Mac? They have a really low limit for the maximum number of open files (256 by default), while many Linux PCs will have a default of 65536. I always do this with chromosomes so I never go above 60-80 contigs, but if you have a lot of scaffolds I guess you will hit it. You can check with:
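$ ulimit -n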
and if it's low you can increase it like:
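$ ulimit -n 4096    # 4096 is just an example value - anything comfortably above your number of scaffolds will do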
This change only lasts for the current terminal session, and will have to be re-applied when you open a new terminal. You can set it system-wide, but I wouldn't. Macs don't like it when you change their internals.
Hope that helps mate :/ I feel bad that you're trying to do this simple thing, but you keep running up against these walls :P
On the plus side, having more than 256 scaffolds means that the in-memory join will be very unlikely to run out of RAM :)