Comparing 2 Large Lists (Millions Of Rows) To Identify Shared And Exclusive Elements
8
6
Entering edit mode
11.9 years ago
Bioinfosm ▴ 620

Hi,

Am looking for a fast way to parse through 2 large lists with millions of elements and identify the ones shared by both the lists, exclusive to list1, exclusive to list2. For smaller lists, perl hashes or the venny tool are useful, but I needed to do the same with these huge lists.

FWIW, these lists are actually read-ID from NGS data. The reads were aligned using 2 different approaches, and I wish to investigate of the ~ 30 million reads mapped by both approaches, how many are common, how many and which ones are aligned using only one approach and not the other. I mention this in case there is something in bedtools, bamtools or the like that might be relevant here.

thanks!

EDIT: Thanks for all the responses. I got bogged by other things and never got the chance to check on these. Will work my way through and add further comments/notes..

unix list intersect • 43k views
ADD COMMENT
1
Entering edit mode

If the aligners in use keep all the reads in the input order (bwa/bowtie/soap2 among many others do that by default), you can simply use paste or reading two files line by line.

ADD REPLY
0
Entering edit mode

How much memory do you have? If they are just read ids, 30 million (assuming 30 characters each) will probably take around a gig of memory. It's pretty reasonable.

ADD REPLY
0
Entering edit mode

I've seen python dictionaries go above 10 gigabytes. So, depending on the size and the memory available, I think python could work here.

ADD REPLY
0
Entering edit mode

you don't need dictionary, use set() in python

ADD REPLY
13
Entering edit mode
11.9 years ago

Assuming you are referring to IDs and you have lists that are structured something like the following:

$ cat list-1.txt
foo
bar
baz
zab
oof

$ cat list-2.txt
zab
baz
zit

You can first sort the two files:

$ sort list-1.txt > list-1.sorted.txt
$ sort list-2.txt > list-2.sorted.txt

$ cat list-1.sorted.txt
bar
baz
foo
oof
zab

$ cat list-2.sorted.txt
baz
zab
zit

Now, you can use join to find the elements that are common to both sets:

$ join -1 1 -2 1 list-1.sorted.txt list-2.sorted.txt
baz
zab

Use the -v option to find those elements that are exclusive to set 1:

join -v 1 -1 1 -2 1 list-1.sorted.txt list-2.sorted.txt
bar
foo
oof

Exclusive to set 2:

join -v 2 -1 1 -2 1 list-1.sorted.txt list-2.sorted.txt
zit

Or, you can use the comm command to do it all in one step. The first column are elements exclusive to set 1, the second column is elements that are exclusive to set 2, and the third column is elements that are common to the two sets. join is more flexible, but in simple cases like this, comm can be quite useful.

$ comm list-1.sorted.txt list-2.sorted.txt 
bar
        baz
foo
oof
        zab
    zit
ADD COMMENT
1
Entering edit mode

I hope I don't sound like I'm whining. I don't mean to be. I'm a little surprised that the bash solution is so strongly preferred given that it duplicates the original files in the sorting step. Also, sorting is O(NlogN) and (for me at least) rather painful for millions of reads. I used to use bash for tasks like this, but moved to python for the above reasons.

ADD REPLY
3
Entering edit mode

30 million strings is pushing the limit of memory. Even if you do it right in C (EDIT: with a hash table; there are much more memory-efficient solutions, e.g. using FM-index, but that is overkilling), you need around 1GB to hold them in RAM, depending on the string lengths, and 2-3GB in perl/python due to their overhead. If the strings are longer or OP has more alignments to process in future, loading keys into RAM will be more problematic. fgrep is much worse here. Aaron's solution works with billions of alignments under trivial memory. BTW, for simple joining unsorted list, awk also works: awk 'BEGIN{while((getline<"A.txt")>0)l[$1]=1}l[$1]' B.txt.

ADD REPLY
0
Entering edit mode

Fair enough. My main platform has 8 gigs. RAM is not usually one of my issues. Thanks for the alternative perspective.

ADD REPLY
0
Entering edit mode

When other columns in the files can identify the file number (e.g. 1 file has PG:bowtie and the other has PG:bowtie2), I usually use join -a1 -a2 list-1.sorted.txt list-2.sorted.txt. This puts everything in one data stream.

ADD REPLY
0
Entering edit mode

Thanks Heng, that is indeed a better option.

ADD REPLY
0
Entering edit mode

I like the comm solution.

ADD REPLY
7
Entering edit mode
11.9 years ago
sjneph ▴ 690

Stick to standard unix utilities unless something more fancy is actually needed. This method will be fast and furious.


  grep -F -x -f file1 file2 > common
  grep -v -F -x -f common file1 > file1.only
  grep -v -F -x -f common file2 > file2.only

ADD COMMENT
4
Entering edit mode
11.9 years ago

Here is a java program comparing two BAMs.

I've tested it using the picard library 1.62 and berkeleyDB java edition 4.1 , it should work for some more recent versions of the libraries.

Compilation & execute:

javac -cp path/to/picard.jar:path/to/sam.jar:path/to/je-4.1.10.jar Biostar63016.java
mkdir TMP
java -cp path/to/picard.jar:path/to/sam.jar:path/to/je-4.1.10.jar:. Biostar63016 -d TMP file1.bam file2.bam

here is a sample of the output (read-name / flag / positions-file1 / positions-file2 ).

$ java -cp ~/package/picard-tools-1.62/picard-1.62.jar:/home/lindenb/package/picard-tools-1.62/sam-1.62.jar:/home/lindenb/package/je-4.1.10/lib/je-4.1.10.jar:. Biostar63016 -d TMP ~/samtools-0.1.18/examples/ex1b.bam ~/samtools-0.1.18/examples/ex1f.bam   |  head -n 50

Feb 06, 2013 10:02:45 PM Biostar63016 run
INFO: in /home/lindenb/samtools-0.1.18/examples/ex1b.bam 1
Feb 06, 2013 10:02:47 PM Biostar63016 run
INFO: in /home/lindenb/samtools-0.1.18/examples/ex1f.bam 1
B7_589:1:101:825:28    NE    (empty)    tid-0:1079,tid-0:879
B7_589:1:101:825:28a    NE    (empty)    tid-0:1079,tid-0:879
B7_589:1:101:825:28b    EQ    tid-0:1079,tid-0:879    tid-0:1079,tid-0:879
B7_589:1:110:543:934    NE    (empty)    tid-1:514,tid-1:700
B7_589:1:110:543:934a    NE    (empty)    tid-1:514,tid-1:700
B7_589:1:110:543:934b    EQ    tid-1:514,tid-1:700    tid-1:514,tid-1:700
B7_589:1:122:337:968    NE    (empty)    tid-0:823,tid-0:981
B7_589:1:122:337:968a    NE    (empty)    tid-0:823,tid-0:981
B7_589:1:122:337:968b    EQ    tid-0:823,tid-0:981    tid-0:823,tid-0:981
B7_589:1:122:77:789    NE    (empty)    tid-0:223,tid-0:396
B7_589:1:122:77:789a    NE    (empty)    tid-0:223,tid-0:396
B7_589:1:122:77:789b    EQ    tid-0:223,tid-0:396    tid-0:223,tid-0:396
B7_589:1:168:69:249    NE    (empty)    tid-1:936,tid-1:1125
B7_589:1:168:69:249a    NE    (empty)    tid-1:936,tid-1:1125
B7_589:1:168:69:249b    EQ    tid-1:936,tid-1:1125    tid-1:936,tid-1:1125
B7_589:1:29:529:379    NE    (empty)    tid-0:1117,tid-0:926
B7_589:1:29:529:379a    NE    (empty)    tid-0:1117,tid-0:926
B7_589:1:29:529:379b    EQ    tid-0:1117,tid-0:926    tid-0:1117,tid-0:926
B7_589:2:30:644:942    NE    (empty)    tid-1:1229,tid-1:1045
B7_589:2:30:644:942a    NE    (empty)    tid-1:1229,tid-1:1045
B7_589:2:30:644:942b    EQ    tid-1:1229,tid-1:1045    tid-1:1229,tid-1:1045
B7_589:2:73:730:487    NE    (empty)    tid-0:604,tid-0:770
B7_589:2:73:730:487a    NE    (empty)    tid-0:604,tid-0:770
B7_589:2:73:730:487b    EQ    tid-0:604,tid-0:770    tid-0:604,tid-0:770
B7_589:2:9:49:661    NE    (empty)    tid-1:591,tid-1:747
B7_589:2:9:49:661a    NE    (empty)    tid-1:591,tid-1:747
B7_589:2:9:49:661b    EQ    tid-1:591,tid-1:747    tid-1:591,tid-1:747
B7_589:3:71:478:175    NE    (empty)    tid-1:171,tid-1:317
B7_589:3:71:478:175a    NE    (empty)    tid-1:171,tid-1:317
B7_589:3:71:478:175b    EQ    tid-1:171,tid-1:317    tid-1:171,tid-1:317
B7_589:3:82:13:897    NE    (empty)    tid-1:606,tid-1:453
B7_589:3:82:13:897a    NE    (empty)    tid-1:606,tid-1:453
B7_589:3:82:13:897b    EQ    tid-1:606,tid-1:453    tid-1:606,tid-1:453
B7_589:4:54:989:654    NE    (empty)    tid-0:1108,tid-0:1296
B7_589:4:54:989:654a    NE    (empty)    tid-0:1108,tid-0:1296
B7_589:4:54:989:654b    EQ    tid-0:1108,tid-0:1296    tid-0:1108,tid-0:1296
B7_589:5:147:405:738    NE    (empty)    tid-1:870,tid-1:1048
B7_589:5:147:405:738a    NE    (empty)    tid-1:870,tid-1:1048
B7_589:5:147:405:738b    EQ    tid-1:870,tid-1:1048    tid-1:870,tid-1:1048
ADD COMMENT
0
Entering edit mode
ADD REPLY
3
Entering edit mode
11.9 years ago
KCC ★ 4.1k

I have been thinking along Keeping Tags That Are Mapped To The Same Place By Two Aligners, about how results from different aligners would compare. I think you can use a python program to do what you want.

Look at How Do I Do The Intersection Of Two Sam Files by Damian Kao, from one of my earlier questions.

You can modify it to just count the number of values in both files instead of printing them out. In addition, now that I know more python, I would change the dictionary to a set. Although, I don't think it will make it much faster, I think that it might reduce the space used.

I haven't tested this modification, but it should work:

import sys
from sets import Set

FileA = open(sys.argv[1],'r')
FileB = open(sys.argv[2],'r')

A = open("A.txt",'w')
B = open("B.txt",'w')
AB = open("Both.txt",'w')

reads = set([])
for name in FileA:
    reads.add(name) #store IDs in first file

for name in FileB:
    if name in reads:
        AB.write(name) #write IDs in intersection
    else:
        B.write(name) #write IDs only in second file

    reads.remove(name) #drop names that are in second file

for name in reads:
    A.write(name) #write IDs only in first file

If you saved this in a file called intersection.py, you can type "python intersection file1 file2" where "file1" and "file2" are you two lists of IDs.

The whole thing is O(N) time and O(N) space.

ADD COMMENT
0
Entering edit mode

reads.remove(name) should be withing if loop (see my answer). Plus, you can load set using: reads = set( name for name in open(sys.argv[1],'r') ) This should be slightly faster (and more pythonic;) )

ADD REPLY
0
Entering edit mode

Python sets have operators allowing to do this directly; considering you made set A from A.txt and B from B.txt, AB = A & B, Aonly = A - B , Bonly = B - A. Don't forget to close your files to free the allocated memory (or use with which is a very convient way to circumvent the traditional open then try then close statements using only five letters.

ADD REPLY
2
Entering edit mode
11.9 years ago
Johan ▴ 890

Picards CompareSAMs might be able to do what you are looking for: http://picard.sourceforge.net/command-line-overview.shtml#CompareSAMs assuming that your aligned data is available in bam-format, and that you're only interested in how many, not exactly which reads differ between the alignments.

ADD COMMENT
2
Entering edit mode
11.9 years ago

I find that this blog about performing set operations in the Unix shell to be a great reference: http://www.catonmat.net/blog/set-operations-in-unix-shell-simplified/

It's worth testing different approaches for memory usage and speed. Also if you use a recent version of GNU sort you can specify the amount of memory the shirt uses with "-S 50G" to use 50GB ram. This might stop sort from using intermediate files during the sort.

Cheers, Nathan

ADD COMMENT
0
Entering edit mode
11.9 years ago
Leszek 4.2k

I have updated George's code slightly:

#!/usr/bin/env python
"""Compares two files/list of elements
USAGE: python compare_sets.py file1 file2
"""
import sys

A = open("A.txt",'w')
B = open("B.txt",'w')
AB = open("Both.txt",'w')

#get names from first file into set
reads = set( name for name in open(sys.argv[1],'r') )

#parse second file
for name in open(sys.argv[2],'r'):
    if name in reads:
        AB.write(name)     #write IDs in intersection
        reads.remove(name) #drop names that are in second file
    else:
        B.write(name)      #write IDs only in second file

for name in reads:
    A.write(name)          #write IDs only in first file
ADD COMMENT
0
Entering edit mode
11.9 years ago

Just a comment: If your mapping tool calls multiple mapping loci for one read, I would highly suggest to unique the read IDs before doing any comparison. Otherwise, you might run into the problem of having more mapped reads than input reads.

ADD COMMENT

Login before adding your answer.

Traffic: 1538 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6