Comparing 2 Large Lists (Millions Of Rows) To Identify Shared And Exclusive Elements

12.2 years ago

Bioinfosm ▴ 620

Hi,

I am looking for a fast way to parse 2 large lists with millions of elements each and identify the elements shared by both lists, those exclusive to list1, and those exclusive to list2. For smaller lists, perl hashes or the venny tool are useful, but I need to do the same with these huge lists.

FWIW, these lists are actually read IDs from NGS data. The reads were aligned using 2 different approaches, and of the ~30 million reads mapped by both approaches, I wish to investigate how many are common, and how many (and which ones) are aligned by only one approach and not the other. I mention this in case there is something in bedtools, bamtools or the like that might be relevant here.

thanks!

EDIT: Thanks for all the responses. I got bogged down by other things and never got the chance to check on these. Will work my way through and add further comments/notes.

unix list intersect • 44k views

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 12.2 years ago by Bioinfosm ▴ 620

If the aligners in use keep all the reads in the input order (bwa/bowtie/soap2, among many others, do that by default), you can simply use paste or read the two files line by line.
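
A minimal sketch of the line-by-line idea, assuming each aligner's output has already been reduced to a two-column, tab-separated text file (read ID, then a flag of 1 for mapped or 0 for unmapped) and that both files list the same reads in the same order. The file layout and the function name compare_in_lockstep are hypothetical, chosen only for illustration; neither aligner writes this format directly.

```python
import io
from itertools import zip_longest


def compare_in_lockstep(handle_a, handle_b):
    """Count reads mapped by both, only A, only B, or neither.

    Assumes both handles yield "read_id<TAB>flag" lines in the same order,
    where flag is "1" (mapped) or "0" (unmapped). This layout is an
    assumption made for illustration.
    """
    counts = {"both": 0, "only_a": 0, "only_b": 0, "neither": 0}
    for line_a, line_b in zip_longest(handle_a, handle_b):
        if line_a is None or line_b is None:
            raise ValueError("files have different numbers of reads")
        id_a, flag_a = line_a.rstrip("\n").split("\t")
        id_b, flag_b = line_b.rstrip("\n").split("\t")
        if id_a != id_b:
            raise ValueError(f"read order differs: {id_a} vs {id_b}")
        mapped_a, mapped_b = flag_a == "1", flag_b == "1"
        if mapped_a and mapped_b:
            counts["both"] += 1
        elif mapped_a:
            counts["only_a"] += 1
        elif mapped_b:
            counts["only_b"] += 1
        else:
            counts["neither"] += 1
    return counts


# Tiny in-memory example standing in for two real files.
a = io.StringIO("r1\t1\nr2\t1\nr3\t0\n")
b = io.StringIO("r1\t1\nr2\t0\nr3\t1\n")
print(compare_in_lockstep(a, b))
```

Only one line from each file is held at a time, so memory use stays flat regardless of the number of reads. The approach fails loudly if the read order differs, which is the case where sorting or hashing is needed instead.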

ADD REPLY • link 12.2 years ago by lh3 33k

How much memory do you have? If they are just read ids, 30 million (assuming 30 characters each) will probably take around a gig of memory. It's pretty reasonable.
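
The estimate above can be checked with a few lines of arithmetic. This is a rough sketch: the 30-character ID length is the assumption stated above, and the per-string figure reported by sys.getsizeof varies between Python versions and platforms, so no exact number is claimed for it.

```python
import sys

n_reads = 30_000_000      # number of read IDs, from the question
id_length = 30            # assumed average ID length in characters

# Raw payload: one byte per ASCII character, no per-string overhead.
raw_bytes = n_reads * id_length
print(f"raw characters: {raw_bytes / 1e9:.2f} GB")

# CPython stores each str as a separate object with its own header, so the
# per-ID cost is larger than the raw characters. sys.getsizeof reports the
# size of one sample string on the running interpreter.
sample_id = "x" * id_length
per_str = sys.getsizeof(sample_id)
print(f"as Python str objects: {n_reads * per_str / 1e9:.2f} GB "
      "(not counting the set or dict that holds them)")
```

The first figure comes out at 0.90 GB, in line with "around a gig". The second is always larger, because each string object carries a header on top of its characters.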

ADD REPLY • link 12.2 years ago by Damian Kao 16k

I've seen python dictionaries go above 10 gigabytes. So, depending on the size and the memory available, I think python could work here.

ADD REPLY • link 12.2 years ago by KCC ★ 4.1k

You don't need a dictionary; use set() in python.

ADD REPLY • link 12.2 years ago by Leszek 4.2k

12.2 years ago

Aaronquinlan 12k

Assuming you are referring to IDs and you have lists that are structured something like the following:

$ cat list-1.txt
foo
bar
baz
zab
oof

$ cat list-2.txt
zab
baz
zit

You can first sort the two files:

$ sort list-1.txt > list-1.sorted.txt
$ sort list-2.txt > list-2.sorted.txt

$ cat list-1.sorted.txt
bar
baz
foo
oof
zab

$ cat list-2.sorted.txt
baz
zab
zit

Now, you can use join to find the elements that are common to both sets:

$ join -1 1 -2 1 list-1.sorted.txt list-2.sorted.txt
baz
zab

Use the -v option to find those elements that are exclusive to set 1:

$ join -v 1 -1 1 -2 1 list-1.sorted.txt list-2.sorted.txt
bar
foo
oof

Exclusive to set 2:

$ join -v 2 -1 1 -2 1 list-1.sorted.txt list-2.sorted.txt
zit

Or, you can use the comm command to do it all in one step. The first column holds elements exclusive to set 1, the second column holds elements exclusive to set 2, and the third column holds elements common to both sets. join is more flexible, but in simple cases like this, comm can be quite useful.

$ comm list-1.sorted.txt list-2.sorted.txt 
bar
        baz
foo
oof
        zab
    zit
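
For readers who want to see what comm is doing, here is a sketch of the same sorted-merge idea in Python. It is an illustration of the technique, not comm's actual implementation; the function name comm and the column numbering are chosen here to mirror the output above. It assumes both inputs are sorted in the order Python compares strings, which for plain ASCII IDs corresponds to sorting with LC_ALL=C.

```python
def comm(sorted_a, sorted_b):
    """Yield (column, item) pairs from two sorted iterables.

    column is 1 for items only in A, 2 for items only in B, and 3 for
    items in both, mirroring the three columns printed by comm. Inputs
    must be sorted in Python's string comparison order. Repeated items
    are paired off one-to-one.
    """
    iter_a, iter_b = iter(sorted_a), iter(sorted_b)
    a = next(iter_a, None)
    b = next(iter_b, None)
    while a is not None and b is not None:
        if a == b:
            yield 3, a
            a, b = next(iter_a, None), next(iter_b, None)
        elif a < b:
            yield 1, a
            a = next(iter_a, None)
        else:
            yield 2, b
            b = next(iter_b, None)
    # One input is exhausted; whatever remains is exclusive to the other.
    while a is not None:
        yield 1, a
        a = next(iter_a, None)
    while b is not None:
        yield 2, b
        b = next(iter_b, None)


list_1 = ["bar", "baz", "foo", "oof", "zab"]
list_2 = ["baz", "zab", "zit"]
for column, item in comm(list_1, list_2):
    print(column, item)
```

Because only the current item from each input is held in memory, the merge itself needs trivial RAM; the cost is paid up front in the sort.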

ADD COMMENT • link updated 12.2 years ago by Obi Griffith 20k • written 12.2 years ago by Aaronquinlan 12k

I hope I don't sound like I'm whining. I don't mean to be. I'm a little surprised that the bash solution is so strongly preferred given that it duplicates the original files in the sorting step. Also, sorting is O(NlogN) and (for me at least) rather painful for millions of reads. I used to use bash for tasks like this, but moved to python for the above reasons.

ADD REPLY • link 12.2 years ago by KCC ★ 4.1k

30 million strings is pushing the limit of memory. Even if you do it right in C (EDIT: with a hash table; there are much more memory-efficient solutions, e.g. using FM-index, but that is overkill), you need around 1GB to hold them in RAM, depending on the string lengths, and 2-3GB in perl/python due to their overhead. If the strings are longer or OP has more alignments to process in future, loading keys into RAM will be more problematic. fgrep is much worse here. Aaron's solution works with billions of alignments under trivial memory. BTW, for simple joining of unsorted lists, awk also works: awk 'BEGIN{while((getline<"A.txt")>0)l[$1]=1}l[$1]' B.txt.

ADD REPLY • link 12.2 years ago by lh3 33k

Fair enough. My main platform has 8 gigs. RAM is not usually one of my issues. Thanks for the alternative perspective.

ADD REPLY • link 12.2 years ago by KCC ★ 4.1k

When other columns in the files can identify the file number (e.g. 1 file has PG:bowtie and the other has PG:bowtie2), I usually use join -a1 -a2 list-1.sorted.txt list-2.sorted.txt. This puts everything in one data stream.

ADD REPLY • link 12.2 years ago by lh3 33k

Thanks Heng, that is indeed a better option.

ADD REPLY • link 12.2 years ago by Aaronquinlan 12k

I like the comm solution.

ADD REPLY • link 12.1 years ago by Manu Prestat 4.1k

12.2 years ago

sjneph ▴ 690

Stick to standard unix utilities unless something more fancy is actually needed. This method will be fast and furious.


  grep -F -x -f file1 file2 > common
  grep -v -F -x -f common file1 > file1.only
  grep -v -F -x -f common file2 > file2.only

ADD COMMENT • link 12.2 years ago by sjneph ▴ 690

12.2 years ago

Pierre Lindenbaum 166k

Here is a java program comparing two BAMs.

I've tested it using the picard library 1.62 and berkeleyDB java edition 4.1; it should also work with some more recent versions of the libraries.

	/**
	* Author: Pierre Lindenbaum PhD. @yokofakun
	* http://www.biostars.org/p/63016/
	* Compare two BAM files
	* tested with picard.1-62 and berkeleydb java edition 4.1.10
	*
	* compilation & exec:
	* javac -cp path/to/picard.jar:path/to/sam.jar:path/to/je-4.1.10.jar Biostar63016.java
	* mkdir TMP
	* java -cp path/to/picard.jar:path/to/sam.jar:path/to/je-4.1.10.jar:. Biostar63016 -d TMP file1.bam file2.bam
	*/

	import java.io.File;
	import java.io.IOException;
	import java.util.ArrayList;
	import java.util.HashMap;
	import java.util.HashSet;
	import java.util.Iterator;
	import java.util.List;
	import java.util.Map;
	import java.util.Set;
	import java.util.logging.Logger;

	import net.sf.samtools.SAMFileReader;
	import net.sf.samtools.SAMFileReader.ValidationStringency;
	import net.sf.samtools.SAMRecord;
	import net.sf.samtools.SAMSequenceRecord;

	import com.sleepycat.bind.tuple.StringBinding;
	import com.sleepycat.bind.tuple.TupleInput;
	import com.sleepycat.bind.tuple.TupleOutput;
	import com.sleepycat.je.Cursor;
	import com.sleepycat.je.Database;
	import com.sleepycat.je.DatabaseConfig;
	import com.sleepycat.je.DatabaseEntry;
	import com.sleepycat.je.Environment;
	import com.sleepycat.je.EnvironmentConfig;
	import com.sleepycat.je.LockMode;
	import com.sleepycat.je.OperationStatus;
	import com.sleepycat.je.Transaction;


	public class Biostar63016
	{
	private static final Logger LOG=Logger.getLogger(Biostar63016.class.getName());
	private static final String DATABASENAME="read2pos";
	private File dbHome;
	private Environment environment=null;
	private Database database=null;
	private Transaction txn;
	private int currentSamFileIndex=0;

	private static class Match
	{
	byte tid;
	int pos;

	@Override
	public int hashCode()
	{
	int result = 1;
	result = 31 * result + pos;
	result = 31 * result + tid;
	return result;
	}
	@Override
	public boolean equals(Object obj)
	{
	if (this == obj) { return true; }
	if (obj == null) { return false; }
	Match other = (Match) obj;
	if (tid != other.tid) { return false; }
	if(tid==-1) return true;
	if (pos != other.pos) { return false; }
	return true;
	}
	@Override
	public String toString()
	{
	if(tid<0) return "unmapped";
	return String.valueOf(tid)+":"+pos;
	}
	}

	/** fileid to matches */
	private List<Set<Match>> decode(final DatabaseEntry data)
	throws IOException
	{
	TupleInput in=new TupleInput(data.getData());
	final byte nfiles=2;
	List<Set<Match>> L=new ArrayList<Set<Match>>(nfiles);
	for(int i=0;i< nfiles;++i)
	{
	byte nmatches=in.readByte();
	Set<Match> set=new HashSet<Match>(nmatches);
	L.add(set);
	for(int j=0;j< nmatches;++j)
	{
	Match m=new Match();
	m.tid=in.readByte();
	m.pos=in.readInt();
	set.add(m);
	}
	}

	return L;
	}

	private void print(final Set<Match> set,Map<Integer,String> tid2name)
	{
	boolean first=true;
	for(Match m:set)
	{
	if(!first)System.out.print(',');
	first=false;
	if(m.tid<0){ System.out.print("unmapped"); continue;}
	String seqName=tid2name.get(m.tid);
	if(seqName==null) seqName="tid-"+m.tid;
	System.out.print(String.valueOf(seqName+":"+(m.pos)));
	}
	if(first) System.out.print("(empty)");
	}

	private void encode(DatabaseEntry data,final List<Set<Match>> L)
	{
	TupleOutput out=new TupleOutput();
	for(Set<Match> set:L)
	{
	out.writeByte((byte)Math.min(Byte.MAX_VALUE, set.size()));
	int count=0;
	for(Match m:set)
	{
	if(++count>=Byte.MAX_VALUE) break;
	out.writeByte(m.tid);
	out.writeInt(m.pos);
	}
	}
	data.setData(out.getBufferBytes(),out.getBufferOffset(),out.getBufferLength());
	}

	private void run(String[] args)
	throws Exception
	{
	int optind=0;
	while(optind< args.length)
	{
	if(args[optind].equals("-h") ||
	args[optind].equals("-help") ||
	args[optind].equals("--help"))
	{
	System.err.println("Pierre Lindenbaum PhD. 2013");
	System.err.println("Options:");
	System.err.println(" -h help; This screen.");
	return;
	}
	else if(args[optind].equals("-d") && optind+1< args.length)
	{
	this.dbHome=new File(args[++optind]);
	}
	else if(args[optind].equals("--"))
	{
	optind++;
	break;
	}
	else if(args[optind].startsWith("-"))
	{
	System.err.println("Unknown option "+args[optind]);
	return;
	}
	else
	{
	break;
	}
	++optind;
	}
	if(this.dbHome==null)
	{
	System.err.println("db-home undefined");
	return;
	}
	if(args.length-optind!=2)
	{
	System.err.println("Expected 2 bams");
	return;
	}
	Map<Integer,String> tid2name=null;
	DatabaseEntry key=new DatabaseEntry();
	DatabaseEntry data=new DatabaseEntry();

	EnvironmentConfig envConfig= new EnvironmentConfig();
	envConfig.setAllowCreate(true);
	envConfig.setReadOnly(false);
	envConfig.setConfigParam(EnvironmentConfig.LOG_FILE_MAX,"250000000");
	envConfig.setTransactional(true);
	this.environment= new Environment(this.dbHome, envConfig);

	this.txn=this.environment.beginTransaction(null, null);
	DatabaseConfig cfg= new DatabaseConfig();
	cfg.setAllowCreate(true);
	cfg.setReadOnly(false);
	cfg.setTransactional(true);
	this.database= this.environment.openDatabase(txn,DATABASENAME,cfg);

	currentSamFileIndex=0;
	while(optind<args.length)
	{
	long nReads=0L;
	File samFile=new File(args[optind++]);
	SAMFileReader samFileReader=new SAMFileReader(samFile);
	samFileReader.setValidationStringency(ValidationStringency.SILENT);
	if(samFileReader.getFileHeader().getSequenceDictionary().getSequences().size()>=Byte.MAX_VALUE)
	{
	System.err.println("Too many Ref Sequences . Limited to "+Byte.MAX_VALUE);
	return;
	}
	if(tid2name==null)
	{
	tid2name=new HashMap<Integer, String>();
	for(SAMSequenceRecord ssr:samFileReader.getFileHeader().getSequenceDictionary().getSequences())
	{
	tid2name.put(ssr.getSequenceIndex(), ssr.getSequenceName());
	}
	}

	for(Iterator<SAMRecord> iter=samFileReader.iterator();
	iter.hasNext(); )
	{
	if(nReads++%10000000==0) LOG.info("in "+samFile+" "+nReads);
	SAMRecord rec=iter.next();
	StringBinding.stringToEntry(rec.getReadName(), key);

	List<Set<Match>> matches=null;
	if(this.database.get(this.txn, key, data, LockMode.DEFAULT)==OperationStatus.SUCCESS)
	{
	matches=decode(data);
	}
	else
	{
	matches=new ArrayList<Set<Match>>();
	for(int i=0;i< 2;++i) matches.add(new HashSet<Match>());
	}

	Match match=new Match();
	match.tid=(byte)rec.getReferenceIndex().intValue();
	match.pos=rec.getAlignmentStart();
	matches.get(this.currentSamFileIndex).add(match);
	encode(data,matches);
	if(this.database.put(this.txn, key, data)!=OperationStatus.SUCCESS)
	{
	System.err.println("BDB error.");
	System.exit(-1);
	}
	}
	samFileReader.close();
	this.currentSamFileIndex++;
	}
	//compute the differences for each read
	key=new DatabaseEntry();
	Cursor c=this.database.openCursor(txn,null);
	while(c.getNext(key, data,LockMode.DEFAULT)==OperationStatus.SUCCESS)
	{
	System.out.print(StringBinding.entryToString(key));
	List<Set<Match>> matches=decode(data);
	final Set<Match> first=matches.get(0);
	final Set<Match> second=matches.get(1);
	if(first.equals(second))
	{
	System.out.print("\tEQ\t");
	}
	else
	{
	System.out.print("\tNE\t");
	}
	print(first,tid2name);
	System.out.print("\t");
	print(second,tid2name);
	System.out.println();
	}
	c.close();
	this.database.close();
	this.environment.removeDatabase(txn, DATABASENAME);
	this.txn.commit();
	this.environment.close();
	}

	public static void main(String[] args) throws Exception
	{
	new Biostar63016().run(args);
	}

	}

Compilation & execution:

javac -cp path/to/picard.jar:path/to/sam.jar:path/to/je-4.1.10.jar Biostar63016.java
mkdir TMP
java -cp path/to/picard.jar:path/to/sam.jar:path/to/je-4.1.10.jar:. Biostar63016 -d TMP file1.bam file2.bam

Here is a sample of the output (read-name / flag / positions-file1 / positions-file2).

$ java -cp ~/package/picard-tools-1.62/picard-1.62.jar:/home/lindenb/package/picard-tools-1.62/sam-1.62.jar:/home/lindenb/package/je-4.1.10/lib/je-4.1.10.jar:. Biostar63016 -d TMP ~/samtools-0.1.18/examples/ex1b.bam ~/samtools-0.1.18/examples/ex1f.bam   |  head -n 50

Feb 06, 2013 10:02:45 PM Biostar63016 run
INFO: in /home/lindenb/samtools-0.1.18/examples/ex1b.bam 1
Feb 06, 2013 10:02:47 PM Biostar63016 run
INFO: in /home/lindenb/samtools-0.1.18/examples/ex1f.bam 1
B7_589:1:101:825:28    NE    (empty)    tid-0:1079,tid-0:879
B7_589:1:101:825:28a    NE    (empty)    tid-0:1079,tid-0:879
B7_589:1:101:825:28b    EQ    tid-0:1079,tid-0:879    tid-0:1079,tid-0:879
B7_589:1:110:543:934    NE    (empty)    tid-1:514,tid-1:700
B7_589:1:110:543:934a    NE    (empty)    tid-1:514,tid-1:700
B7_589:1:110:543:934b    EQ    tid-1:514,tid-1:700    tid-1:514,tid-1:700
B7_589:1:122:337:968    NE    (empty)    tid-0:823,tid-0:981
B7_589:1:122:337:968a    NE    (empty)    tid-0:823,tid-0:981
B7_589:1:122:337:968b    EQ    tid-0:823,tid-0:981    tid-0:823,tid-0:981
B7_589:1:122:77:789    NE    (empty)    tid-0:223,tid-0:396
B7_589:1:122:77:789a    NE    (empty)    tid-0:223,tid-0:396
B7_589:1:122:77:789b    EQ    tid-0:223,tid-0:396    tid-0:223,tid-0:396
B7_589:1:168:69:249    NE    (empty)    tid-1:936,tid-1:1125
B7_589:1:168:69:249a    NE    (empty)    tid-1:936,tid-1:1125
B7_589:1:168:69:249b    EQ    tid-1:936,tid-1:1125    tid-1:936,tid-1:1125
B7_589:1:29:529:379    NE    (empty)    tid-0:1117,tid-0:926
B7_589:1:29:529:379a    NE    (empty)    tid-0:1117,tid-0:926
B7_589:1:29:529:379b    EQ    tid-0:1117,tid-0:926    tid-0:1117,tid-0:926
B7_589:2:30:644:942    NE    (empty)    tid-1:1229,tid-1:1045
B7_589:2:30:644:942a    NE    (empty)    tid-1:1229,tid-1:1045
B7_589:2:30:644:942b    EQ    tid-1:1229,tid-1:1045    tid-1:1229,tid-1:1045
B7_589:2:73:730:487    NE    (empty)    tid-0:604,tid-0:770
B7_589:2:73:730:487a    NE    (empty)    tid-0:604,tid-0:770
B7_589:2:73:730:487b    EQ    tid-0:604,tid-0:770    tid-0:604,tid-0:770
B7_589:2:9:49:661    NE    (empty)    tid-1:591,tid-1:747
B7_589:2:9:49:661a    NE    (empty)    tid-1:591,tid-1:747
B7_589:2:9:49:661b    EQ    tid-1:591,tid-1:747    tid-1:591,tid-1:747
B7_589:3:71:478:175    NE    (empty)    tid-1:171,tid-1:317
B7_589:3:71:478:175a    NE    (empty)    tid-1:171,tid-1:317
B7_589:3:71:478:175b    EQ    tid-1:171,tid-1:317    tid-1:171,tid-1:317
B7_589:3:82:13:897    NE    (empty)    tid-1:606,tid-1:453
B7_589:3:82:13:897a    NE    (empty)    tid-1:606,tid-1:453
B7_589:3:82:13:897b    EQ    tid-1:606,tid-1:453    tid-1:606,tid-1:453
B7_589:4:54:989:654    NE    (empty)    tid-0:1108,tid-0:1296
B7_589:4:54:989:654a    NE    (empty)    tid-0:1108,tid-0:1296
B7_589:4:54:989:654b    EQ    tid-0:1108,tid-0:1296    tid-0:1108,tid-0:1296
B7_589:5:147:405:738    NE    (empty)    tid-1:870,tid-1:1048
B7_589:5:147:405:738a    NE    (empty)    tid-1:870,tid-1:1048
B7_589:5:147:405:738b    EQ    tid-1:870,tid-1:1048    tid-1:870,tid-1:1048

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 12.2 years ago by Pierre Lindenbaum 166k

blogged about it here: http://plindenbaum.blogspot.fr/2013/02/a-tool-to-compare-bams.html

ADD REPLY • link 12.2 years ago by Pierre Lindenbaum 166k

12.2 years ago

KCC ★ 4.1k

I have been thinking along the lines of the question "Keeping Tags That Are Mapped To The Same Place By Two Aligners", about how results from different aligners would compare. I think you can use a python program to do what you want.

Look at the answer by Damian Kao to "How Do I Do The Intersection Of Two Sam Files", one of my earlier questions.

You can modify it to just count the number of values in both files instead of printing them out. In addition, now that I know more python, I would change the dictionary to a set. I don't think it will make it much faster, but it might reduce the space used.

I haven't tested this modification, but it should work:

import sys

FileA = open(sys.argv[1],'r')
FileB = open(sys.argv[2],'r')

A = open("A.txt",'w')
B = open("B.txt",'w')
AB = open("Both.txt",'w')

reads = set()
for name in FileA:
    reads.add(name) #store IDs in first file

for name in FileB:
    if name in reads:
        AB.write(name) #write IDs in intersection
        reads.remove(name) #drop names found in both files
    else:
        B.write(name) #write IDs only in second file

for name in reads:
    A.write(name) #write IDs only in first file

#close all files so buffered output is flushed
for handle in (FileA, FileB, A, B, AB):
    handle.close()

If you save this in a file called intersection.py, you can type "python intersection.py file1 file2", where "file1" and "file2" are your two lists of IDs.

The whole thing is O(N) time and O(N) space.

ADD COMMENT • link 12.2 years ago by KCC ★ 4.1k

reads.remove(name) should be inside the if block (see my answer). Plus, you can load the set using: reads = set( name for name in open(sys.argv[1],'r') ). This should be slightly faster (and more pythonic ;) )

ADD REPLY • link 12.2 years ago by Leszek 4.2k

Python sets have operators that do this directly: if you build set A from A.txt and set B from B.txt, then AB = A & B, Aonly = A - B, Bonly = B - A. Don't forget to close your files to release their resources (or use with, which is a very convenient way to replace the traditional open, then try, then close statements).
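
A short sketch of the set-operator approach with files opened via with. The helper names read_ids and compare are made up for this example, and the demo writes the small example lists from this thread to temporary files rather than using real read IDs.

```python
import os
import tempfile


def read_ids(path):
    """Load one ID per line into a set, ignoring blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def compare(path_a, path_b):
    """Return (shared, only_a, only_b) as three sets."""
    a, b = read_ids(path_a), read_ids(path_b)
    return a & b, a - b, b - a


# Demo on two temporary files holding the example lists from this thread.
with tempfile.TemporaryDirectory() as tmp:
    path_1 = os.path.join(tmp, "list-1.txt")
    path_2 = os.path.join(tmp, "list-2.txt")
    with open(path_1, "w") as out:
        out.write("foo\nbar\nbaz\nzab\noof\n")
    with open(path_2, "w") as out:
        out.write("zab\nbaz\nzit\n")
    shared, only_1, only_2 = compare(path_1, path_2)

print(sorted(shared), sorted(only_1), sorted(only_2))
```

Note that this holds both lists in memory at once, so it needs roughly twice the RAM of the version that loads one file into a set and streams the other.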

ADD REPLY • link 12.1 years ago by Manu Prestat 4.1k

12.2 years ago

Johan ▴ 890

Picard's CompareSAMs might be able to do what you are looking for: http://picard.sourceforge.net/command-line-overview.shtml#CompareSAMs, assuming that your aligned data is available in BAM format and that you're only interested in how many reads, not exactly which ones, differ between the alignments.

ADD COMMENT • link 12.2 years ago by Johan ▴ 890

12.2 years ago

Nathan S. Watson-Haigh ▴ 200

I find this blog post about performing set operations in the Unix shell to be a great reference: http://www.catonmat.net/blog/set-operations-in-unix-shell-simplified/

It's worth testing different approaches for memory usage and speed. Also, if you use a recent version of GNU sort, you can specify the amount of memory the sort uses, e.g. "-S 50G" to use 50GB of RAM. This might stop sort from using intermediate files during the sort.

Cheers, Nathan

ADD COMMENT • link 12.2 years ago by Nathan S. Watson-Haigh ▴ 200

12.2 years ago

Leszek 4.2k

I have updated George's code slightly:

#!/usr/bin/env python
"""Compares two files/list of elements
USAGE: python compare_sets.py file1 file2
"""
import sys

A = open("A.txt",'w')
B = open("B.txt",'w')
AB = open("Both.txt",'w')

#get names from first file into set
reads = set( name for name in open(sys.argv[1],'r') )

#parse second file
for name in open(sys.argv[2],'r'):
    if name in reads:
        AB.write(name)     #write IDs in intersection
        reads.remove(name) #drop names that are in second file
    else:
        B.write(name)      #write IDs only in second file

for name in reads:
    A.write(name)          #write IDs only in first file

ADD COMMENT • link 12.2 years ago by Leszek 4.2k

12.2 years ago

David Langenberger 11k

Just a comment: If your mapping tool reports multiple mapping loci for one read, I would strongly suggest deduplicating the read IDs before doing any comparison. Otherwise, you might run into the problem of having more mapped reads than input reads.
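
A small sketch of that deduplication step, assuming the read IDs have already been extracted into a plain list or iterable. The function name unique_ids and the example IDs are hypothetical.

```python
from collections import Counter


def unique_ids(read_ids):
    """Return the distinct read IDs and how many IDs appeared more than once."""
    counts = Counter(read_ids)
    multi = sum(1 for n in counts.values() if n > 1)
    return set(counts), multi


# Hypothetical alignment output: r2 is reported at two loci.
ids_from_aligner = ["r1", "r2", "r2", "r3"]
distinct, multi_mapped = unique_ids(ids_from_aligner)
print(len(ids_from_aligner), len(distinct), multi_mapped)
```

Here four alignment records collapse to three distinct reads, one of which was multi-mapped. On the command line, sort -u on the ID list gives the same distinct set.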

ADD COMMENT • link 12.2 years ago by David Langenberger 11k