Here is a java program comparing two BAMs.
I've tested it using the picard library 1.62 and berkeleyDB java edition 4.1 , it should work for some more recent versions of the libraries.
Compilation & execute:
javac -cp path/to/picard.jar:path/to/sam.jar:path/to/je-4.1.10.jar Biostar63016.java
mkdir TMP
java -cp path/to/picard.jar:path/to/sam.jar:path/to/je-4.1.10.jar:. Biostar63016 -d TMP file1.bam file2.bam
here is a sample of the output (read-name / flag / positions-file1 / positions-file2 ).
$ java -cp ~/package/picard-tools-1.62/picard-1.62.jar:/home/lindenb/package/picard-tools-1.62/sam-1.62.jar:/home/lindenb/package/je-4.1.10/lib/je-4.1.10.jar:. Biostar63016 -d TMP ~/samtools-0.1.18/examples/ex1b.bam ~/samtools-0.1.18/examples/ex1f.bam | head -n 50
Feb 06, 2013 10:02:45 PM Biostar63016 run
INFO: in /home/lindenb/samtools-0.1.18/examples/ex1b.bam 1
Feb 06, 2013 10:02:47 PM Biostar63016 run
INFO: in /home/lindenb/samtools-0.1.18/examples/ex1f.bam 1
B7_589:1:101:825:28 NE (empty) tid-0:1079,tid-0:879
B7_589:1:101:825:28a NE (empty) tid-0:1079,tid-0:879
B7_589:1:101:825:28b EQ tid-0:1079,tid-0:879 tid-0:1079,tid-0:879
B7_589:1:110:543:934 NE (empty) tid-1:514,tid-1:700
B7_589:1:110:543:934a NE (empty) tid-1:514,tid-1:700
B7_589:1:110:543:934b EQ tid-1:514,tid-1:700 tid-1:514,tid-1:700
B7_589:1:122:337:968 NE (empty) tid-0:823,tid-0:981
B7_589:1:122:337:968a NE (empty) tid-0:823,tid-0:981
B7_589:1:122:337:968b EQ tid-0:823,tid-0:981 tid-0:823,tid-0:981
B7_589:1:122:77:789 NE (empty) tid-0:223,tid-0:396
B7_589:1:122:77:789a NE (empty) tid-0:223,tid-0:396
B7_589:1:122:77:789b EQ tid-0:223,tid-0:396 tid-0:223,tid-0:396
B7_589:1:168:69:249 NE (empty) tid-1:936,tid-1:1125
B7_589:1:168:69:249a NE (empty) tid-1:936,tid-1:1125
B7_589:1:168:69:249b EQ tid-1:936,tid-1:1125 tid-1:936,tid-1:1125
B7_589:1:29:529:379 NE (empty) tid-0:1117,tid-0:926
B7_589:1:29:529:379a NE (empty) tid-0:1117,tid-0:926
B7_589:1:29:529:379b EQ tid-0:1117,tid-0:926 tid-0:1117,tid-0:926
B7_589:2:30:644:942 NE (empty) tid-1:1229,tid-1:1045
B7_589:2:30:644:942a NE (empty) tid-1:1229,tid-1:1045
B7_589:2:30:644:942b EQ tid-1:1229,tid-1:1045 tid-1:1229,tid-1:1045
B7_589:2:73:730:487 NE (empty) tid-0:604,tid-0:770
B7_589:2:73:730:487a NE (empty) tid-0:604,tid-0:770
B7_589:2:73:730:487b EQ tid-0:604,tid-0:770 tid-0:604,tid-0:770
B7_589:2:9:49:661 NE (empty) tid-1:591,tid-1:747
B7_589:2:9:49:661a NE (empty) tid-1:591,tid-1:747
B7_589:2:9:49:661b EQ tid-1:591,tid-1:747 tid-1:591,tid-1:747
B7_589:3:71:478:175 NE (empty) tid-1:171,tid-1:317
B7_589:3:71:478:175a NE (empty) tid-1:171,tid-1:317
B7_589:3:71:478:175b EQ tid-1:171,tid-1:317 tid-1:171,tid-1:317
B7_589:3:82:13:897 NE (empty) tid-1:606,tid-1:453
B7_589:3:82:13:897a NE (empty) tid-1:606,tid-1:453
B7_589:3:82:13:897b EQ tid-1:606,tid-1:453 tid-1:606,tid-1:453
B7_589:4:54:989:654 NE (empty) tid-0:1108,tid-0:1296
B7_589:4:54:989:654a NE (empty) tid-0:1108,tid-0:1296
B7_589:4:54:989:654b EQ tid-0:1108,tid-0:1296 tid-0:1108,tid-0:1296
B7_589:5:147:405:738 NE (empty) tid-1:870,tid-1:1048
B7_589:5:147:405:738a NE (empty) tid-1:870,tid-1:1048
B7_589:5:147:405:738b EQ tid-1:870,tid-1:1048 tid-1:870,tid-1:1048
If the aligners in use keep all the reads in the input order (bwa/bowtie/soap2 among many others do that by default), you can simply use
paste
or reading two files line by line.How much memory do you have? If they are just read ids, 30 million (assuming 30 characters each) will probably take around a gig of memory. It's pretty reasonable.
I've seen python dictionaries go above 10 gigabytes. So, depending on the size and the memory available, I think python could work here.
you don't need dictionary, use set() in python