Illumina reads typically have short barcodes of around 8bp. This is fine when you are sequencing a couple of people with unamplified WGS on a full flowcell. However, Illumina platforms have a non-insignificant rate of misassigned barcodes. The reason for this is still not clear; I suspect that some of it is sequencing error, some is impure reagents, and some is adapters breaking off, floating around, and ligating to reads from the wrong library. Regardless, there are different rates of crosstalk on different platforms. HiSeq 2500 seems to have much higher crosstalk than NextSeq, but it's difficult to validate because different runs give different results. But currently, JGI is operating under the assumption that NextSeq gives the lowest crosstalk of all Illumina platforms, and JGI sequences crosstalk-sensitive things on NextSeq despite its much lower quality compared to HiSeq 2500.
JGI does a lot of single-cell sequencing. These cells are lysed and MDA-amplified prior to sequencing; the result is an exponential range of coverage, which is very spiky. If you are just sequencing a single organism in a run, it doesn't matter. But, JGI sequences 92 individual single cells on a 96-well plate, all multiplexed together. If there is no crosstalk, that's fine; you get 92 kind of bad assemblies (hopefully 60% genome recovery for each well). But, there is a significant amount of crosstalk. This causes huge problems with assembly - even a 0.01% rate of crosstalk can result in 50% or more of non-target genome in your assembly, due to MDA's spikiness.
0.01% crosstalk is not important when you multiplex 10 humans, and only care about heterozygous or homozygous calls (though of course it is still crucial when looking for low-allele-fraction variants). But for single-cell sequencing, it is deal-breaking. The current best single-cell assembler (for Illumina reads) is Spades. It can handle MDA bias, which will yield 1x coverage in some places, and 100,000x coverage in other places. That means that 0.01% crosstalk will give 10x coverage from a different, multiplexed sample, to all other samples. Meaning, they will all assemble the same contig, which was derived from some other organism. So, you get false results.
This is a fundamental limitation of current technology. Reagents are impure (meaning, your adapters do not have 100% the barcodes you expect), sequencing platforms are inaccurate (Illumina base-calling is very sensitive to leading and trailing bases; with an 8-bp barcode, you basically get 6 "decent" bases) and, as far as I can tell, adapters do in fact break off and ligate to something else.
There is no overall solution to this. However! If you are doing multiplexed single-cell sequencing on Illumina platforms, I can recommend this:
1) Allow zero barcode mismatches when demultiplexing. This is absolutely crucial. You will, of course, and up with far more unbinned reads, but that's just the price of correctness.
2) Use NextSeq. In our tests, it has yielded the lowest crosstalk rate of NextSeq/HiSeq2500/Miseq. The error rate is vastly higher than HiSeq2500, of course, but in this situation crosstalk is more important.
3) Run CrossBlock. In synthetic tests, it eliminates 100% of contaminant contigs, with a false-positive removal rate of 0.03% (ignoring contigs under 500bp). This assumes that you multiplexed different organisms; with identical organisms, the false-positive rate will increase. Still, it can usually deal with 2-3 copies of an organism with no false positive removals. More than that is dicey. It will remove some contigs, but they will still be present somewhere. In practice, I have found that CrossBlock retains contigs somewhere (meaning, at least one copy of a sequence exists) even when there are 20 copies of the same organism.
What does CrossBlock do?
It compares coverage of contigs from the library that generated them, to coverage from all other libraries. If the coverage from other libraries is dramatically higher, a contig is considered a contaminant. It's quite simple.
When should you use CrossBlock?
You should always use CrossBlock when dealing with different organisms, multiplexed together, where there is spiky coverage (single-cell, but possibly other situations).
When should you not use CrossBlock?
Most of the time. CrossBlock is only relevant to assembling novel genomes. If you are not doing assembly, don't use it. If you are not multiplexing different organisms, don't use it. Particularly, if you are multiplexing lots of things that might be the same organism... Don't use it; it can yield a lot of false-positive removals in that case. It's actually pretty good when you have 2-5 members of the same species on a plate. But if you already know you have a plate of 96 cells that are all different strains of the same species, don't use CrossBlock.
Hi Andrew,
Yes, a log file would be essential here so I can see what phase it's stopping at. Additionally, the entire command line, and contents of readnamefile and refnamefile (if you used those), would be helpful, as would the BBMap version.
To clarify, are these transcriptomes of three different organisms that were multiplexed together? And if so, how different are they, and what clades?
-Brian