Hi,
I am trying to understand what throws a barcode collision. When a collision is detected, bcl2fastq throws the following error -
Barcodes with too few mismatches are ambiguous ( less than 2 times the number of mismatches plus 1)
Could someone clarify how the number of mismatches is summed? Specifically, is the sum taken from the i5 & i7 combined, from the i5 & i7 separately, or from the i7 only? Below I looked at a sample sheet with i7 (index) & i5 (index2) indices, and I want to say that the sum is taken from the i7 only, because that is where the number of mismatches seems to fit the error bcl2fastq throws.
Here are the relevant rows from the samplesheet -
index,index2,
CTTCCTTC,GAAGGAAG
CGTCTTCA,TGAAGACG
CTTCCTTC,GAAGGAAG
CGAACAAC,GTTGTTCG
GATCAGAT,AGATCTCG
TAGCTTAT,AGATCTCG
And here is what I got for the following --barcode-mismatches arguments -
--barcode-mismatches 1: No Barcode Collision
GATCAGAT+AGATCTCG & TAGCTTAT+AGATCTCG didn't throw a collision even though the i5's are the same
--barcode-mismatches 2: Barcode Collision
std::exception::what: Barcode collision for barcodes: CTTCCTTC+GAAGGAAG, CGTCTTCA+TGAAGACG
- The i7's, CTTCCTTC & CGTCTTCA, have 4 mismatches.
--barcode-mismatches 3: Barcode Collision
std::exception::what: Barcode collision for barcodes: CTTCCTTC+GAAGGAAG, CGAACAAC+GTTGTTCG
- The i7's, CTTCCTTC & CGAACAAC, have 5 mismatches.
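To make the counting concrete, here is a small Python sketch that computes the mismatches for the three pairs above, per index and combined. The helper names (hamming, threshold) are just for illustration and are not part of bcl2fastq; the threshold is taken from the wording of the error message. One thing the numbers suggest: in the two colliding pairs the i5 distance equals the i7 distance, so these examples alone cannot separate "i7 only" from "i5 and i7 separately", while the combined distance of the second pair (8) is not below 5, which does not seem to fit the "combined" reading.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length index sequences differ."""
    if len(a) != len(b):
        raise ValueError("indices must have the same length")
    return sum(x != y for x, y in zip(a, b))


def threshold(allowed_mismatches: int) -> int:
    # Taken from the error text: fewer than 2 * mismatches + 1 is "ambiguous".
    return 2 * allowed_mismatches + 1


# (i7, i5) pairs quoted in the post above.
pairs = [
    (("GATCAGAT", "AGATCTCG"), ("TAGCTTAT", "AGATCTCG")),
    (("CTTCCTTC", "GAAGGAAG"), ("CGTCTTCA", "TGAAGACG")),
    (("CTTCCTTC", "GAAGGAAG"), ("CGAACAAC", "GTTGTTCG")),
]

# Per-pair (i7 distance, i5 distance).
distances = [(hamming(a[0], b[0]), hamming(a[1], b[1])) for a, b in pairs]

for (a, b), (d7, d5) in zip(pairs, distances):
    print(f"{'+'.join(a)} vs {'+'.join(b)}: i7={d7}, i5={d5}, combined={d7 + d5}")

for m in (1, 2, 3):
    print(f"--barcode-mismatches {m}: ambiguous below {threshold(m)} mismatches")
```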
Please let me know if I can clarify something. I don't quite have a grasp on this so I'd be happy to provide more info to get some help.
Oof I guess my third example isn't practical. But, thank you! Especially if the max allowed is 2, then I can see in the case w/ --barcode-mismatches 2, it would be poor planning to have two indices that combined are only different by 4 mismatches.
EDIT
After reviewing more runs, I think the rule is this - mismatches in the i5 are only considered if the i7 is ambiguous; however, each index is considered separately. Take the example below, which didn't throw a barcode error. There are two samples in the same lane with i7 indices (index), GCGGTATT & CCGGAATT, that have only two mismatches. This seems to satisfy the rule for the error - NUM MISMATCHES < (ALLOWED_BARCODE_MISMATCH * 2) + 1. However, an error wasn't thrown. I believe this is because the i5 indices, GGTAACAA and ACCGAATG, have 7 mismatches, which is well above the threshold. This is different from the cases in the original posting, where the collision errors were thrown when the i5 index was also ambiguous.
Does this make sense?
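Written out as code, the rule I'm hypothesizing would look something like the sketch below. This is only my guess at the behaviour, not bcl2fastq's actual implementation, and the function name collides_per_component is made up for illustration. It assumes the same --barcode-mismatches value applies to both indices. The checks at the bottom use the barcode pairs and settings from this thread; the setting for the GCGGTATT/CCGGAATT run isn't shown above, so that pair is tried with several values.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length index sequences differ."""
    if len(a) != len(b):
        raise ValueError("indices must have the same length")
    return sum(x != y for x, y in zip(a, b))


def collides_per_component(bc_a, bc_b, allowed_mismatches: int) -> bool:
    """Hypothesized rule: two barcodes collide only if EVERY index component
    (i7 and i5) has fewer than 2 * allowed_mismatches + 1 mismatches."""
    limit = 2 * allowed_mismatches + 1
    return all(hamming(x, y) < limit for x, y in zip(bc_a, bc_b))


# Pair from this edit: i7 has 2 mismatches, i5 has 7.
edit_a = ("GCGGTATT", "GGTAACAA")
edit_b = ("CCGGAATT", "ACCGAATG")
for m in (1, 2):
    print(f"mismatches={m}: collision={collides_per_component(edit_a, edit_b, m)}")

# Pairs from the original post, with the setting each was reported under.
print(collides_per_component(("GATCAGAT", "AGATCTCG"), ("TAGCTTAT", "AGATCTCG"), 1))
print(collides_per_component(("CTTCCTTC", "GAAGGAAG"), ("CGTCTTCA", "TGAAGACG"), 2))
print(collides_per_component(("CTTCCTTC", "GAAGGAAG"), ("CGAACAAC", "GTTGTTCG"), 3))
```

If the sketch is right, it agrees with every result reported so far: no collision for the first original pair at 1 mismatch, collisions for the other two original pairs, and no collision for the pair in this edit.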
SampleSheet
Command
Result
No collision error
There's no error here because there is no possible sequence which is one mistake away from two different barcode sequences. But there are sequences which are two mistakes away from two different barcodes.
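A brute-force check can illustrate this. The sketch below assumes the statement refers to the GATCAGAT+AGATCTCG / TAGCTTAT+AGATCTCG pair from the question (identical i5, 4 mismatches in the i7), so only the i7 is enumerated. It also assumes reads contain only A, C, G and T, and the function name reads_within is made up for illustration.

```python
from itertools import product


def hamming(a, b) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def reads_within(a: str, b: str, max_mismatches: int) -> list:
    """Every possible read that is within max_mismatches of BOTH a and b."""
    return [
        "".join(read)
        for read in product("ACGT", repeat=len(a))  # all 4**8 = 65536 8-mers
        if hamming(read, a) <= max_mismatches and hamming(read, b) <= max_mismatches
    ]


i7_a, i7_b = "GATCAGAT", "TAGCTTAT"  # 4 mismatches apart

shared_1 = reads_within(i7_a, i7_b, 1)
shared_2 = reads_within(i7_a, i7_b, 2)

print(f"reads within 1 mismatch of both: {len(shared_1)}")
print(f"reads within 2 mismatches of both: {len(shared_2)}")
print(shared_2)
```

With one allowed mismatch no read can be assigned to both barcodes, while with two allowed mismatches a handful of reads sit exactly halfway between them. That is the same condition as the error message: a shared read exists only when the distance is at most 2 times the number of mismatches, i.e. less than 2 times the number of mismatches plus 1.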