I recently received some data from an Illumina HiSeq lane, in which the reads are all supposed to start with one of ten different 6bp barcodes. But when I try to split up the data by barcode, more than half of the reads don't actually match any of the barcodes. I imagine that this is due to the relatively high error rate of HiSeq during the first few cycles.
How do you usually deal with this? I assume one option would be some sort of fuzzy matching from the real sequence to the barcode (e.g. allowing for one mismatched nt), although I'm afraid this might introduce a whole new kind of bias. Is there any software out there for this?
None of the headers appear to contain a barcode, so I don't think that is the problem. I haven't tried allowing for mismatches yet; that is what I meant by fuzzy matching above.
Which version of CASAVA was used for your FASTQ? Rather, what is your sequencer? Even better, could you paste 1 whole read including header, sequence and quality here?
You don't need anything elaborate for the fuzzy matching, as the maximum number of mismatches with which you could safely identify the barcode is just 1 (and even that holds only if every pair of barcodes differs at 3 or more positions). For example, in Perl you could just write a subroutine that counts the positions at which two strings differ and then call hd(str1, str2). The value it returns tells you the number of mismatches, and you can safely consider up to 1 mismatch, I believe. If not, I guess someone will correct me.

Thank you for the suggestions. I actually just ended up using the fastx barcode splitter with the --mismatch flag.
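
For anyone who wants to try the mismatch-counting approach by hand, here is a minimal sketch of it, written in Python rather than the Perl mentioned above. The name hd comes from the answer; min_pairwise_distance, assign_barcode and the three example barcodes are made up for illustration and are not from the original posts. It assumes all barcodes have the same length and that the barcode sits at the very start of the read.

```python
from itertools import combinations


def hd(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hd(a, b) for a, b in combinations(barcodes, 2))


def assign_barcode(read, barcodes, max_mismatches=1):
    """Return the one barcode within max_mismatches of the read's prefix.

    Returns None when no barcode is close enough, when more than one is
    (an ambiguous read), or when the read is shorter than the barcode.
    """
    length = len(barcodes[0])  # assumption: all barcodes share this length
    prefix = read[:length]
    if len(prefix) < length:
        return None
    hits = [bc for bc in barcodes if hd(prefix, bc) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    # Hypothetical barcodes, for illustration only.
    barcodes = ["ACGTAC", "TGCATG", "GATCGA"]
    # One mismatch is only safe if this value is at least 3.
    print(min_pairwise_distance(barcodes))
    print(assign_barcode("ACGTAATTTTGGGG", barcodes))
```

Checking min_pairwise_distance on the real barcode set first is worthwhile: if any two barcodes differ at fewer than 3 positions, a read with a single error can be equally close to both, and assign_barcode will return None for it rather than guess.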