Hi, everybody. I have FASTQ headers of the form
@FCC3KD2ACXX:6:1101:1545:2184#ATCACGATC/1
The "ATCACGATC" portion of these older-style headers is supposed to be the "index sequence", or the molecular barcode of a multiplexed sample, according to the Wikipedia article. But I know what the barcodes are, and that isn't one of them. All the barcodes for this project are 6-8bp, not 9, and there are "only" about 300 legitimate barcodes in the lane.
Overall, there are over 255,000 different individual ones of these index sequences on various different reads, out of about 178M reads in the file. Some of them even contain Ns, but they're all exactly 9bp long. And this particular sequence (ATCACGATC) is by far the most prevalent -- it's on 90% of the reads, so it can't be succeeding at separating anything out very specifically.
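For reference, a rough sketch of how the tally can be done (the file name is a placeholder, and it assumes the 4-line FASTQ layout and the "#index/read" header form shown above):

    # Count how often each header index sequence appears in the FASTQ file.
    from collections import Counter

    counts = Counter()
    with open("reads.fastq") as fh:        # placeholder file name
        for i, line in enumerate(fh):
            if i % 4 == 0:                 # header line of each 4-line record
                # header looks like @...#ATCACGATC/1 -> take the part between '#' and '/'
                index = line.rstrip("\n").split("#")[1].split("/")[0]
                counts[index] += 1

    for index, n in counts.most_common(10):
        print(index, n)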
I'm coming in late to a project, and all I have to go on is the FASTQ files, the list of barcodes, and some wet-lab protocol docs that I'm not particularly qualified to interpret. Any idea what this odd, extraneous-looking sequence is? Thanks in advance!
Since some of those index sequences contain Ns, it's not barcode metadata; it's something that was sequenced. I'd imagine it's the first 9 bases of the sequenced DNA, cut out of the SEQ and pasted into the QNAME, probably as a crude way of handling 9 bp barcodes in the past.
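A minimal sketch of what that would look like for 4-line FASTQ records (the file names, the hardcoded /1 suffix, and the 9 bp length are assumptions, not anything I know about the actual pipeline):

    # Hypothesized behaviour: cut the first 9 bases out of each read and append
    # them to the header, throwing away their quality scores in the process.
    BARCODE_LEN = 9

    with open("original.fastq") as fin, open("rewritten.fastq", "w") as fout:
        while True:
            header = fin.readline().rstrip("\n")
            if not header:
                break
            seq = fin.readline().rstrip("\n")
            plus = fin.readline().rstrip("\n")
            qual = fin.readline().rstrip("\n")

            barcode = seq[:BARCODE_LEN]
            fout.write(header + "#" + barcode + "/1\n")   # barcode ends up in the QNAME
            fout.write(seq[BARCODE_LEN:] + "\n")
            fout.write(plus + "\n")
            fout.write(qual[BARCODE_LEN:] + "\n")         # matching quals are simply lost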
EDIT: I wonder if the machine has given different barcodes to the two reads of the same pair...?
@John: With an over-clustered flowcell it is possible to get N's in tag reads (one or more, and in particularly bad cases all of them can be N's).
A good sequencing facility will not release this kind of data (unless it's for investigative purposes). Here the OP has been left holding the bag with a mishmash of information and has to make sense of it all.
Right, but what I'm saying is that these 9 bp come from the read, not from some metadata added programmatically. For example, I assume the machines usually read the first few bases, try to align the barcodes they know about to 'call' a read's barcode, and then add the barcode they think it matches to the QNAME. That way, even if a base in the sequenced barcode is an N, the barcode in the QNAME will still be correct; and if the first base was somehow omitted from sequencing, it's still matched as barcode X. But this machine, since it's putting Ns in the QNAME, is probably just cutting the first 9 bp out of the read and sticking them in the header, losing the QUAL scores for all 9 bp in the process. Real lazy.
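Something like this toy 'calling' step is what I had in mind (purely hypothetical; the barcode list and mismatch threshold are made up, and this is not a description of what CASAVA/bcl2fastq actually does):

    # Assign the closest known barcode to the observed bases, treating N as a
    # mismatch and giving up if nothing is close enough.
    KNOWN_BARCODES = ["ATCACGATC", "CGATGTAAT"]   # made-up examples
    MAX_MISMATCHES = 1

    def call_barcode(observed):
        best, best_dist = None, None
        for bc in KNOWN_BARCODES:
            dist = sum(1 for a, b in zip(observed, bc) if a != b)
            if best_dist is None or dist < best_dist:
                best, best_dist = bc, dist
        return best if best_dist <= MAX_MISMATCHES else None

    print(call_barcode("ATCACGATN"))   # -> ATCACGATC; the N counts as one mismatch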
Sure. That 9 bp came from a real read. Even though the OP has 6-8 bp barcodes, someone chose to run this as a 9-cycle (or 10-cycle; one base at the end is removed by default) index read.
In the case of Illumina technology, both tag reads are read completely independently of the main R1/R2 reads (the sequencing order is R1 --> Index 1 --> Index 2 --> R2). During the run the machine has no way of matching barcodes to anything; de-multiplexing is handled by the CASAVA/bcl2fastq software during post-processing of the run.

As an aside: there is an easy way to get the index reads in fastq format (with Q-scores) as separate files, so one ends up with R1, R2, R3, R4 files corresponding to the order of reads above, via a command line option in bcl2fastq2 v2.x. With older versions of CASAVA some editing of run files was required to do that.
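For what it's worth, once the index reads are in their own fastq file (for bcl2fastq2 v2.x I believe the relevant option is --create-fastq-for-index-reads, but check the manual for your version), pairing them back up with the main reads is simple since the records are in the same order. A minimal sketch with placeholder file names:

    # Walk the main R1 fastq and its index-read fastq in parallel and report
    # each read's index sequence together with its quality string.
    from itertools import islice

    def fastq_records(path):
        """Yield (header, seq, plus, qual) tuples from a 4-line-per-record FASTQ file."""
        with open(path) as fh:
            while True:
                rec = [line.rstrip("\n") for line in islice(fh, 4)]
                if len(rec) < 4:
                    return
                yield tuple(rec)

    for (r1_hdr, _, _, _), (_, idx_seq, _, idx_qual) in zip(
            fastq_records("sample_R1.fastq"), fastq_records("sample_I1.fastq")):
        print(r1_hdr, idx_seq, idx_qual)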
The reason one gets N's in the tag reads is the same as for the main reads: if the clusters become too fat or too close together, the basecalling software is no longer able to call a base and uses N instead. This happens when a flowcell is over-clustered (and in that situation the tag reads are more vulnerable than the main reads). It is possible to get perfectly good sequence on the main reads but lose the tag reads, rendering the run useless.
Ah, I didn't know it was a totally separate read. I assumed it was sequenced with everything else and then cut out programmatically afterward. A totally separate sequencing step, eh? Gosh.
So now it makes sense to me. Thank you both, genomax2 and igor :)
The barcode read is a separate read. If you really want, you can get those quality scores.
Thanks for the excellent discussion, which I'm only now getting around to reading in detail but which continues to be relevant, since I'm dealing with several different flavors of unpleasantly non-"standard" data. Every bit of crowdsourced insight I can get helps! This particular dataset did turn out to respond well to standard demultiplexing, but not all of them do. I had the pleasure of learning informatics in a top-of-the-line organization with excellent standards, where I had access to everyone from the lab techs to the people running the mapping pipeline in case questions arose. In my new situation... not so much. :)