We used bcl2fastq (v2.20.0.422) to convert paired-end RNA-seq base call (BCL) files to FASTQ files. When setting up the sequencing run, we set both read lengths to 51 and both index lengths to 8 (see the relevant part of RunInfo.xml below):
<Reads>
<Read Number="1" NumCycles="51" IsIndexedRead="N"/>
<Read Number="2" NumCycles="8" IsIndexedRead="Y"/>
<Read Number="3" NumCycles="8" IsIndexedRead="Y"/>
<Read Number="4" NumCycles="51" IsIndexedRead="N"/>
</Reads>
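To make the cycle layout in this RunInfo.xml fragment explicit, here is a minimal Python sketch that parses the `<Reads>` block and prints which cycles belong to which segment. The function name `cycle_layout` and the variable `RUNINFO_READS` are made up for illustration; the XML is copied from the fragment above.

```python
import xml.etree.ElementTree as ET

# The <Reads> fragment from RunInfo.xml, copied from the post above.
RUNINFO_READS = """<Reads>
<Read Number="1" NumCycles="51" IsIndexedRead="N"/>
<Read Number="2" NumCycles="8" IsIndexedRead="Y"/>
<Read Number="3" NumCycles="8" IsIndexedRead="Y"/>
<Read Number="4" NumCycles="51" IsIndexedRead="N"/>
</Reads>"""


def cycle_layout(reads_xml):
    """Return (read_number, is_index, first_cycle, last_cycle) for each segment."""
    root = ET.fromstring(reads_xml)
    layout = []
    first = 1
    # Segments are sequenced in order of their Number attribute.
    for read in sorted(root.iter("Read"), key=lambda r: int(r.get("Number"))):
        n_cycles = int(read.get("NumCycles"))
        is_index = read.get("IsIndexedRead") == "Y"
        layout.append((int(read.get("Number")), is_index, first, first + n_cycles - 1))
        first += n_cycles
    return layout


layout = cycle_layout(RUNINFO_READS)
for number, is_index, first, last in layout:
    kind = "index" if is_index else "read"
    print(f"segment {number} ({kind}): cycles {first}-{last}")
```

The run has 118 cycles in total, split into four separate segments: two 51-cycle reads and two 8-cycle index reads.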
However, the reads were not assigned to the right libraries as specified in the SampleSheet: 95% of reads were assigned to undetermined. We noticed that the majority of the 'undetermined' reads show the same sequence at the i7 index position, but that sequence is not a real i7 index. We suspect that this 'i7 sequence' may be part of the adaptor sequence that was mistakenly read as the index.
Here is my question: to find the adaptor sequence and index for each read, we want bcl2fastq to report the whole sequence without trimming or masking any index or adaptor. How can we change the bcl2fastq options to make that happen?
We tried running bcl2fastq without a SampleSheet. It wrote 51 bp reads to the undetermined files, with no index sequence in the read names. We were expecting ~100 bp reads containing the index and adaptor sequences in the FASTQ.
Thank you.
Use the code in my answer here (Demultiplexing reads with index present in the labels) to figure out which index sequences are actually present in the undetermined reads files, where most reads ended up when you initially ran with the samplesheet. Sometimes you may have reverse complemented the sequences in the samplesheet, which will cause the reads to be placed in the undetermined pool.
bcl2fastq does not trim the data by default (as long as you don't include the adapter sequences in the samplesheet). It only relocates the index sequences to the fastq headers when it demultiplexes data.
Yeah, after checking the Illumina sequencing steps, I started to see that bcl2fastq does not trim data. Each cycle corresponds to one position in either a read or an index. In each run, read 1 is sequenced first (51 cycles in our case), then index i7 (8 cycles), then index i5 (8 cycles), and finally read 2 (51 cycles). So if priming works perfectly, no adaptor sequence is sequenced; in other words, there is no 'whole sequence' to recover.
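For the index tally suggested in the answer above, a minimal Python sketch could look like the following. It assumes Illumina-style FASTQ headers in which the index is the last colon-separated field (for example `...1:N:0:ACGTACGT+TTGCAAGG`); the function names and the example records are invented for illustration and are not taken from the linked answer.

```python
from collections import Counter


def index_from_header(header):
    """Return the text after the last ':' of a FASTQ header (assumed to be the index)."""
    return header.strip().rsplit(":", 1)[-1]


def tally_indexes(fastq_lines):
    """Count index strings over the header lines (every 4th line) of a FASTQ stream."""
    counts = Counter()
    for line_number, line in enumerate(fastq_lines):
        if line_number % 4 == 0:  # header line of each 4-line record
            counts[index_from_header(line)] += 1
    return counts


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


# Made-up records, only to show the expected input shape.
example = [
    "@M1:1:FC:1:1101:1000:2000 1:N:0:ACGTACGT+TTGCAAGG", "ACGT", "+", "IIII",
    "@M1:1:FC:1:1101:1001:2001 1:N:0:ACGTACGT+TTGCAAGG", "TTGA", "+", "IIII",
    "@M1:1:FC:1:1101:1002:2002 1:N:0:GGGGGGGG+TTGCAAGG", "CCAT", "+", "IIII",
]
counts = tally_indexes(example)
for index, n in counts.most_common(10):
    i7, _, i5 = index.partition("+")
    # Print both orientations of i5 so they can be compared with the samplesheet.
    print(n, i7, i5, reverse_complement(i5))
```

For a real file, the same function can be fed an open handle, e.g. `gzip.open(path, "rt")` for a gzipped Undetermined FASTQ, where `path` is whatever your output file is called.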
I ran bcl2fastq without a SampleSheet and let all the reads go to undetermined. After parsing the index sequences for each read in the undetermined files, I found that 29% of reads can be assigned to at least one library (allowing 1 mismatch). We still do not know what the remaining ~70% of reads are. I personally suspect that a priming issue is causing the problem. If the adaptor primers do not prime at the right spot and the sequenced region is not the real index region, the whole system gets messed up.
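The kind of assignment described above (matching an observed index against the expected library indexes while allowing one mismatch) can be sketched as below. This is only an illustration under assumptions: the library names and index sequences are hypothetical, a single index is matched at a time, and ambiguous matches are left unassigned.

```python
def hamming(a, b):
    """Number of mismatching positions; unequal lengths count as fully mismatched."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def assign(observed, sample_indexes, max_mismatches=1):
    """Return the library whose index is within max_mismatches of the observed
    index, or None if no library (or more than one) matches."""
    hits = [
        name
        for name, index in sample_indexes.items()
        if hamming(observed, index) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical library indexes, not taken from the actual SampleSheet.
sample_indexes = {"libA": "ACGTACGT", "libB": "TTGCAAGG"}

observed_indexes = ["ACGTACGT", "ACGTACGA", "GGGGGGGG"]
assigned = [assign(index, sample_indexes) for index in observed_indexes]
fraction = sum(a is not None for a in assigned) / len(assigned)
print(assigned, f"{fraction:.0%} assigned")
```

Tallying the indexes that come back as `None` is a quick way to see whether the unassigned reads share one dominant sequence.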
Have you ever heard of 'Switching Mechanism At 5' End of RNA Template' (SMART)? It is used in the Takara SMARTer Stranded Total RNA-Seq Kit. It adds three Gs at the 5' end of the transcript and combines ligation and RT into one step. I think these three Gs are causing the trouble during sequencing.
OK. Sounds like your issue is on the experimental side rather than in the demultiplexing. Check whether the Takara kit instructions recommend specific steps (for sequencing and for data handling).