Question

bcl2fastq report the whole sequences without trimming index and adaptor

6.9 years ago

sckinta ▴ 730

We used bcl2fastq (v2.20.0.422) to convert paired-end RNA-seq base call (BCL) files to FASTQ files. When setting up the sequencing run, we set both read lengths to 51 and both index lengths to 8 (see the relevant part of RunInfo.xml below).

<Reads>
    <Read Number="1" NumCycles="51" IsIndexedRead="N"/>
    <Read Number="2" NumCycles="8" IsIndexedRead="Y"/>
    <Read Number="3" NumCycles="8" IsIndexedRead="Y"/>
    <Read Number="4" NumCycles="51" IsIndexedRead="N"/>
</Reads>

However, the reads were not assigned to the libraries listed in the SampleSheet as expected: 95% of reads were assigned to Undetermined. We noticed that the majority of the undetermined reads show the same sequence at the i7 index position (but that sequence is not a real i7 index). We suspect that this 'i7 sequence' may be part of the adaptor sequence that was mistakenly read as the index.

Here is my question: to find the adaptor and index sequence for each read, we want bcl2fastq to report the whole sequence without trimming or masking any index or adaptor. How can we change the bcl2fastq options to make that happen?

We tried running bcl2fastq without a SampleSheet. It generated 51 bp reads, all assigned to Undetermined, with no index sequence in the read names. We were expecting ~100 bp reads containing the index and adaptor sequences in the FASTQ files.

Thank you.

sequencing next-gen • 8.0k views

ADD COMMENT • link updated 6.9 years ago by drkennetz ▴ 560 • written 6.9 years ago by sckinta ▴ 730

Use the code in my answer here (Demultiplexing reads with index present in the labels) to figure out which index sequences are actually present in your data. Run it on the undetermined reads files from the initial run with the samplesheet, since that is where most reads ended up. Sometimes the index sequences in the samplesheet are entered as the reverse complement of what the sequencer reads, which causes the reads to be placed in the undetermined pool.

bcl2fastq does not trim the data by default (as long as you don't include the adapter sequences in the samplesheet). It only moves the index sequences into the fastq headers when it demultiplexes the data.
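The linked code is not reproduced in this thread. As a rough sketch of the same idea (not the code from that answer), the Python below tallies the index field in the headers of an Undetermined fastq file. It assumes the standard Illumina header layout, where the index (i7+i5 for dual indexing) is the last colon-separated field of each header line; the function name and the file name in the usage comment are made up for illustration.

```python
import gzip
from collections import Counter
from itertools import islice


def count_header_indexes(fastq_path, top=20):
    """Tally the index field (last ':'-separated field) of each FASTQ header."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as handle:
        # A FASTQ record is 4 lines; the header is the first line of each record.
        for header in islice(handle, 0, None, 4):
            counts[header.rstrip().rsplit(":", 1)[-1]] += 1
    return counts.most_common(top)


# Hypothetical usage (the file name is an assumption):
# for index, n in count_header_indexes("Undetermined_S0_L001_R1_001.fastq.gz"):
#     print(index, n)
```

Comparing the most frequent indexes against the samplesheet entries usually shows quickly whether the expected indexes are present at all.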

ADD REPLY • link 6.9 years ago by GenoMax 151k

Yeah, after checking the Illumina sequencing steps, I started to understand that bcl2fastq does not trim data. Each cycle corresponds to one position in either a read or an index. In each run, read 1 is sequenced first (51 cycles in our case), then index i7 (8 cycles), then index i5 (8 cycles) and finally read 2 (51 cycles). So if priming works perfectly, no adaptor sequence is sequenced; in other words, there is no 'whole sequence' to report.
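To make the cycle layout concrete, here is a small Python sketch that turns the Reads block of RunInfo.xml from the question into cycle ranges. The function name is made up for illustration, and it assumes the reads are sequenced in order of their Number attribute.

```python
import xml.etree.ElementTree as ET

# The <Reads> block quoted in the question.
READS = """<Reads>
    <Read Number="1" NumCycles="51" IsIndexedRead="N"/>
    <Read Number="2" NumCycles="8" IsIndexedRead="Y"/>
    <Read Number="3" NumCycles="8" IsIndexedRead="Y"/>
    <Read Number="4" NumCycles="51" IsIndexedRead="N"/>
</Reads>"""


def cycle_layout(reads_xml):
    """Return (read number, is index read, first cycle, last cycle) per <Read>."""
    reads = sorted(ET.fromstring(reads_xml).iter("Read"),
                   key=lambda r: int(r.get("Number")))
    layout, start = [], 1
    for read in reads:
        n_cycles = int(read.get("NumCycles"))
        layout.append((int(read.get("Number")),
                       read.get("IsIndexedRead") == "Y",
                       start,
                       start + n_cycles - 1))
        start += n_cycles
    return layout


for row in cycle_layout(READS):
    print(row)
# (1, False, 1, 51)
# (2, True, 52, 59)
# (3, True, 60, 67)
# (4, False, 68, 118)
```

So the run has 118 cycles in total, and only cycles 52 to 67 are ever interpreted as index bases.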

I ran bcl2fastq without a SampleSheet and let all reads go to Undetermined. After parsing the index sequences for each read in the undetermined files, I found that 29% of reads can be assigned to at least one library (allowing 1 mismatch). We still do not know what the remaining ~70% of reads are. I personally feel that a priming issue is causing the problem. If the adaptor primers do not prime at the right spot and the sequenced region is not the real index region, the whole system gets messed up.
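For reference, the kind of 1-mismatch assignment described above could look roughly like the Python sketch below. This is not the code that was actually used; the library names and index sequences are placeholders, and it assumes the mismatch limit applies to a single index read and that ambiguous hits are left unassigned.

```python
def hamming(a, b):
    """Number of mismatching positions; a length difference counts as mismatches."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def assign_library(observed, libraries, max_mismatches=1):
    """Return the one library whose index is within max_mismatches, else None."""
    hits = [name for name, index in libraries.items()
            if hamming(observed, index) <= max_mismatches]
    # Zero hits or more than one hit both leave the read unassigned.
    return hits[0] if len(hits) == 1 else None


# Placeholder libraries (assumed for illustration, not from the real SampleSheet).
libraries = {"libA": "ACGTACGT", "libB": "TTGCATGC"}
print(assign_library("ACGTACGA", libraries))  # libA (one mismatch)
print(assign_library("GGGGGGGG", libraries))  # None
```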

Have you ever heard of 'Switching Mechanism At 5' end of RNA Template'? It is used in the Takara SMARTer Stranded Total RNA-Seq Kit. It adds three Gs at the 5' end of the transcript and simplifies ligation and RT into one step. I think these three Gs cause the trouble during sequencing.

ADD REPLY • link 6.9 years ago by sckinta ▴ 730

Ok. Sounds like your issue is on the experimental side rather than in the demultiplexing. Check whether the Takara kit instructions recommend specific steps (for sequencing and for data handling).

ADD REPLY • link 6.9 years ago by GenoMax 151k

score 2 · Answer 1 · 2018-06-20

6.9 years ago

drkennetz ▴ 560

First step: take the adapter out of the SampleSheet so bcl2fastq doesn't try to trim adapters. Second step: run something like this from the command line:

$ bcl2fastq --create-fastq-for-index-reads -R /path/to/run/dir/ -o /path/to/outputdir/ -r 8 -w 8 -p 12

The "--create-fastq-for-index-reads" flag makes bcl2fastq write FASTQ files for the index reads as well.

ADD COMMENT • link 6.9 years ago by drkennetz ▴ 560

I don't think OP wants separate files for the index sequences. They seem to be trying to debug why reads ended up in the undetermined files.

ADD REPLY • link 6.9 years ago by GenoMax 151k

If your reads really were trimmed, then they really matched the adapters you put in the samplesheet. So you don't have to untrim to solve the mystery of what they are.

I second the recommendation to check to see if the reported indices are rev-comps of what you think they should be. I've had that happen a bunch of times, on projects where that step of the PCR is handled by submitters, and not the sequencing group.
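A quick way to run that check is sketched below in Python. The helper names and the example sequences are made up for illustration; the idea is just to compare each observed index with both the samplesheet entry and its reverse complement.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def orientation(observed, expected):
    """Say whether an observed index matches the samplesheet entry or its rev-comp."""
    if observed == expected:
        return "as listed"
    if observed == reverse_complement(expected):
        return "reverse complement"
    return "no match"


# Made-up example: the samplesheet lists AAACCCGG but the reads show CCGGGTTT.
print(orientation("CCGGGTTT", "AAACCCGG"))  # reverse complement
```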

ADD REPLY • link 6.9 years ago by swbarnes2 14k

I am not sure that is the issue. We prepared the libraries and sequenced them in house. We demultiplexed successfully using the same index sequence orientation on some other runs. I just recently changed the library prep kit from an Illumina kit to this Takara SMARTer® Stranded RNA-Seq Kit (with switching technology). It suddenly won't work any more.

ADD REPLY • link 6.9 years ago by sckinta ▴ 730

Yes, I tried that and was able to extract the index for each read. However, after parsing the index sequences with custom code, only 29% of reads can be assigned to a library.

ADD REPLY • link 6.9 years ago by sckinta ▴ 730