Hello,
I need your help with a parameter of the bcl2fastq2 tool used when demultiplexing data generated by Illumina sequencers. As you know, there are different ways to sequence genomic data, but the most common are Paired-End (PE) and Single-End (SE) sequencing. In addition, libraries are sequenced with either single or dual (double) indexing. As per Illumina's definition:
Single and Dual Indexing
The number of index sequences added to samples differs for single-indexed and dual-indexed sequencing.
Single-indexed libraries — Adds up to 48 unique six-base Index 1 (i7) sequences to generate up to 48 uniquely tagged libraries.
Dual-indexed libraries — Adds up to 24 unique eight-base Index 1 (i7) sequences and up to 16 unique eight-base Index 2 (i5) sequences, generating up to 384 uniquely tagged libraries. The IDT for Illumina TruSeq UD Indexes are provided as index pairs and can generate up to 96 uniquely tagged libraries. These indexes add up to 96 unique eight-base Index 1 sequences and up to 96 unique eight-base Index 2 indexes.
During indexed sequencing, the index is sequenced in a separate read, called the Index Read, where a new sequencing primer is annealed. When libraries are dual-indexed, the sequencing run includes two additional reads, called the Index 1 Read and Index 2 Read.
Knowing this, I have two questions:
- Is it acceptable to mix single-index and dual-index libraries on the same flowcell (e.g. HiSeq 4000), knowing that we configured the sequencer for a dual-index run?
- How can we demultiplex such data, since the file generated by the sequencer (RunInfo.xml) contains the configuration for a dual-index run? In other words, demultiplexing lanes that have dual indexes works fine when providing the RunInfo.xml, but for the single-index lanes, what should I use for the --use-bases-mask parameter?
Also, I know that for --use-bases-mask, we can use the following values for different types of sequencing:
- Single-End sequencing:
Y*,I6N*
- Ovation® SoLo RNA-Seq from NuGEN/Tecan only (see theodore's post below for more details):
Y*,I8Y*,Y* (Thanks to theodore)
- 10x Genomics Single Cell 3' RNA v2 kit + more standard libraries on the same run:
Y26n*,I8n*,Y* (Thanks to theodore)
- 10x Genomics Single Cell 3' RNA v3 and v3.1 kit + more standard libraries on the same run:
Y28n,I8n*,Y* (Thanks to theodore)
Paired-End sequencing:
- Dual-Indexing:
Y*,I*,I*,Y*
- No Index:
Y*,Y* (Thanks to Devon Ryan)
- Single Indexing:
Y*,I6N,Y* (Thanks to Devon Ryan)
- In-read barcode in the first read for some of the samples, but the run was PE dual-index:
I5Y*,N*,N*,Y* (Thanks to igor)
- 10x Genomics Single Cell 3' v1 kit:
Y98,Y14,I8,Y10 (Thanks to igor)
- 10x Genomics Single Cell 3' v1 kit + more standard libraries on the same run:
Y98N*,Y14N*,I8N*,Y10N* (Thanks to igor)
- 10x Genomics Single Cell ATAC kit + more standard libraries on the same run:
Y50,I8n*,Y16,Y49 (Thanks to theodore)
- Ovation® SoLo RNA-Seq from NuGEN/Tecan mixed (see theodore's post below for more details):
Y*,I8Y*,N*,Y* (Thanks to theodore)
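For future readers, here is a minimal Python sketch of how I understand these masks to be read: each comma-separated component applies to one read listed in RunInfo.xml, Y keeps a cycle, I treats it as index, N (or n) ignores it, a number repeats the preceding letter, and * extends it to the end of the read. The function names are my own (hypothetical, not part of bcl2fastq), and treating a length mismatch as an error is an assumption of this sketch, so please check the bcl2fastq2 documentation for the exact behaviour.

```python
import re

# One mask token: a letter (Y, N or I) optionally followed by a count or '*'.
_TOKEN = re.compile(r"([YNI])(\d+|\*)?")


def expand_read_mask(component, num_cycles):
    """Expand one mask component (e.g. 'I6nn') to one letter per cycle."""
    text = component.upper()
    expanded = ""
    pos = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected text in mask component: {text[pos:]!r}")
        letter, count = match.group(1), match.group(2)
        if count == "*":
            # '*' repeats the letter for whatever is left of the read.
            expanded += letter * max(num_cycles - len(expanded), 0)
        else:
            # No count means the letter applies to a single cycle.
            expanded += letter * int(count or 1)
        pos = match.end()
    if len(expanded) != num_cycles:
        # Assumption of this sketch: the mask must cover the read exactly.
        raise ValueError(
            f"mask {component!r} covers {len(expanded)} cycles, read has {num_cycles}"
        )
    return expanded


def expand_mask(mask, read_lengths):
    """Expand a full --use-bases-mask value against per-read cycle counts."""
    components = mask.split(",")
    if len(components) != len(read_lengths):
        raise ValueError("mask must have one component per read")
    return [expand_read_mask(c, n) for c, n in zip(components, read_lengths)]
```

For example, with made-up read lengths of 28, 8 and 91 cycles, `expand_mask("Y26n*,I8n*,Y*", [28, 8, 91])` keeps the first 26 cycles of read 1, ignores the last two, uses all 8 index cycles, and keeps the whole last read. An index component longer than the cycles actually sequenced raises an error in this sketch.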
Also, could you please state what other values could be used in different cases? (for future readers)
Thanks for your time and help. Don't forget to upvote this post please so users can find this post.
Dear Ryan,
Thanks for your reply. The single index is 6 bases long while the dual indexes are 8 bases long, and all indexes differ from one another. Let's take this RunInfo.xml as an example (uploaded to my Google Drive):
https://drive.google.com/open?id=1EJHnNuTyW8BfDLdE4yoBxp78rw8bYsHF
How can I proceed, knowing that, for example, lanes 5 and 6 contain the single-index data?
Thanks
--use-bases-mask Y*,I6nn,nnnnnnnn,Y* in that case.
badredda, you could use a separate --use-bases-mask for lanes 5 and 6 and then a different one for the other lanes.
I'm running into a problem like this one, could you help me?
My RunInfo.xml:
We normally use a 151x8x8x151 amplicon panel, but we added a single-indexed panel with a 12-base index. I tried --use-bases-mask Y*,I12,,Y* but I receive the error above:
I have also tried changing the RunInfo.xml index values to 12 as:
But my FASTQs were empty. Any help?
If you only ran 8 bases for the first index, that's all you've got. You can't invent data you don't have by futzing with the command line.
Was the single-indexed panel the only one on the flow cell, or was it mixed with normal-length indices? Was it actually 12 bases, or did you dual index it with 6-base indices? If the former is the case, then only the first 8 bases of the barcode were actually read and it's going to end up in the undetermined indices no matter what you do. You can write a bit of Python to retrieve it then.
Thank you guys! It was mixed with normal-length indices (8 bases), and it was 12 bases on one side. Should the Python script open the Undetermined FASTQ and search for the reads with the possible index in the header?
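A minimal sketch of what such a script could look like, in case it helps. It assumes the usual bcl2fastq header layout, where the index is the last colon-separated field (e.g. `1:N:0:ACGTACGT+TTGCAAGG`), and it only does an exact match on the Index 1 bases that were actually sequenced. The function names and file paths are hypothetical.

```python
import gzip


def index_from_header(header):
    """Return the index field of a FASTQ header, e.g. 'ACGTACGT+TTGCAAGG'."""
    return header.rstrip().rsplit(":", 1)[-1]


def reads_with_i7_prefix(lines, i7_prefix):
    """Yield 4-line FASTQ records whose Index 1 (i7) bases start with i7_prefix."""
    record = []
    for line in lines:
        record.append(line)
        if len(record) == 4:
            # For dual-index runs the field is 'i7+i5'; keep the i7 part only.
            i7 = index_from_header(record[0]).split("+")[0]
            if i7.startswith(i7_prefix):
                yield record
            record = []


def filter_fastq_gz(in_path, out_path, i7_prefix):
    """Copy matching records from one gzipped FASTQ to another."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for record in reads_with_i7_prefix(src, i7_prefix):
            dst.writelines(record)
```

You would call `filter_fastq_gz` once per sample and per read file (R1 and R2), passing the first 8 bases of that sample's 12-base index. This only works if those 8 bases are unique among the samples on the lane, and it does not tolerate mismatches.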
As @swbarnes2 pointed out above, looking at your RunInfo.xml file, this run was set up as 151x8x8x151, i.e. with 8 cycles on Index 1 and 8 cycles on Index 2. There is NO way to recover data for 12 cycles for Index 1 since those additional 4 cycles were never sequenced.
If the 8 bp from Index 1 that were sequenced are discriminatory enough, you may be able to recover the data; otherwise this run will have to be repeated for the samples with 12 bp indexes.
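To check what was actually sequenced before choosing a mask, the read layout can be pulled out of RunInfo.xml. Below is a minimal sketch using Python's standard library; it assumes the usual `<Read Number="…" NumCycles="…" IsIndexedRead="…"/>` elements, and the XML in the example is a made-up fragment for illustration, not the poster's file. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only; a real RunInfo.xml has more elements.
EXAMPLE_RUNINFO = """
<RunInfo>
  <Run Id="example_run" Number="1">
    <Reads>
      <Read Number="1" NumCycles="151" IsIndexedRead="N" />
      <Read Number="2" NumCycles="8" IsIndexedRead="Y" />
      <Read Number="3" NumCycles="8" IsIndexedRead="Y" />
      <Read Number="4" NumCycles="151" IsIndexedRead="N" />
    </Reads>
  </Run>
</RunInfo>
"""


def read_layout(runinfo_xml):
    """Return (read number, cycles, is_index) tuples sorted by read number."""
    root = ET.fromstring(runinfo_xml)
    layout = [
        (
            int(read.get("Number")),
            int(read.get("NumCycles")),
            read.get("IsIndexedRead") == "Y",
        )
        for read in root.iter("Read")
    ]
    return sorted(layout)


def describe(layout):
    """Format the layout in the '151x8x8x151' style used in this thread."""
    return "x".join(str(cycles) for _, cycles, _ in layout)


def index_cycles(layout):
    """Return the number of cycles of each index read, in order."""
    return [cycles for _, cycles, is_index in layout if is_index]
```

If `index_cycles(...)` reports 8 for Index 1, a 12-base index in the sample sheet has to be trimmed to the 8 bases that were read; editing RunInfo.xml does not add the missing cycles.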
Thanks, genomax. The 8 bp from Index 1 were specific enough to recover them, so I just adjusted the sample sheet used. Best regards.
For SE reads with a single index (NextSeq500), should --use-bases-mask Y*,I6 be used?