bcl2fastq2: how to correctly use --use-bases-mask for different Illumina sequencing methods?
5
3
6.2 years ago
▴ 240

Hello,

I need your help with the --use-bases-mask parameter of the bcl2fastq2 tool when demultiplexing data generated by Illumina sequencers. As you know, libraries are mostly sequenced as Paired-End (PE) or Single-End (SE), and they carry either a single index or dual indexes. As per Illumina's definition:

Single and Dual Indexing

The number of index sequences added to samples differs for single-indexed and dual-indexed sequencing.

Single-indexed libraries — Adds up to 48 unique six-base Index 1 (i7) sequences to generate up to 48 uniquely tagged libraries.

Dual-indexed libraries — Adds up to 24 unique eight-base Index 1 (i7) sequences and up to 16 unique eight-base Index 2 (i5) sequences, generating up to 384 uniquely tagged libraries. The IDT for Illumina TruSeq UD Indexes are provided as index pairs and can generate up to 96 uniquely tagged libraries. These indexes add up to 96 unique eight-base Index 1 sequences and up to 96 unique eight-base Index 2 indexes.

During indexed sequencing, the index is sequenced in a separate read, called the Index Read, where a new sequencing primer is annealed. When libraries are dual-indexed, the sequencing run includes two additional reads, called the Index 1 Read and Index 2 Read.

Knowing this, I have two questions:

  1. Is it acceptable to mix single-index and dual-index libraries on the same flowcell (e.g. HiSeq 4000), knowing that we configured the sequencer as a dual-index run?
  2. How can we demultiplex such data, since the file generated by the sequencer (RunInfo.xml) describes a dual-index run? In other words, demultiplexing lanes that have dual indexes works fine when providing the RunInfo.xml, but what should I use for the --use-bases-mask parameter on single-index lanes?

Also, I know that for --use-bases-mask, we can use the following parameters for different types of sequencing:

  • Single-End sequencing: Y*,I6N*
    • Ovation® SoLo RNA-Seq from NuGEN/Tecan only (see theodore's post below for more details): Y*,I8Y*,Y* (Thanks to theodore)
    • 10x Genomics Single Cell 3' RNA v2 kit + more standard libraries on the same run: Y26n*,I8n*,Y* (Thanks to theodore)
    • 10x Genomics Single Cell 3' RNA v3 and v3.1 kit + more standard libraries on the same run: Y28n,I8n*,Y* (Thanks to theodore)
  • Paired-End sequencing:
    • Dual-Indexing: Y*,I*,I*,Y*
    • No Index: Y*,Y* (Thanks to Devon Ryan)
    • Single Indexing: Y*,I6N,Y* (Thanks to Devon Ryan)
    • In-read barcode in the first read for some of the samples, but the run was PE dual-index: I5Y*,N*,N*,Y* (Thanks to igor)
    • 10x Genomics Single Cell 3' v1 kit: Y98,Y14,I8,Y10 (Thanks to igor)
    • 10x Genomics Single Cell 3' v1 kit + more standard libraries on the same run: Y98N*,Y14N*,I8N*,Y10N* (Thanks to igor)
    • 10x Genomics Single Cell ATAC kit + more standard libraries on the same run: Y50,I8n*,Y16,Y49 (Thanks to theodore)
    • Ovation® SoLo RNA-Seq from NuGEN/Tecan mixed (see theodore's post below for more details): Y*,I8Y*,N*,Y* (Thanks to theodore)

    Also, could you please state what other masks could be used in different cases? (for future readers)

Thanks for your time and help. Don't forget to upvote this post please so users can find this post.

bcl2fastq2 --use-bases-mask sequencing illumina • 20k views
ADD COMMENT
2
6.2 years ago
  1. Yes, though bcl2fastq2 won't be able to handle it in a single step. We commonly do this and we then process each flow cell in compatible chunks, using --tiles. As an example, if the first two lanes of a flow cell have compatible indices (both in number and length) then you need --tiles s_1,s_2. You then also need multiple output directories per flow cell.
  2. See above. In short, you use one --use-bases-mask at a time.

Note that unless you have a mixture of either barcode lengths between lanes or barcode strategies (dual vs. single) you don't actually need --use-bases-mask at all.

For PE with no index you could use --use-bases-mask Y*,Y*, unless the run included an index read. For a single index it would then be Y*,I6N,Y*.
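
The chunked approach described above can be sketched in Python as one bcl2fastq call per group of compatible lanes. This is only an illustration, not a tested pipeline: the lane grouping, paths, sample sheet names and the dual-index mask are made-up assumptions, and the function only builds argument lists rather than running anything.

```python
from pathlib import Path


def build_bcl2fastq_commands(run_dir, out_root, chunks):
    """Return one bcl2fastq argument list per chunk of compatible lanes.

    `chunks` maps a chunk name to (lanes, bases_mask, sample_sheet).
    """
    commands = []
    for name, (lanes, mask, sheet) in chunks.items():
        # --tiles takes a comma-separated list such as s_1,s_2
        tiles = ",".join(f"s_{lane}" for lane in lanes)
        commands.append([
            "bcl2fastq",
            "--runfolder-dir", str(run_dir),
            "--output-dir", str(Path(out_root) / name),  # one output directory per chunk
            "--sample-sheet", str(sheet),
            "--tiles", tiles,
            "--use-bases-mask", mask,
        ])
    return commands


# Hypothetical layout: lanes 5-6 single-indexed (6 bp), all other lanes dual-indexed (8+8 bp).
chunks = {
    "dual_index": ([1, 2, 3, 4, 7, 8], "Y*,I8,I8,Y*", "SampleSheet_dual.csv"),
    "single_index": ([5, 6], "Y*,I6nn,nnnnnnnn,Y*", "SampleSheet_single.csv"),
}
commands = build_bcl2fastq_commands("/path/to/run", "/path/to/fastq", chunks)
for cmd in commands:
    print(" ".join(cmd))
```

Each list could then be passed to subprocess.run(cmd, check=True) once the paths and sample sheets are real.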

ADD COMMENT
0

Dear Ryan,

Thanks for your reply. The single index is 6 bp long while the dual indexes are 8 bp each, and all indexes are different from one another. Let's take this RunInfo.xml as an example (uploaded on my Google Drive):

https://drive.google.com/open?id=1EJHnNuTyW8BfDLdE4yoBxp78rw8bYsHF

How can I proceed, knowing that, for example, lanes 5 and 6 hold the single-index data?

Thanks

ADD REPLY
1

--use-bases-mask Y*,I6nn,nnnnnnnn,Y* in that case.
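
For readers wondering where that mask comes from, here is a small Python sketch that derives it from the index cycle counts. It assumes a paired-end run with two index reads (8+8 cycles here) and a library that only uses Index 1; the helper name is made up for illustration.

```python
def single_index_mask(index_cycles, index_length):
    """Mask for a single-index library sequenced on a paired-end, dual-index run.

    index_cycles: cycles sequenced for (Index 1, Index 2), e.g. (8, 8)
    index_length: length of the library's single (i7) index, e.g. 6
    """
    i1_cycles, i2_cycles = index_cycles
    if index_length > i1_cycles:
        raise ValueError("index is longer than the Index 1 read")
    # Use the first index_length cycles of Index 1, ignore the rest with n
    index1 = f"I{index_length}" + "n" * (i1_cycles - index_length)
    # Ignore every cycle of Index 2
    index2 = "n" * i2_cycles
    return f"Y*,{index1},{index2},Y*"


print(single_index_mask((8, 8), 6))  # Y*,I6nn,nnnnnnnn,Y*
```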

ADD REPLY
1

badredda, you could use a separate --use-bases-mask for lanes 5 and 6 and then a different one for the other lanes.

ADD REPLY
1

I'm running into a similar problem. Could you help me?

my RunInfo.xml:

<?xml version="1.0"?>
<RunInfo xmlns:xsd="..." xmlns:xsi="..." Version="4">
  <Run Id="190219_NB500954_0035_AHGMJVAFXY" Number="35">
    <Flowcell>HGMJVAFXY</Flowcell>
    <Instrument>NB500954</Instrument>
    <Date>190219</Date>
    <Reads>
      <Read Number="1" NumCycles="151" IsIndexedRead="N" />
      <Read Number="2" NumCycles="8" IsIndexedRead="Y" />
      <Read Number="3" NumCycles="8" IsIndexedRead="Y" />
      <Read Number="4" NumCycles="151" IsIndexedRead="N" />
    </Reads>
    <FlowcellLayout LaneCount="4" SurfaceCount="2" SwathCount="1" TileCount="12" SectionPerLane="3" LanePerSection="2">
      <TileSet TileNamingConvention="FiveDigit">
        <Tiles>
          <Tile>1_11101</Tile>
          <Tile>1_21101</Tile>
          <Tile>1_11102</Tile>
          ...
          <Tile>4_11612</Tile>
          <Tile>4_21612</Tile>
        </Tiles>
      </TileSet>
    </FlowcellLayout>
    <ImageDimensions Width="2592" Height="1944" />
    <ImageChannels>
      <Name>Red</Name>
      <Name>Green</Name>
    </ImageChannels>
  </Run>
</RunInfo>

We normally use a 151x8x8x151 amplicon panel, but we added a single-indexed panel with a 12-base index. I tried --use-bases-mask Y*,I12,,Y* but I get the error below:

2019-02-25 21:45:12 [7faca61f4780] ERROR: bcl2fastq::common::Exception: 2019-Feb-25 21:45:12: Success (0): /tmp/bcl2fastq/bcl2fastq/src/cxx/lib/layout/Layout.cpp(378): Throw in function void bcl2fastq::layout::setIndexReadMetadata(const std::vector<long unsigned int>&, bcl2fastq::layout::ReadMetadata&, size_t)
Dynamic exception type: boost::exception_detail::clone_impl<bcl2fastq::common::InputDataError>
std::exception::what: Barcodes in sample sheet are longer than the index length found in RunInfo.xml.

I then tried changing the index read lengths in RunInfo.xml to 12:

<Read Number="2" NumCycles="12" IsIndexedRead="Y" />
<Read Number="3" NumCycles="12" IsIndexedRead="Y" />

But my FASTQ files came out empty. Any help?

ADD REPLY
4

If you only ran 8 bases for the first index, that's all you've got. You can't invent data you don't have by futzing with the command line.

ADD REPLY
1

Was the single-indexed panel the only one on the flow cell, or was it mixed with normal-length indices? Was it actually 12 bases, or did you dual index it with 6-base indices? If the former is the case, then only the first 8 bases of the barcode were actually read and it's going to end up in the undetermined indices no matter what you do. You can write a bit of Python to retrieve it then.

ADD REPLY
0

Thank you guys! It was mixed with normal-length indices (8 bases), and it was 12 bases on one side. So the Python script should open the Undetermined FASTQ and search for the reads with the expected index in the header?
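
A minimal sketch of that idea, assuming the usual bcl2fastq header layout where the last colon-separated field holds the index (or i7+i5 for dual-index runs), and assuming exact matching on the first 8 sequenced bases with no mismatches allowed. The records and the 12-base index below are made up for illustration.

```python
import io


def find_reads_by_index(fastq_handle, expected_i7, sequenced_cycles=8):
    """Yield (header, seq, plus, qual) for reads whose i7 matches the sequenced prefix."""
    # Only the first `sequenced_cycles` bases of the index were actually read
    prefix = expected_i7[:sequenced_cycles]
    while True:
        record = [fastq_handle.readline().rstrip("\n") for _ in range(4)]
        header = record[0]
        if not header:  # end of file
            break
        # Header ends with ":<index>" or ":<i7>+<i5>"
        index_field = header.rsplit(":", 1)[-1]
        i7 = index_field.split("+")[0]
        if i7 == prefix:
            yield tuple(record)


# Made-up records standing in for an Undetermined FASTQ file.
demo = io.StringIO(
    "@NB500954:35:HGMJVAFXY:1:11101:1000:1000 1:N:0:ACGTACGT+TTTTTTTT\n"
    "GATTACA\n+\nFFFFFFF\n"
    "@NB500954:35:HGMJVAFXY:1:11101:1001:1001 1:N:0:GGGGGGGG+TTTTTTTT\n"
    "CATCATC\n+\nFFFFFFF\n"
)
hits = list(find_reads_by_index(demo, "ACGTACGTAAAA"))
print(len(hits), hits[0][1])
```

For a real gzipped file, the handle could come from gzip.open(path, "rt") instead of io.StringIO.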

ADD REPLY
1

As @swbarnes2 pointed out above, your RunInfo.xml file shows that this run was set up as 151x8x8x151.

<Reads>
  <Read Number="1" NumCycles="151" IsIndexedRead="N" />
  <Read Number="2" NumCycles="8" IsIndexedRead="Y" />
  <Read Number="3" NumCycles="8" IsIndexedRead="Y" />
  <Read Number="4" NumCycles="151" IsIndexedRead="N" />
</Reads>

i.e. with 8 cycles on Index 1 and 8 cycles on Index 2. There is NO way to recover data for 12 cycles of Index 1, since those additional 4 cycles were never sequenced.

If 8 bp from Index 1 that were sequenced are discriminatory enough you may be able to recover data but otherwise this run will have to be repeated for the samples with 12 bp indexes.
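
To check this kind of thing before demultiplexing, a short Python sketch can read the cycle counts out of RunInfo.xml and flag barcodes longer than what was sequenced. The XML below is a trimmed copy of the Reads section quoted in this thread; the sample names and barcode sequences are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <Reads> section quoted in this thread.
RUNINFO = """<RunInfo>
  <Run Id="190219_NB500954_0035_AHGMJVAFXY" Number="35">
    <Reads>
      <Read Number="1" NumCycles="151" IsIndexedRead="N" />
      <Read Number="2" NumCycles="8" IsIndexedRead="Y" />
      <Read Number="3" NumCycles="8" IsIndexedRead="Y" />
      <Read Number="4" NumCycles="151" IsIndexedRead="N" />
    </Reads>
  </Run>
</RunInfo>"""


def run_layout(runinfo_xml):
    """Return [(num_cycles, is_index), ...] in read order."""
    root = ET.fromstring(runinfo_xml)
    reads = sorted(root.iter("Read"), key=lambda r: int(r.get("Number")))
    return [(int(r.get("NumCycles")), r.get("IsIndexedRead") == "Y") for r in reads]


def barcodes_too_long(runinfo_xml, barcodes):
    """Return (sample, barcode, cycles) for barcodes longer than their index read."""
    index_cycles = [n for n, is_index in run_layout(runinfo_xml) if is_index]
    problems = []
    for sample, indexes in barcodes.items():
        for seq, cycles in zip(indexes, index_cycles):
            if len(seq) > cycles:
                problems.append((sample, seq, cycles))
    return problems


layout = run_layout(RUNINFO)
print("x".join(str(n) for n, _ in layout))  # 151x8x8x151

# Hypothetical sample sheet content: one 8+8 sample and one 12-base single index.
barcodes = {"amplicon_01": ("ACGTACGT", "TGCATGCA"), "panel12_01": ("ACGTACGTACGT",)}
print(barcodes_too_long(RUNINFO, barcodes))
```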

ADD REPLY
0

Thanks, genomax. The 8 bp from Index 1 were specific enough to recover them, so I just adjusted the sample sheet used. Best regards.

ADD REPLY
0

For SE reads with a single index (NextSeq 500), should --use-bases-mask Y*,I6 be used?

ADD REPLY
3
6.2 years ago

Is it acceptable to mix single-index and dual-index libraries on the same flowcell (e.g. HiSeq 4000), knowing that we configured the sequencer as a dual-index run?

Yes. I do this all the time, without messing with base masking or subsetting by lane/tile.

Did you try it the easy way first?

ADD COMMENT
0

Does that work now? It used to break bcl2fastq2.

ADD REPLY
0

I frequently have a mix of samples on one flow cell, some with two indices, some with one. I used to break up into two sample sheets, but I don't now, and it works fine. I can't remember testing having indices of differing lengths, but I think that will work too.

ADD REPLY
0

Interesting. I wonder when Illumina enabled this; it would seriously simplify my demultiplexing workflow :)

ADD REPLY
2
6.2 years ago
igor 13k

could you please state what other types of parameters could be used in different cases?

Any mask is possible. The parameter specifies how you want to interpret the actual sequencing output. You have to make sure that the number of reads and their lengths match what was run.

You can use different --use-bases-mask for different lanes or just provide a sample sheet for the lanes that you are interested in.

There are many odd library options. For a hypothetical example, you may have an in-read barcode in the first read for some of the samples, but the run was PE dual-index. Then you might have: I5Y*,N*,N*,Y* (treat first 5 bases of R1 as index and the rest as actual read, ignore I1 and I2, then treat R2 as normal read).

For a real-life example, the 10x Genomics Single Cell 3' v1 kit required this: Y98,Y14,I8,Y10. This used the second index read as the bcl2fastq index, but kept the other reads for additional processing with more specialized software (Cell Ranger). If you had other more standard libraries on the same run, you would need to add Ns to ignore additional bases: Y98N*,Y14N*,I8N*,Y10N*.
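
To make the mask notation concrete, here is a rough Python sketch that expands a mask against the cycle counts of a run and complains when they do not fit. It reflects my reading of the syntax used in this thread (Y = read, I = index, N = ignore, an optional count after each letter, * = the rest of the read) and assumes the mask has to account for every cycle. It is not how bcl2fastq itself parses the option.

```python
import re

# One mask character, optionally followed by a count or by * (rest of the read)
TOKEN = re.compile(r"([YINyin])(\*|\d+)?")


def expand_mask(mask, read_cycles):
    """Expand e.g. 'Y*,I6nn' into one string of Y/I/N per cycle for each read."""
    segments = mask.split(",")
    if len(segments) != len(read_cycles):
        raise ValueError(f"mask has {len(segments)} reads, run has {len(read_cycles)}")
    expanded = []
    for segment, cycles in zip(segments, read_cycles):
        tokens = TOKEN.findall(segment)
        if not tokens or "".join(c + n for c, n in tokens) != segment:
            raise ValueError(f"cannot parse mask segment {segment!r}")
        fixed = sum((int(n) if n else 1) for _, n in tokens if n != "*")
        stars = sum(1 for _, n in tokens if n == "*")
        if stars > 1 or fixed > cycles or (stars == 0 and fixed != cycles):
            raise ValueError(f"segment {segment!r} does not fit {cycles} cycles")
        out = ""
        for char, n in tokens:
            count = cycles - fixed if n == "*" else int(n or 1)
            out += char.upper() * count
        expanded.append(out)
    return expanded


# Index reads of a 151x8x8x151 run with a 6-base single index
print(expand_mask("Y*,I6nn,nnnnnnnn,Y*", [151, 8, 8, 151])[1:3])
# ['IIIIIINN', 'NNNNNNNN']
```

With this check, a mask such as Y*,I12,N*,Y* raises an error on an 8-cycle index read, which mirrors the point that the mask cannot describe cycles that were never sequenced.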

ADD COMMENT
1
4.7 years ago
theodore ▴ 90

After a lot of trial and error I have another exotic option for the community: settings for the Ovation® SoLo RNA-Seq kit from NuGEN/Tecan. If the lane is completely filled with samples from this library, use:

Y*,I8Y*,Y*

If the library is mixed with other libraries:

Y*,I8Y*,N*,Y*

The corresponding RunInfo.xml for those settings:

A. Lane filled with this library only:

<Reads>
  <Read Number="1" NumCycles="99" IsIndexedRead="N"/>
  <Read Number="2" NumCycles="16" IsIndexedRead="Y"/>
  <Read Number="4" NumCycles="98" IsIndexedRead="N"/>
</Reads>

B. Mixed with other library types (in the RunInfo.xml, the reads/index layout is 101+16+anything+101):

<Reads>
  <Read Number="1" NumCycles="99" IsIndexedRead="N"/>
  <Read Number="2" NumCycles="16" IsIndexedRead="Y"/>
  <Read Number="3" NumCycles="10" IsIndexedRead="Y"/>
  <Read Number="4" NumCycles="98" IsIndexedRead="N"/>
</Reads>

Please comment if you find something wrong, although it works for our library preparation.

ADD COMMENT
0

The user guide explains the demultiplexing and analysis protocol in more depth: https://www.nugen.com/sites/default/files/M01406_v4_User_Guide%3A_Ovation_SoLo_RNA-Seq_System_1287.pdf

ADD REPLY
0

What I am providing is a quick answer to the manual's "For parsing Ovation SoLo libraries with other Illumina sequencers, please contact NuGEN Technical Support", since the manual only states: "Run bcl2fastq2. Use the “--use-bases-mask Y*,I8Y*” option to generate an index fastq file along with the forward read (for paired end reads use “--use-bases-mask Y*,I8Y*,Y*”)." That applies only if the flowcell is filled with samples prepared with the same library.

ADD REPLY
1
4.7 years ago
theodore ▴ 90

And to add some more:

10x Genomics Single Cell 3' RNA v2 kit + more standard libraries on the same run

BASES_MASK="Y26n*,I8n*,Y*"

10x Genomics Single Cell 3' RNA v3 and v3.1 kit + more standard libraries on the same run

BASES_MASK="Y28n,I8n*,Y*"

10x Genomics Single Cell ATAC kit + more standard libraries on the same run

BASES_MASK="Y50,I8n*,Y16,Y49"

Hope it helps.

ADD COMMENT
1

What does "standard libraries" mean?

ADD REPLY
0

Any regular paired-end library, for example, which would have the pattern Y101,I8,I8,Y101. The I8n* is for cases where the adapter of the other libraries is 10 nt long, so it will exclude the following 2 nt.

ADD REPLY
0

I don't think you can call any one protocol "standard" or "regular". You can have single-read or paired-end libraries as well as single or dual index reads. Any combination of those two metrics would be considered standard by most people.

ADD REPLY
0

You could call the protocols and kits that Illumina has provided in the past "standard" or "regular". Those include single-end, paired-end, and single- or dual-indexed libraries, and they have been in use for more than a decade. If you have better wording, let me know. If you think my additions offer nothing useful, again let me know and I will be happy to delete both entries.

ADD REPLY
