Question

Can index hopping lead to more reads in samples?

0

15 months ago

Assa Yeroslaviz ★ 1.9k

We ran multiple samples for sequencing on an Illumina NovaSeq machine. After converting the files to fastq format using bcl2fastq, we can see that we have some trouble with index hopping.

The image attached here shows the structure of the indices, how they are supposed to be and how we can see them after the conversion.

The color coding at the top of the image marks the four samples in question (names are in the first column on the left). The right column shows how the barcodes were supposed to be paired. The Top Unknown Barcodes section at the bottom of the image shows the combinations the conversion tool actually found.

[Image: index hopping]

Interestingly, samples 6-2 and 8-2 show the highest numbers of reads in the complete data set (which contains 30 samples), with around 20M reads, while samples 1-1 and 3-1 are both at the bottom of the list with the lowest numbers of assigned reads.

My question is whether these two results are connected. As far as I understand, if the two barcodes are not identified, the read is automatically classified as Unknown. But is it possible that reads from, e.g., sample 3-1 were assigned to sample 6-2 by mistake, or that reads from sample 1-1 were saved under sample 8-2?
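To make the assignment logic described above concrete, here is a minimal sketch of dual-index demultiplexing. It is not bcl2fastq's actual implementation; the function names, sample names and index sequences are all made up for illustration. It shows that a hopped i7/i5 pair ends up unassigned when every index is used only once, but can land in another sample when indexes are shared between samples.

```python
from typing import Dict, Tuple


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_read(i7: str, i5: str,
                sheet: Dict[str, Tuple[str, str]],
                max_mismatches: int = 0) -> str:
    """Return the sample whose (i7, i5) pair matches, else 'Undetermined'."""
    hits = [name for name, (s7, s5) in sheet.items()
            if hamming(i7, s7) <= max_mismatches
            and hamming(i5, s5) <= max_mismatches]
    # Reads matching no sample (or more than one) are not assigned.
    return hits[0] if len(hits) == 1 else "Undetermined"


# Hypothetical sheet with unique dual indexes: each i7 and each i5 used once.
unique_dual = {
    "sample_A": ("AAAAAAAA", "CCCCCCCC"),
    "sample_B": ("GGGGGGGG", "TTTTTTTT"),
}
# Hypothetical sheet where sample_C reuses sample_A's i7 and sample_B's i5.
combinatorial = dict(unique_dual, sample_C=("AAAAAAAA", "TTTTTTTT"))

# A hopped read: sample_A's i7 together with sample_B's i5.
hopped = ("AAAAAAAA", "TTTTTTTT")
print(assign_read(*hopped, unique_dual))    # Undetermined
print(assign_read(*hopped, combinatorial))  # sample_C
```

Under these assumptions, reads can only be misassigned to another sample if the hopped combination is itself a valid pair in the sample sheet.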

To me it seems too much of a coincidence that the samples with the highest and the lowest numbers of reads are all connected through the barcode swapping event.

Any advice would be appreciated.

cross-posted here, but got no response

index-hopping NovaSeq illumina NGS Sequencing • 2.2k views

ADD COMMENT • link 15 months ago by Assa Yeroslaviz ★ 1.9k

1

What is the % of these index hopped reads compared to demultiplexed data? Was this run borderline overloaded?

If you are able to, I suggest that you demultiplex the data using Illumina bcl-convert instead of bcl2fastq. It produces an explicit report for index hopping.

ADD REPLY • link 15 months ago by GenoMax 152k

0

I will try bcl-convert if I can just find out how to install it on Ubuntu :-)

ADD REPLY • link 15 months ago by Assa Yeroslaviz ★ 1.9k

0

I managed to install the tool. When running it I get the following warnings:

WARNING: Insufficient hamming distance in the i7 index sequences to identify index hopped reads in lane 1. Index hopping report is only produced for unique dual indexes.
WARNING: Insufficient hamming distance in the i7 index sequences to identify index hopped reads in lane 2. Index hopping report is only produced for unique dual indexes.

and after the run has finished, the file Index_Hopping_Counts.csv is empty.

Any idea what it is or how to change it, if possible?

ADD REPLY • link 15 months ago by Assa Yeroslaviz ★ 1.9k

0

Looks like your i7 indexes are not diverse enough. If you allowed 1 mismatch in the indexes during demultiplexing, try allowing only perfect matches with the settings below in your sample sheet.

[Settings],,,,,,,,,
BarcodeMismatchesIndex1,0,,,,,,,,
BarcodeMismatchesIndex2,0,,,,,,,,
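The warning mentions two things: the Hamming distance between i7 sequences, and the requirement for unique dual indexes. A rough way to check both on your own sample sheet is sketched below. The index sequences are made up, and the exact distance threshold bcl-convert applies is not stated in this thread, so the sketch only reports the minimum distance and any reused indexes.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(indexes):
    """Smallest Hamming distance between any two distinct index sequences."""
    unique = sorted(set(indexes))
    if len(unique) < 2:
        raise ValueError("need at least two distinct index sequences")
    return min(hamming(a, b) for a, b in combinations(unique, 2))


def reused_indexes(indexes):
    """Index sequences that occur for more than one sample."""
    return sorted(seq for seq, n in Counter(indexes).items() if n > 1)


# Hypothetical i7 column of a sample sheet (one entry per sample).
i7 = ["ATCACGTT", "ATCACGTA", "CGATGTTT", "CGATGTTT"]

print(min_pairwise_distance(i7))  # 1
print(reused_indexes(i7))         # ['CGATGTTT']
```

If any i7 (or i5) is reused across samples, the design is not unique dual indexing, which by the wording of the warning would explain an empty report regardless of the mismatch settings.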

ADD REPLY • link 15 months ago by GenoMax 152k

0

Yes, I tried this as well. The warning still shows up and the file is still empty.

ADD REPLY • link 15 months ago by Assa Yeroslaviz ★ 1.9k

score 0 · Answer 1 · 2024-04-03

0

15 months ago

inedraylig ▴ 70

It's not clear to me whether the samples were already demultiplexed. If they were, it's not clear to me how the software decides on assigning reads, but very likely the decision is based on both indexes and not on one. Some demultiplexers, like deML, are more transparent in their reports, so you can try one of those.

Are the results connected? Since 8-2 has the highest number of reads, one would expect the index used for this sample to be the most abundant overall, including among the indexes that hopped. So yes, deeply sequenced samples mean deeply sequenced indexes, and that is not a coincidence. As for the samples with low numbers of reads, losing 50% of the sequencing to index hopping is far above what has been reported so far. However, if free adapters were not completely removed during library preparation, this could lead both to unusual index combinations and to low sequencing depth, which may explain what you're seeing.

If your samples were demultiplexed, you can query your fastq or sam files for the index sequences and count the number of reads corresponding to each index combination, giving you an idea about reads that were not well assigned. Otherwise, you can always try a different demultiplexing approach.
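The counting suggested above could look roughly like this for fastq files. This is a sketch that assumes standard four-line fastq records and Illumina-style headers whose last colon-separated field is the index pair (for example `ACGTACGT+TGCATGCA`). The function name, the demo records and the `expected` set are made up for illustration.

```python
import gzip
import os
import tempfile
from collections import Counter


def index_pair_counts(fastq_path):
    """Count reads per index string found in the fastq header lines."""
    counts = Counter()
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # Use the record position rather than a leading '@', because
            # quality lines may also start with '@'.
            if line_number % 4 == 0:
                counts[line.rstrip().rsplit(":", 1)[-1]] += 1
    return counts


# Tiny made-up fastq file: two reads with one index pair, one with another.
demo_records = (
    "@M1:1:FC:1:1101:1:1 1:N:0:AAAAAAAA+CCCCCCCC\nACGT\n+\nIIII\n"
    "@M1:1:FC:1:1101:1:2 1:N:0:AAAAAAAA+CCCCCCCC\nACGT\n+\n@III\n"
    "@M1:1:FC:1:1101:1:3 1:N:0:AAAAAAAA+TTTTTTTT\nACGT\n+\nIIII\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write(demo_records)
    demo_path = tmp.name

counts = index_pair_counts(demo_path)
os.remove(demo_path)

# Hypothetical set of index pairs expected for this sample.
expected = {"AAAAAAAA+CCCCCCCC"}
unexpected = sum(n for pair, n in counts.items() if pair not in expected)

for pair, n in counts.most_common():
    print(pair, n)
print(f"unexpected: {100 * unexpected / sum(counts.values()):.1f}%")
```

Running the same count on the Undetermined fastq file and comparing it with the per-sample totals gives the percentage of unexpected index combinations relative to the demultiplexed data.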

ADD COMMENT • link 15 months ago by inedraylig ▴ 70

0

Yes, the samples were demultiplexed using bcl2fastq (GenoMax is correct). From what I understand of how the tool works, it assigns each read to a specific sample based on the barcodes (i5 and i7) ligated to it at the library preparation step of the Illumina procedure. Unfortunately the fastq file doesn't contain any barcodes anymore, as they were discarded during the demultiplexing. I don't think bcl2fastq has the option to keep the barcodes in the read.

Samples 8-2 and 6-2 show the highest numbers of reads, but looking at the barcodes in the fastq headers, they also have the correct barcodes. So this is probably not the source of the index hopping. The question is whether reads from the two lowest samples (3-1 and 1-1) could somehow have been attached to the two other samples, so that they were effectively taken from the low samples, while still maintaining the correct barcode structure. Something like this:

L1<8-1.1><3-1.2>R1 
L1<3-1.2><8-2.1>R1

L1 and R1 are the barcodes for sample 8-1, but they enclose reads from both samples. Can something like that happen?

ADD REPLY • link 15 months ago by Assa Yeroslaviz ★ 1.9k

0

Unfortunately the fastq file doesn't contain any barcodes anymore, as they were discarded during the demultiplexing.

That is not correct. During demultiplexing the indexes are transferred to the fastq headers of demultiplexed data.

L1 and R1 are the barcodes for sample 8-1, but they enclose reads from both samples. Can something like that happen?

I am not sure how you can say that they enclose reads from both samples. You won't know which sample a read belongs to until after it has been demultiplexed.

Have you seen: https://www.illumina.com/content/dam/illumina-marketing/documents/products/whitepapers/index-hopping-white-paper-770-2017-004.pdf

ADD REPLY • link 15 months ago by GenoMax 152k

0

That is not correct. During demultiplexing the indexes are transferred to the fastq headers of demultiplexed data.

What I meant is that they are not kept within the reads themselves.

Thanks for the paper. I knew of it but haven't read it yet. I'll see what they say.

Maybe it is just a random event that the two samples with the highest numbers of reads are also those that participate in the index hopping.

ADD REPLY • link 15 months ago by Assa Yeroslaviz ★ 1.9k