Hello everyone,
I recently ran a NextSeq 2000 using 6-nucleotide Illumina TruSeq unique indices. My goal is to demultiplex using bcl2fastq and extract only the reads that match my indices. However, there's a complication: the run included samples from another person whose 8-nucleotide R1 indices overlap with my indices.
I'm looking for advice on:
How can I effectively run the bcl2fastq tool to extract only the reads that exhibit a precise match with my own 6-nucleotide indices?
Is it possible to execute the demultiplexing process while concurrently generating distinct fastq files for the indices? This would let me match index entries to reads in the sample fastq files, enabling removal based on the index fastq file.
Any insights on these methods would be appreciated. Thank you.
How similar are the 8mers to your 6mers? Like perfect overlap and just 2 bases longer or "similar"?
It is a perfect overlap like this "ATCGAA" vs "ATCGAAGG". Basically, the last two nucleotides are different.
Did you run the actual run with 6 cycles on index? Are these single index samples?
If the answer is yes to both, then you will not be able to tell the samples apart during demultiplexing. If the other person's samples are from a different species, then you may need to use bbsplit.sh to bin the reads. That would be about the best you can do.
If you ran the run with 8 index cycles, then you should be able to separate the samples.
This is the run information:
In my samples, the R1 indices overlap with the other person's indices. I can demultiplex using only the R1 index, but the other person's demultiplexing has to be done with both R1 and R2 indices. So he has no problem retrieving his reads, but I do, because my 6 nt indices overlap as described above.
I already ran demultiplexing using my indices, but I ended up with many reads that do not belong to my experiment. This is because my indices completely overlap with the 8 nt indices, as in "ATCGAA" vs "ATCGAAGG". How can I overcome this?
Here is my samplesheet structure:
Since you only have 6 bp indexes, your samples should show up with an extra AT (I don't immediately recall the two bases), so look for these. You could then add AT to your indexes to differentiate your samples from the others. Your samples will also show a phantom index that is not a real i5 index, so that can also be added to the samplesheet.
This is going to take some finagling to sort out. As you have realized, it is not a good idea to have overlapping indexes in a run and a mix of 1D and 2D indexed samples.
Sorry, I did not get it. My indexes are 6 nt, and the samples that are not mine have 8 nt indexes that have AT as extra nucleotides at the end. Is there any straightforward way to retrieve my reads? I was thinking of running the conversion so that it reports the 8 nt index for each read, and then keeping only the reads whose first 6 index nucleotides match my indexes and whose last two nucleotides are not AT. What do you think?
Now I am confused. Did the other person who had samples on this flowcell also have 6 bp indexes? I thought they had 8 bp dual indexes and you have 6 bp single indexes. Is that not the case?
When you run sequencing longer than the actual index length, those extra bases show up (I think they are generally AT). If both of you had indexes of identical length, then your only option is to separate the reads based on alignments, assuming the genomes are different enough.
Yes, that is true: they had 8 bp dual indexes and I have 6 bp single indexes. So, given what I said in my previous comment, is there any suitable way?
Can you run the code that is here: "Demultiplexing reads with index present in the labels", and show us what combinations of indexes are present in your data? Do this preferably with non-demultiplexed data. You can create non-demultiplexed data by using a blank samplesheet (without any sample lines). That will put all reads in "Undetermined" files.
I am going to posit that reads that have correct index 1 but the non-real index 2 are going to be yours.
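For readers without the linked awk script at hand, here is a rough Python sketch of the same kind of tally. It is not the script from that post. It assumes the usual Illumina FASTQ header layout, where the index sequence(s) follow the last colon on the header line, and the function name tally_header_indexes, the read names, and the second-index sequences in the demo are made up for illustration.

```python
from collections import Counter


def tally_header_indexes(lines):
    """Count the index field found at the end of each FASTQ header line.

    Assumes 4-line FASTQ records and headers shaped like
    '@name 1:N:0:INDEX1+INDEX2' (index after the last colon).
    """
    counts = Counter()
    for line_number, line in enumerate(lines):
        if line_number % 4 != 0:
            continue  # only the first line of each record is a header
        fields = line.rstrip("\n").split(" ", 1)
        if len(fields) < 2:
            continue  # header has no comment part, so no index to count
        counts[fields[1].rsplit(":", 1)[-1]] += 1
    return counts


# Tiny in-memory demo; read names and index 2 sequences are placeholders.
demo = [
    "@SIM:1:FC:1:1101:1000:1000 1:N:0:ATCGAAGG+CCTTAAGG\n", "ACGT\n", "+\n", "IIII\n",
    "@SIM:1:FC:1:1101:1000:1001 1:N:0:ATCGAAAT+GGGGGGGG\n", "ACGT\n", "+\n", "IIII\n",
    "@SIM:1:FC:1:1101:1000:1002 1:N:0:ATCGAAAT+GGGGGGGG\n", "ACGT\n", "+\n", "IIII\n",
]
demo_counts = tally_header_indexes(demo)
for combo, n in demo_counts.most_common():
    print(n, combo)
```

On a real run, the same function can be fed an open file handle, for example one returned by gzip.open(path, "rt") for a gzipped Undetermined fastq. Sorting by count makes the dominant index 1 / index 2 pairs easy to spot.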
Thank you for your input. I ran demultiplexing with a blank samplesheet and the --create-fastq-for-index-reads option. Now I have R1 and R2 Undetermined reads together with fastqs for the indexes (8 nt). I think I can now look in the R1 index fastq file for entries whose first 6 nucleotides match my indexes and whose last two are not AT, and those are my reads. What do you think?
Please show the output of the awk command I had asked you to run.
I ran it on the Undetermined_R1 fastq file but I got this result:
That is odd. Are the index sequences not in the headers? Can you show one example header?
This is the head of Undetermined_R1:
But by running the demultiplexing step without providing any index, I did get fastq files of R1 and R2 indexes. Here is the head of R1 index fastq file:
Do you think I can look in the R1 index fastq file for entries whose first 6 nucleotides match my indexes and whose last two are not AT, and treat those as my reads?
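The check described above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the read fastq and the index fastq are taken to list records in the same order with matching read names, and all function and parameter names (filter_by_index, index_matches, trailing, trailing_marks_mine) are made up here. Because the thread has not settled whether a trailing AT marks the 6 nt samples or the other samples, that rule is left as a parameter instead of being hard-coded.

```python
import io
from itertools import islice


def fastq_records(handle):
    """Yield FASTQ records as lists of 4 lines."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield record


def index_matches(index_seq, my_indexes, trailing="AT", trailing_marks_mine=False):
    """True if the first 6 bases are one of my indexes and bases 7-8 fit the rule.

    trailing_marks_mine=False keeps reads whose last two bases are NOT `trailing`;
    trailing_marks_mine=True keeps reads whose last two bases ARE `trailing`.
    """
    index_seq = index_seq.upper()
    if index_seq[:6] not in my_indexes:
        return False
    return (index_seq[6:8] == trailing) == trailing_marks_mine


def filter_by_index(reads, indexes, out, my_indexes, **rule):
    """Copy to `out` the reads whose paired index record passes index_matches."""
    kept = 0
    for read, idx in zip(fastq_records(reads), fastq_records(indexes)):
        if read[0].split()[0] != idx[0].split()[0]:
            raise ValueError("read and index files are out of sync")
        if index_matches(idx[1].strip(), my_indexes, **rule):
            out.writelines(read)
            kept += 1
    return kept


# In-memory demo with placeholder reads; only ATCGAA / ATCGAAGG come from the thread.
demo_reads = "@r1 1:N:0:0\nAAAA\n+\nIIII\n@r2 1:N:0:0\nCCCC\n+\nIIII\n@r3 1:N:0:0\nGGGG\n+\nIIII\n"
demo_index = "@r1 1:N:0:0\nATCGAAGG\n+\nIIIIIIII\n@r2 1:N:0:0\nATCGAAAT\n+\nIIIIIIII\n@r3 1:N:0:0\nTTTTTTAT\n+\nIIIIIIII\n"
demo_out = io.StringIO()
demo_kept = filter_by_index(io.StringIO(demo_reads), io.StringIO(demo_index), demo_out, {"ATCGAA"})
print(demo_kept, "read(s) kept")
```

Running the same filter with trailing_marks_mine=True flips the rule, which is worth trying on a small subset first and comparing against what the reads actually align to.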
Sorry, my apologies. I should have said to run the demultiplexing with one dummy sample that looks something like this:
We are using 8 N's because we need to get all 8 bases that were sequenced for both indexes irrespective of the sample. This should properly populate the fastq headers with index sequences. Then run the awk script on these files.