Hi,
I have data from a MiniSeq run (single-end sequencing).
All of my samples are dual-indexed, and all indices are 8 nt long.
Below is the RunInfo.xml file:
<RunInfo xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
Version="4">
<Run Id="201029_MN00153_0075_A000H37WNG" Number="75">
<Flowcell>000H37WNG</Flowcell>
<Instrument>MN00153</Instrument>
<Date>201029</Date>
<Reads>
<Read Number="1" NumCycles="151" IsIndexedRead="N"/>
<Read Number="2" NumCycles="8" IsIndexedRead="Y"/>
<Read Number="3" NumCycles="8" IsIndexedRead="Y"/>
</Reads>
<FlowcellLayout LaneCount="1" SurfaceCount="2" SwathCount="3" TileCount="10" SectionPerLane="1" LanePerSection="1">
<TileSet TileNamingConvention="FiveDigit">
<Tiles>
<Tile>1_11102</Tile>
<Tile>1_21102</Tile>
.
.
.
</Tiles>
</TileSet>
</FlowcellLayout>
<ImageDimensions Width="2592" Height="1944"/>
<ImageChannels>
<Name>Red</Name>
<Name>Green</Name>
</ImageChannels>
I also have my SampleSheet.csv in the below format:
[Header]
Local Run Manager Analysis Id,7007
Experiment Name,2020-10-29
Date,2020-10-29
Module,GenerateFASTQ - 2.0.1
Workflow,GenerateFASTQ
Library Prep Kit,Nextera DNA CD Indexes - 96 Indexes Plated
Chemistry,Amplicon
[Reads]
151
[Settings]
adapter,CTGTCTCTTATACACATCT
[Data]
Sample_ID,Sample_Name,Description,Index_Plate_Well,index,I7_Index_ID,index2,I5_Index_ID,Sample_Project
1,1,,A01,ATTACTCG,H701,TATAGCCT,H505,
2,2,,A02,ATTACTCG,H702,ATAGAGGC,H506,
3,3,,A03,ATTACTCG,H703,CCTATCCT,H517,
4,4,,A04,ATTACTCG,H705,GGCTCTGA,H505,
5,5,,A05,ATTACTCG,H707,AGGCGAAG,H506,
6,6,,B01,ATTACTCG,H702,TAATCTTA,H517,
7,7,,B02,ATTACTCG,H703,CAGGACGT,H505,
8,8,,B03,ATTACTCG,H701,GTACTGAC,H506,
9,9,,B04,TCCGGAGA,H707,TATAGCCT,H517,
10,10,,B05,TCCGGAGA,H723,ATAGAGGC,H505,
11,11,,C01,TCCGGAGA,H703,CCTATCCT,H506,
.
.
.
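One thing that can be checked directly in the [Data] rows above is whether each index sequence is always paired with the same index ID. The following is a minimal sketch of such a check, not part of the original post; the helper names `read_data_section` and `ids_per_sequence` are made up for illustration, and the embedded sheet is a trimmed copy of four rows shown above.

```python
import csv
from collections import defaultdict

# Trimmed copy of the sample sheet above (rows 1, 2, 3 and 9 only).
SAMPLE_SHEET = """[Header]
Workflow,GenerateFASTQ
[Reads]
151
[Data]
Sample_ID,Sample_Name,Description,Index_Plate_Well,index,I7_Index_ID,index2,I5_Index_ID,Sample_Project
1,1,,A01,ATTACTCG,H701,TATAGCCT,H505,
2,2,,A02,ATTACTCG,H702,ATAGAGGC,H506,
3,3,,A03,ATTACTCG,H703,CCTATCCT,H517,
9,9,,B04,TCCGGAGA,H707,TATAGCCT,H517,
"""


def read_data_section(text):
    """Return the rows of the [Data] section as a list of dicts."""
    lines = text.splitlines()
    start = next(i for i, line in enumerate(lines)
                 if line.strip().startswith("[Data]")) + 1
    data_lines = [line for line in lines[start:] if line.strip()]
    return list(csv.DictReader(data_lines))


def ids_per_sequence(rows, seq_col, id_col):
    """Map each index sequence that carries more than one ID to its IDs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[seq_col]].add(row[id_col])
    return {seq: sorted(ids) for seq, ids in seen.items() if len(ids) > 1}


rows = read_data_section(SAMPLE_SHEET)
print(ids_per_sequence(rows, "index", "I7_Index_ID"))
print(ids_per_sequence(rows, "index2", "I5_Index_ID"))
```

Any sequence reported here is listed under several different index IDs, which suggests the IDs and sequences in the sheet do not belong together and should be confirmed with whoever prepared the libraries.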
I used the below command line:
>bcl2fastq --runfolder-dir /proj/data --output-dir /proj/data --sample-sheet /proj/data/SampleSheet.csv --use-bases-mask Y*,I8n*,I8n*,Y* --barcode-mismatches 0
But I get an error for --use-bases-mask Y*,I8n*,I8n*,Y*. I am not sure whether this is a suitable --use-bases-mask value for single-end reads with dual indices.
The error:
ERROR: bcl2fastq::common::Exception: 2020-Nov-05 17:44:00: Success (0): /sw/apps/bioinfo/bcl2fastq/2.20.0/src/bcl2fastq/src/cxx/lib/layout/UseBasesMask.cpp(61): Throw in function bcl2fastq::layout::UseBasesMask::UseBasesMask(std::string, std::vector<bcl2fastq::layout::ReadMetadata>::const_iterator, std::vector<bcl2fastq::layout::ReadMetadata>::const_iterator)
Dynamic exception type: boost::exception_detail::clone_impl<bcl2fastq::layout::UseBasesMaskFormatError>
std::exception::what: UseBasesMask formatting error. A mask must be specified for each read. Number of reads: 3 Base masks: 'y*,i8n*,i8n*,y*'
Can anyone please help me with this issue?
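The error message says the mask needs exactly one entry per read in RunInfo.xml, and this run lists three reads while the mask has four entries. As a rough sketch (not from the original thread), a mask with the right number of entries can be derived from the `<Read>` elements. The function name `bases_mask_from_runinfo` and the trimmed XML string are assumptions for illustration; the resulting mask uses explicit cycle counts, which is only one of several valid ways to write it.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <Reads> block from the RunInfo.xml above.
RUNINFO_XML = """<RunInfo Version="4">
  <Run Id="201029_MN00153_0075_A000H37WNG" Number="75">
    <Reads>
      <Read Number="1" NumCycles="151" IsIndexedRead="N"/>
      <Read Number="2" NumCycles="8" IsIndexedRead="Y"/>
      <Read Number="3" NumCycles="8" IsIndexedRead="Y"/>
    </Reads>
  </Run>
</RunInfo>"""


def bases_mask_from_runinfo(xml_text):
    """Build one mask entry per read: Y<cycles> for data reads, I<cycles> for index reads."""
    root = ET.fromstring(xml_text)
    reads = sorted(root.iter("Read"), key=lambda r: int(r.get("Number")))
    parts = []
    for read in reads:
        letter = "I" if read.get("IsIndexedRead") == "Y" else "Y"
        parts.append(f"{letter}{read.get('NumCycles')}")
    return ",".join(parts)


mask = bases_mask_from_runinfo(RUNINFO_XML)
print(mask)  # one entry per read in RunInfo.xml
```

For a single-end run with two index reads, the mask therefore has three comma-separated entries and no trailing entry for a second data read.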
@genomax Thank you for your comment; it helped me get rid of the --use-bases-mask error. But I am now facing another problem: I get just one FASTQ file, Undetermined. Could you suggest how to get all of the FASTQ files properly? Any checkpoints? Can this problem be caused by the adapter sequence in my SampleSheet.csv file?
Double check your SampleSheet.csv above. Did you make it using Illumina's Experiment Manager? Your [Data] section headers do not appear to be right.
It was automatically generated by the sequencing instrument. Which columns should be removed? I also want to point out that each group of samples (for example, 8) shares the same i7 index while their i5 indices all differ. [This is clear if you double-check my SampleSheet.csv file above.]
If the sequencer made the sample sheet then it must be right for on-board demultiplexing, but it looks like you are using bcl2fastq off-sequencer.
If you look in your Undetermined file, you will see that the invariant index (which you think is i7) is actually in the second position. So you may need to flip those indexes in your SampleSheet.csv. Try that out.
I just double-checked. The indices reported in the Undetermined FASTQ file headers do not exist in the SampleSheet.csv. Can this be the problem?
Definitely. Make sure you have the correct entries in SampleSheet.csv. Double check with whoever made the libraries if needed.
If nothing works then I will point you to some code I have here that will tell you a list of indexes present and their numbers: C: Demultiplexing reads with index present in the labels
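The linked code is not reproduced in this thread. As a stand-in, here is a minimal Python sketch of the same idea: tally the index pair that bcl2fastq writes at the end of each FASTQ header line. The function name `tally_indexes` is an assumption, and the two-record example uses made-up headers and sequences, not data from this run.

```python
from collections import Counter


def tally_indexes(fastq_lines):
    """Count the index field (text after the last ':') of every header line."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:  # header is the first line of each 4-line record
            counts[line.rstrip().rsplit(":", 1)[-1]] += 1
    return counts


# Made-up records for illustration only.
example = [
    "@INSTR:1:FLOWCELL:1:11102:1000:1000 1:N:0:ACGTACGT+TGCATGCA",
    "ACGT",
    "+",
    "IIII",
    "@INSTR:1:FLOWCELL:1:11102:1001:1001 1:N:0:ACGTACGT+TGCATGCA",
    "TTGA",
    "+",
    "IIII",
]

counts = tally_indexes(example)
for index_pair, n in counts.most_common(10):
    print(n, index_pair)
```

On a real file, the lines would come from the opened Undetermined FASTQ (via `gzip.open(path, "rt")` if it is compressed), and the most common index pairs can then be compared against the SampleSheet.csv entries.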
Hi @genomax, if you saw my latest answer you have probably noticed that I succeeded in assigning reads to each sample FASTQ file. My problem now is that my command gives results, but I get very few reads in each FASTQ file: about 22 KB for the actual samples and about 1.7 GB for the Undetermined FASTQ file. Could you help me figure out which step is most critical to check? With or without --use-bases-mask I get the same result. I also provided the reverse complement (RC) of the second index in the SampleSheet.csv.
That means something is still wrong. Use the code I mentioned in my comment above to figure out which sequences are ending up in the Undetermined file and work out what needs to be done.
I ran your code on the Undetermined file and here is just the header of the sorted version (based on the counts). I do not have AGATCTCG in either my i7 or i5 indices in the SampleSheet.csv file. Any recommendations?
I am afraid something is not right. I have a feeling that the index 1 read has failed (in the order listed above). MiniSeq is a 2-color machine, so all those G's indicate no signal.
As for AGATCTCG, are you sure there was no error in your original SampleSheet with the invariant index? That is clearly what the sequencer sees, but again, if there was an index read failure, who knows ...
I am going to suggest that you contact Illumina tech support and have them take a look at this run remotely. If there was a machine/reagent issue then they will replace the reagents as long as you have a maintenance contract on this sequencer.
While waiting on a return call from Illumina, take a random set of reads from the Undetermined file and BLAST them at NCBI to see if they are all phiX by chance. If some belong to the genome you are working with then it would support my observation that there was some kind of index read failure. If they are all phiX then your libraries may have failed.
Thank you so much for your detailed inputs. I will consider all and will also update you about the final decision.
The read excerpts below indicate that the sequence associated with those indices is PhiX. I don't think those are supposed to have indices at all, so I think the missing index is okay.
Dear @genomax, after asking our wet lab colleague, I just learned that the adapters assigned by the sequencing instrument belong to a different library preparation kit (Nextera), whereas for our experiment we used the TruSeq CD library kit. Can this cause the problem? Should I remove the Nextera information from the SampleSheet?
If you think the wrong library got sequenced then replace the wrong library information in samplesheet and see if the samples demultiplex. If they do then you have your answer. It is possible that the wrong adapters were used by mistake to make the library (human error).
Hi again, after a long search we still do not know how to proceed with this. We decided to run bcl2fastq naively (without demultiplexing) in order to get 3 different FASTQ files reporting the actual inserts, index 1, and index 2. Do you have any experience with this kind of run?
Did you talk with Illumina tech support? I think that is your best option. If this run had failed on index reads then you may be chasing ghosts trying to demultiplex this data.
You can run bcl2fastq with the option --create-fastq-for-index-reads to create separate files for index reads. Don't provide any index info if you want to send all reads to "Undetermined" files.
Thank you so much for your kind help, much appreciated. I agree with you; I am also getting rather frustrated. I have not heard anything from them yet, but we want to push this to its limit. We think bcl2fastq probably does not pick up the correct index info from the SampleSheet.csv. So with --create-fastq-for-index-reads I will only get FASTQ files for the indexes? What about the actual inserts? Is there any option to get 3 separate FASTQ files (for the insert, index 1, and index 2, respectively)?
You will get three separate files: Read 1, Index 1, and Index 2.
If the run has failed during sequencing then the data you have in hand would be completely unreliable. The examples of indexes you have posted above seem to indicate this outcome.
Using --create-fastq-for-index-reads is the final step for us. I think with these data we can definitely reach a conclusion about whether there were index read failures or not. Also, one thing about the second index that I posted above: it is not in our index list (neither index 1 nor index 2).
Dear @genomax, here I have an update as well as a question for you. I ran bcl2fastq in a naive way with --create-fastq-for-index-reads. After the successful naive run, I randomly chose five index 1 sequences, five index 2 sequences, and their reverse complements, and looked at their abundances in these three files. It seems that I2 provides information about index 2; however, I1 does not give information about index 1. On the other hand, we do detect index 1 sequences in R1 itself.
Question: Based on these results, I do not understand why, as the read starts right before the insert and can go until index 1. In theory index 2 should not be detected.
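The abundance comparison described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the poster's actual script: the helper names are made up, the toy reads are invented, and the file name in the final comment follows the usual bcl2fastq naming but should be checked against the actual output folder.

```python
import gzip

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_reads_containing(sequences, query):
    """Return (reads containing query, reads containing its reverse complement)."""
    sequences = list(sequences)
    rc = reverse_complement(query)
    forward = sum(query in s for s in sequences)
    reverse = sum(rc in s for s in sequences)
    return forward, reverse


def fastq_sequences(path):
    """Yield the sequence line of every record in a (optionally gzipped) FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                yield line.strip()


# Invented reads: one carries ATTACTCG, one carries its reverse complement.
toy_reads = ["TTATTACTCGAA", "CCCGAGTAATCC", "GGGGGGGG"]
hits = count_reads_containing(toy_reads, "ATTACTCG")
print(hits)

# On real data (assumed file name):
# count_reads_containing(fastq_sequences("Undetermined_S0_L001_I1_001.fastq.gz"), "ATTACTCG")
```

Running the same count over the R1, I1, and I2 files for each index in the sample sheet gives a small table showing in which read, and in which orientation, each index actually appears.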
Also, one more thing I need to mention about the AGATCTCG index 2 above: it is exactly the reverse complement of the sequence next to the index primer binding position in the TruSeq Universal P5 adapter.
If you are looking for things to try, try adding a project name; then your FASTQs should end up in a folder with that name. (I also recommend not running with --barcode-mismatches 0. You will throw away a bunch of fine reads.)
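The reverse-complement observation about AGATCTCG can be checked in a few lines. This sketch assumes the TruSeq Universal Adapter sequence as commonly listed in Illumina's adapter sequences document; please verify it against the current version of that document before relying on the result.

```python
# Assumed TruSeq Universal Adapter sequence (5' to 3'); verify against Illumina's document.
TRUSEQ_UNIVERSAL = "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


observed = "AGATCTCG"           # index 2 sequence seen in the Undetermined file
rc = reverse_complement(observed)
position = TRUSEQ_UNIVERSAL.find(rc)  # -1 would mean "not present"

print(rc, position)
if position >= 0:
    # Show a few bases of context after the match.
    print(TRUSEQ_UNIVERSAL[position:position + len(rc) + 4])
```

If the reverse complement is found inside the adapter, the index 2 read is most likely reading adapter sequence rather than a real i5 index, which is what one would expect for molecules that carry no i5 index.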
My undetermined Fastq file header is:
If you blast those sequences, you'll see they are PhiX.