Question

Undetermined fastq and top unknown barcodes

0

9 days ago

1769mkc ★ 1.2k

print(matched_barcodes)

   Unknown_Barcode   Match
1  AAGACACT+TGCTGTCA No Match
2  GTCCACAG+TATTCGCG No Match
3  GGGGGGGG+AGATCTCG No Match
4  TCTGCAAG+AAGGTGAA No Match
5  ATTATGTT+GAATACAG No Match
6  AGTAAGCG+GCGAATGA No Match
7  CACATCCT+ATGGTATT No Match
8  ACACGATC+AGTGCAGC No Match
9  AGCATGGA+GTAGCGCT No Match
10 CAGCAAGG+GGTAGAGG No Match

These are the top unknown barcodes, which I tried to find in the samplesheet that was used to generate the fastq files. None of them matched, and most of the reads are going into the Undetermined fastq. Next, I looked for partial matches (index1 only or index2 only) across the samplesheet.

This is what I get:

Sample  index     index2    Unknown_Barcode    Match_Type
Samp_B  CACATCCT  AGTGCAGC  CACATCCT+ATGGTATT  index1
Samp_C  ACACGATC  ATGGTATT  CACATCCT+ATGGTATT  index2
Samp_D  ACACGATC  ATGGTATT  ACACGATC+AGTGCAGC  index1
Samp_E  CACATCCT  AGTGCAGC  ACACGATC+AGTGCAGC  index2
Samp_F  AGCATGGA  TTAGCGCT  AGCATGGA+GTAGCGCT  index1
Samp_G  CAGCAAGG  GTTAGAGG  CAGCAAGG+GGTAGAGG  index1
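
A partial-match search of this kind could be sketched in Python roughly as follows. This is only an illustration: the function name and the tuple layout are assumptions, and the two samplesheet rows are copied from the table above.

```python
def partial_matches(samplesheet, unknown_barcodes):
    """Return rows where only one of the two indexes matches the samplesheet.

    samplesheet: list of (sample, index, index2) tuples.
    unknown_barcodes: list of "INDEX1+INDEX2" strings.
    """
    rows = []
    for unknown in unknown_barcodes:
        obs1, obs2 = unknown.split("+")
        for sample, index, index2 in samplesheet:
            if obs1 == index and obs2 == index2:
                continue  # a full match is not a partial match
            if obs1 == index:
                rows.append((sample, index, index2, unknown, "index1"))
            if obs2 == index2:
                rows.append((sample, index, index2, unknown, "index2"))
    return rows


# Two samplesheet rows and one unknown barcode taken from the tables above.
sheet = [
    ("Samp_B", "CACATCCT", "AGTGCAGC"),
    ("Samp_C", "ACACGATC", "ATGGTATT"),
]
for row in partial_matches(sheet, ["CACATCCT+ATGGTATT"]):
    print("\t".join(row))
```

A barcode whose index1 belongs to one sample and whose index2 belongs to another, as here, is the pattern to look for when the pairing in the samplesheet is suspect.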


--barcode-mismatches 0 was used.

So what is the way to figure out what went wrong with bcl2fastq?

What troubleshooting steps should I follow to find the problem in the samplesheet, or which bcl2fastq parameter should I use to reduce the number of Undetermined reads?

bcl2fastq • 920 views

ADD COMMENT • link updated 7 days ago by GenoMax 148k • written 9 days ago by 1769mkc ★ 1.2k

1

8 days ago

swbarnes2 14k

How many samples do you have? How many index pairs do you have in the fastq, assuming you can distinguish the real pairs from noise?

At a glance, it looks like index 1 and index 2 are not paired the way you think. You are going to have to hope that the mixup follows a simple rule, like index 2 is one slot off from what it's supposed to be.

ADD COMMENT • link 8 days ago by swbarnes2 14k

0

I see 304 samples in the samplesheet I was given. By "How many index pairs do you have in the fastq, assuming you can distinguish the real pairs from noise?" do you mean unique index pairs?

ADD REPLY • link 8 days ago by 1769mkc ★ 1.2k

2

You have 304 samples?

Did you prepare the libraries? Can you talk to whomever added the indices, to see if they can explain how they should be paired?

If there are 304 samples, there should be about 304 index1+index2 sets with quite a lot of reads, and everything else should be a lot less.

ADD REPLY • link 7 days ago by swbarnes2 14k

0

I'm not involved in the library preparation; right now I'm only doing the troubleshooting. What is the general rule for assigning indexes when there are lots of samples? Do they need to consider the Hamming distance, which I saw mentioned in the Illumina guidelines? For the data I received, GenDx indexes were used.

Yes, there are 304 samples, and the issue is in one particular lane, whose SAV data otherwise looks fine.

That particular lane has a mixture of samples.
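
On the Hamming distance point: the distance between two indexes is the number of positions at which they differ, and with --barcode-mismatches 0 an index read that is even one base off its samplesheet entry is left unassigned. A minimal sketch for checking a set of indexes is below; the function names are illustrative and not part of any Illumina tool.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length indexes differ."""
    if len(a) != len(b):
        raise ValueError("indexes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def closest_pairs(indexes):
    """Return (minimum distance, pairs at that distance) over distinct indexes."""
    unique = sorted(set(indexes))
    best, pairs = None, []
    for a, b in combinations(unique, 2):
        d = hamming(a, b)
        if best is None or d < best:
            best, pairs = d, [(a, b)]
        elif d == best:
            pairs.append((a, b))
    return best, pairs


# Samplesheet index2 for Samp_F versus the index2 seen in the unknown barcode.
print(hamming("TTAGCGCT", "GTAGCGCT"))
```

The same function can be pointed at observed versus expected indexes, as in the last line, to see whether an unknown barcode is a near miss of a samplesheet entry.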

ADD REPLY • link 7 days ago by 1769mkc ★ 1.2k

0

"there should be about 304 index1+index2 sets with quite a lot of reads, and everything else should be a lot less." Individual indexes are repeated across samples. I will add the counts for both indexes.

ADD REPLY • link 7 days ago by 1769mkc ★ 1.2k

score 3 · Accepted Answer · 2024-12-13

3

9 days ago

GenoMax 148k

Find the top indexes present in the "Undetermined" pool file. Then work your way backwards to find what the issue is with the samplesheet you provided.

Use the code in "Demultiplexing reads with index present in the labels" to get that information.
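
For readers without the linked awk code to hand, the same idea can be sketched in Python. This is not the linked code, only an illustration that assumes the usual bcl2fastq header layout, in which the index pair is the last colon-separated field of each header line; the function name is made up for the example.

```python
import gzip
from collections import Counter


def count_index_pairs(fastq_gz):
    """Count the index string at the end of each FASTQ header line."""
    counts = Counter()
    with gzip.open(fastq_gz, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:  # header line of each 4-line record
                counts[line.rstrip("\n").rsplit(":", 1)[-1]] += 1
    return counts


# Usage (file name as used elsewhere in this thread):
# for pair, n in count_index_pairs("Undetermined_Reads_R1.fastq.gz").most_common(10):
#     print(f"{pair}: {n}")
```

Counter.most_common returns the pairs already sorted by read count, so the top of that list is what should be compared against the samplesheet.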

ADD COMMENT • link 9 days ago by GenoMax 148k

0

Will give it a try and post an update.

ADD REPLY • link 9 days ago by 1769mkc ★ 1.2k

0

 bbmap/./demuxbyname.sh -Xmx20g in=Undetermined_Reads_R1.fastq.gz in2=Undetermined_Reads_R2.fastq.gz out=%_R1.fq.gz out2=%_R2.fq.gz suffix names=index.txt prefixmode=f
java -ea -Xmx20g -Xms20g -cp /home/bbmap/current/ jgi.DemuxByName2 -Xmx20g in=NUCLEOME/Undetermined_Reads_R1.fastq.gz in2=Undetermined_Reads_R2.fastq.gz out=%_R1.fq.gz out2=%_R2.fq.gz suffix names=index.txt prefixmode=f
Executing jgi.DemuxByName2 [-Xmx20g, in=Undetermined_Reads_R1.fastq.gz, in2=Undetermined_Reads_R2.fastq.gz, out=%_R1.fq.gz, out2=%_R2.fq.gz, suffix, names=index.txt, prefixmode=f]

Set INTERLEAVED to false
Input is being processed as paired
[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
java.lang.AssertionError:
There appear to be different numbers of reads in the paired input files.
The pairing may have been corrupted by an upstream process.  It may be fixable by running repair.sh.
        at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:503)
        at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:368)
        at stream.ConcurrentGenericReadInputStream.run0(ConcurrentGenericReadInputStream.java:208)
        at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:183)
        at java.base/java.lang.Thread.run(Thread.java:834)

This is what I see when I run demuxbyname.sh.

ADD REPLY • link 8 days ago by 1769mkc ★ 1.2k

2

There appear to be different numbers of reads in the paired input files.

That is very odd. If the files came out of the bcl2fastq run, then that should not have happened. Is it possible that your original bcl2fastq run did not completely finish? You could try repair.sh to fix the sync, but see how many singletons you end up with.
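
One way to check whether the two files really are out of sync, or whether one of them is cut short, is to count records in each. The sketch below is an assumption-laden illustration rather than a replacement for repair.sh: it relies on Python's gzip module raising EOFError when a compressed stream ends early, and it would not notice a file that happens to be cut exactly at a compressed block boundary.

```python
import gzip


def count_fastq_records(path):
    """Return (records read, truncated flag) for a gzipped FASTQ file."""
    n_lines = 0
    truncated = False
    try:
        with gzip.open(path, "rt") as handle:
            for _ in handle:
                n_lines += 1
    except EOFError:
        # The compressed stream ended before its end-of-stream marker.
        truncated = True
    return n_lines // 4, truncated


# Usage (file names as used elsewhere in this thread):
# print(count_fastq_records("Undetermined_Reads_R1.fastq.gz"))
# print(count_fastq_records("Undetermined_Reads_R2.fastq.gz"))
```

If the two counts differ, or either file is flagged as truncated, the problem lies upstream of demuxbyname.sh, and re-running or re-copying the bcl2fastq output is the first thing to try.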

I would suggest that you first try to rectify the samplesheet based on the indexes you see in the Undetermined file, using the awk code I linked. Then repeat the demultiplexing with bcl2fastq. demuxbyname.sh should be the second option, if you are not able to get bcl2fastq to demultiplex the data properly.

If you made the SampleSheet file manually, then you may want to use the Illumina Experiment Manager (IEM) software (Windows only) to create one, especially if this is something you don't do regularly. As swbarnes2 noted, your index combinations may be shifted by one row or something along those lines.
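
The "shifted by one row" idea can be tested mechanically. The sketch below is one possible approach, with invented function names, and it assumes the shift wraps around at the end of the sheet: it pairs index1 of each row with index2 of a row some offset away, then counts how many observed barcodes each offset explains.

```python
def pairs_after_shift(sheet, shift):
    """Pair index1 of row i with index2 of row (i + shift), wrapping around."""
    n = len(sheet)
    return {
        f"{sheet[i][1]}+{sheet[(i + shift) % n][2]}": sheet[i][0]
        for i in range(n)
    }


def score_shifts(sheet, observed, max_shift=3):
    """Return {shift: number of observed pairs explained by that shift}."""
    scores = {}
    for shift in range(-max_shift, max_shift + 1):
        shifted = pairs_after_shift(sheet, shift)
        scores[shift] = sum(pair in shifted for pair in observed)
    return scores


# Samplesheet rows and unknown barcodes quoted earlier in this thread.
sheet = [
    ("Samp_B", "CACATCCT", "AGTGCAGC"),
    ("Samp_C", "ACACGATC", "ATGGTATT"),
]
observed = ["CACATCCT+ATGGTATT", "ACACGATC+AGTGCAGC"]
print(score_shifts(sheet, observed, max_shift=1))
```

With only two rows, a shift of one in either direction is the same as swapping the index2 values of the two samples, so both non-zero offsets explain both barcodes here. On a full samplesheet, a single offset that explains most of the high-count barcodes would point to a row-shift error.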

ADD REPLY • link 8 days ago by GenoMax 148k

0

"You could try repair.sh to fix the sync but see how many singletons you end up with." I did try this as well, and I get the same issue. I will go ahead with your suggestion and try IEM.

"As swbarnes2 noted your index combinations may be shifted by one row or something along that line." Can you explain this? Do you mean this might be a manual error made while creating the samplesheet?

ADD REPLY • link 8 days ago by 1769mkc ★ 1.2k

0

Once you have run the awk code on your Undetermined R1 file, show us what you get with

grep -w "AAGACACT"
grep -w "TGCTGTCA"

Also show us the top 10 rows of the resulting barcodes and their read counts.

ADD REPLY • link 7 days ago by GenoMax 148k

0

 zcat Undetermined_Reads_R1.fastq.gz | head
@A02125R:106:HVFVVDSXC:3:1101:20491:1000 1:N:0:AAGACACT+TGCTGTCA
CCCACGAGCTACGGCTGCATACTTCACCATGCTGTACACACGCAATGGTTGCTTCCTAAGACCAACACTGTGAGCTAGTCCTGCTGTGAAGGAGTGACTTGAATTTGACTTCTCAACCGGTTCACAGAAAGTTGCCAATGCTGGATTCTCA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF
@A02125R:106:HVFVVDSXC:3:1101:22824:1000 1:N:0:AAGACACT+TGCTGTCA
CCCATTTGAATGTGCCGATATCAGCGGTTTCATGTCGACCAGTGAGACGGCGTTCATTGCCTTCACCATAAGCAGCAATGTGCTCTTTGTGCCTATGACCAAGCTTTTCAATTGCCTTTTTGATAACCTCAAAACCTCCAGATCGGAAGAG
+
FF:FFFF:FFFF:FFF:FFFFFFFF:FFF,FFFFFF,FF:FFFFF:FFF:F:F:FFFFFF:FFFFFFFFFFFFFFFFFFFF::FFFFFFF:FFF,FFFF,FF,F:FFFFFFF:F:FF,F:FFFFFFFFF:FFF,FF,FFFFFFFFFFF:FF
@A02125R:106:HVFVVDSXC:3:1101:24234:1000 1:N:0:AAGACACT+TGCTGTCA
GACGTTTTCGGGATGCCCCTAATAACCAACGAACGGCAAGTGCTTTTCCTTGTGCGGATCCTATTTCAATGGGAACTTGATGAGTCGATCCGCCTACACGTCTTGCTTTTACTGCTATATCGGGAGTTACTCCACGTATTGCTTGACGTAA



grep -w "AAGACACT" output_R1.txt | wc -l
13455

grep -w "TGCTGTCA" output_R1.txt | wc -l
12943


cat output_R1.txt | head
TGGTACTA+CAGGATAA: 1
GTAAACAA+GGGCTCGC: 1
CTACGCGA+ACACGATC: 1
CCCATCCA+ACAGCATC: 4
ATAACACA+AAAATATA: 2
AATAACGA+GGCACTAA: 1
TGCGTCAT+TGATGGGG: 1
CCTAGTAC+CAATCTAC: 1
TCCATACC+ACATAGGA: 1
ATAACACA+AAAATATC: 5

ADD REPLY • link 7 days ago by 1769mkc ★ 1.2k

2

TGGTACTA+CAGGATAA: 1
GTAAACAA+GGGCTCGC: 1
CTACGCGA+ACACGATC: 1
CCCATCCA+ACAGCATC: 4

You have not said what kind of sequencer this run is from, but most of these index combinations have only a few reads assigned and are likely not usable. You will want to find the top index combinations (sort this output by read count) and then see if you can correlate them to the samplesheet you were provided. You may need to reverse-complement (rev-comp) one of the indexes (that is likely the easiest error), or you may need to swap the i7 and i5 columns (another common error), to match the samples to the indexes seen.
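
Both of those checks are easy to script. The sketch below is only an illustration: the function names are invented, and Samp_X with its indexes is a hypothetical samplesheet row, constructed so that its index2 is the reverse complement of the index2 in the top unknown barcode.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of an index sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def candidate_forms(index, index2):
    """Map each transformation to the "index+index2" string it would produce."""
    return {
        "as listed": f"{index}+{index2}",
        "index2 rev-comp": f"{index}+{revcomp(index2)}",
        "index rev-comp": f"{revcomp(index)}+{index2}",
        "both rev-comp": f"{revcomp(index)}+{revcomp(index2)}",
        "columns swapped": f"{index2}+{index}",
    }


def explain(observed, samplesheet):
    """Return (sample, transformation) pairs that reproduce the observed barcode."""
    hits = []
    for sample, index, index2 in samplesheet:
        for label, pair in candidate_forms(index, index2).items():
            if pair == observed:
                hits.append((sample, label))
    return hits


# Hypothetical row: not from the real samplesheet.
sheet = [("Samp_X", "AAGACACT", "TGACAGCA")]
print(explain("AAGACACT+TGCTGTCA", sheet))
```

If one transformation explains most of the high-count barcodes, applying it to the whole samplesheet column is a reasonable next step before re-running bcl2fastq.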

If neither of these cures the problem, then you will have to send the index combinations that look real (those with a significant number of reads) to the people in the lab so they can figure out what went wrong. What you are handing them is what the sequencer saw (the truth), irrespective of what they think should be there (assuming there were no issues with the sequencer run).
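
To pull out the combinations that "look real" from a counts file in the "INDEX1+INDEX2: N" format shown above, something like the following could be used. It is a sketch only; the function name is invented and the min_reads cutoff is an arbitrary placeholder that would need to be chosen from the actual read-count distribution.

```python
def top_index_pairs(lines, min_reads=1000):
    """Parse "INDEX1+INDEX2: N" lines; keep pairs with N >= min_reads, largest first."""
    counts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        pair, _, n = line.rpartition(":")
        counts.append((pair.strip(), int(n)))
    counts.sort(key=lambda item: item[1], reverse=True)
    return [(pair, n) for pair, n in counts if n >= min_reads]


# Three lines from the output posted above, with a toy cutoff of 2 reads.
example = [
    "TGGTACTA+CAGGATAA: 1",
    "CCCATCCA+ACAGCATC: 4",
    "ATAACACA+AAAATATC: 5",
]
print(top_index_pairs(example, min_reads=2))

# Usage on the real file:
# with open("output_R1.txt") as handle:
#     for pair, n in top_index_pairs(handle):
#         print(pair, n)
```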

Once the corrections are made, it should then just be a matter of creating a new samplesheet and re-demultiplexing the data.

ADD REPLY • link 7 days ago by GenoMax 148k