Most data are within Undetermined.fastq, while sample files are empty
2.8 years ago

Hello! I'm an MD by background, currently running a project on tumour metagenomics. We collected patient samples (low microbial abundance) and sequenced them on an Illumina instrument, and I have the output data from that run. The current issue is as follows: apparently because the barcodes were specified incorrectly, the software assigned the reads to samples incorrectly. As a result, almost all sample files are nearly empty, and most of the data (judging by file sizes) is in the Undetermined.fastq files. I'm trying to extract the reads and store them in the correct separate files. The problem is that I'm not very experienced in doing that, and we are having trouble reaching our bioinformatician.

I was advised to "restart the analysis program (Illumina MiSeq Reporter) and use the correct barcodes so that the reads are split by sample. And, as the next step, analyze the data obtained".

I'm looking for any advice, as specific as possible. If there is anything I can do by myself, please let me know (I'm familiar with R). Otherwise I would highly appreciate any other open-source or paid solution to finally obtain the results. We've done a huge amount of work collecting those samples and it would be very frustrating to lose it.

Any advice is appreciated. Best regards.

Illumina metagenomics analysis • 6.2k views
ADD COMMENT

Dear colleagues, Thank you very much for so many quick and practical tips. To my great regret, we had several unforeseen events with those who previously worked on Illumina and those who analyzed the data. In this situation, I have to look for solutions on my own, but obviously, I lack enough knowledge and experience in that field. A very unpleasant situation, from which, nevertheless, I need to find a way out. I'm wondering if I can provide a link to raw data here, and maybe you could help me with this step (putting the data into the right samples) if it doesn't take a lot of time? I can see that there is a samplesheet.csv with some indexes. I am very embarrassed to ask such a question, but I am at a dead end. If this is unacceptable, please let me know.

Upd: I've tried running demultiplex and sabre, but both end with the same result: empty sample files and a large Undetermined file (I have to admit that I'm not 100% sure I did everything right). As far as I can see, no indices are present within the labels, as suggested by GenoMax.

Anyhow, I'm feeling stuck and literally begging for help.

ADD REPLY

Did you try the awk code I mentioned below?

Can you show us the output of the following commands:

If your files are not compressed

head -4 Undetermined.fastq 

If your files are compressed (i.e. have .gz extension) then

zcat Undetermined.fastq.gz | head -4

As far as I can see, no indices are present within the labels

If that is the case (which we can verify by the command above) then as suggested by @swbarnes2 this data will need to be reprocessed starting with the full raw data folder from MiSeq. There is no way around this.

ADD REPLY

Here comes the output from the command above:

@M02399:61:000000000-JF9YH:1:1101:15389:1338 1:N:0:0
TCTTTTCCTTTTTTTCCCTCCCCCCCCTTCCTCCCCTTCTCCCCCTCCTTTCCCCTCTTTCTTCCCTCTTCTTTTCCTTCCCCCCCTCTCCTCTCTTTTTCTTTCTCCCCCCTTTTCTCCCCCTTTCCCTTTCCCCCCCCTTCTCCTCCTTCCTTTCCTTTCCCCTCTCTTCCCCTCCCCTCTTTCCCTTCCCCTTTTTCCTTCTCCCCCCTCTTCCCCTTTCTCCCTCCCTCTTCCTTCCCCTCCCCCTC
+
>AA11BDFFF3D1E0BFF1BA0AE0A//0AA110//00011AAAA///01211B>0/01@221210100112111211@1000/>///0001<1012111011122210<///01111110/.-01111=<1111<.----./0000///00/00000009000//.//0000///..9---9/9;////-/----////////////------/9/-9--//////-------///////--9-----;-

So, I was provided with the raw data folder (as far as I can guess). It contains the following subfolders: Thumbnail_Images, Recipe, Logs, InterOp, Data, Config, etc. Does this look like raw data? There are some bcl files within as well. What should my next steps be?

ADD REPLY

There are some bcl files within as well.

Yes that is the raw data folder.

Unfortunately there are no indexes in the headers in the undetermined file so you will need to reprocess this data using bcl2fastq or the on-sequencer software. Doing this one time could prove to be a pain in the ... so see if you can find some local expertise who would be willing to help with demultiplexing the data. Offer them a beer. With a working install of bcl2fastq this will be a max 30 min process.

If that is not possible then try this tool before you start chasing down bcl2fastq: https://gatk.broadinstitute.org/hc/en-us/articles/360037051752-IlluminaBasecallsToFastq-Picard-

There is a docker version of bcl2fastq that you could try next: bcl2fastq on Mac

Final option would be to install bcl2fastq: https://sarahpenir.github.io/linux/Installing-bcl2fastq/
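For anyone checking their own files the same way: in the usual Illumina FASTQ header, the block after the space has four colon-separated fields (read:filtered:control:index). When the demultiplexer records the barcode, the last field is the index sequence, or two sequences joined by "+" for dual indexes. In the header posted above it is just "0". Below is a minimal Python sketch of that check. The helper name `header_index` and the second example header are made up for illustration.

```python
import re

# The index field holds one barcode (A/C/G/T/N), or two joined by '+'.
INDEX_RE = re.compile(r"^[ACGTN]+(\+[ACGTN]+)?$")


def header_index(header):
    """Return the index sequence stored in a FASTQ header, or None if there is none."""
    parts = header.strip().split()
    if len(parts) < 2:
        return None  # no "read:filtered:control:index" block after the read name
    last_field = parts[1].split(":")[-1]
    return last_field if INDEX_RE.match(last_field) else None


# Header from the Undetermined file posted above: the last field is "0".
print(header_index("@M02399:61:000000000-JF9YH:1:1101:15389:1338 1:N:0:0"))
# Made-up header showing what a recorded dual index would look like.
print(header_index("@M00000:1:000000000-XXXXX:1:1101:1000:2000 1:N:0:ATCACGAT+TTAGGCAA"))
```

The first call prints None, which is why tools that demultiplex by header cannot work on this file. The second prints ATCACGAT+TTAGGCAA.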

ADD REPLY

Thank you for taking the time to answer. It makes sense. The major problem is that we currently have some issues with those who previously participated on the bioinformatics side. That is the reason, why I keep bombarding you with questions. And by the way, may I offer you a beer? =) Jokes aside, I feel like you are the person who understands the underlying problem. If you have some time and will (of course), I would be happy to discuss possible collaboration. Certainly, on a paid basis.

ADD REPLY

bcl2fastq is pretty easy .. I could provide a simple command line, just like I use it for our data in our pipeline. It runs a few minutes on a "standard" server using 20 cores or so. So you can play around with your index sequences.

Addendum, example:

/path/to/bcl2fastq \
  --runfolder-dir /path/to/rawdata/RUN_NAME \
  --output-dir /path/to/results/RUN_NAME \
  --sample-sheet /path/to/samplesheet.csv  \
  --loading-threads 20  \
  --writing-threads 8  \
  --processing-threads 20  \
  --barcode-mismatches 1  \
  --use-bases-mask y100n,i8,i8,y100n  \
  --minimum-trimmed-read-length 0  \
  --mask-short-adapter-reads 0

The use-bases-mask should be modified accordingly.

ADD REPLY

@sklages has given you the command line to use.

You will need to adjust

--use-bases-mask y100n,i8

Format here is Y(Number_of_cycles),I(index1Number_of_cycles),I(index2Number_of_cycles)(if dual indexed),Y(Number_of_cycles)(if paired-end)

So --use-bases-mask Y100,I8,I8,Y100, if your run is paired-end 100 cycles and has dual indexes or --use-bases-mask Y50,I8,Y50, if your run is 50 cycles paired-end, single index.

Use numbers for loading, writing and processing threads that together do not exceed the number of cores available on your machine, i.e. loading + writing + processing threads < (number of cores - 1) on your machine.
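If you are unsure how many cycles each read has, the run folder normally contains a RunInfo.xml that lists them. Here is a rough Python sketch that turns those entries into a mask string in the format described above. It assumes the usual layout of `<Read Number=... NumCycles=... IsIndexedRead=.../>` elements, so check it against your own file. The function name and the example XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def bases_mask_from_runinfo(runinfo_xml, mask_last_cycle=False):
    """Build a --use-bases-mask string from the <Read> entries of RunInfo.xml text."""
    root = ET.fromstring(runinfo_xml)
    reads = sorted(root.iter("Read"), key=lambda r: int(r.get("Number")))
    parts = []
    for read in reads:
        cycles = int(read.get("NumCycles"))
        if read.get("IsIndexedRead") == "Y":
            parts.append(f"I{cycles}")            # index read
        elif mask_last_cycle and cycles > 1:
            parts.append(f"Y{cycles - 1}N")       # use all but the last cycle
        else:
            parts.append(f"Y{cycles}")            # sequence read, all cycles
    return ",".join(parts)


# Made-up example: paired-end 100 cycles with dual 8-base indexes.
EXAMPLE_RUNINFO = """<RunInfo>
  <Run>
    <Reads>
      <Read Number="1" NumCycles="100" IsIndexedRead="N" />
      <Read Number="2" NumCycles="8" IsIndexedRead="Y" />
      <Read Number="3" NumCycles="8" IsIndexedRead="Y" />
      <Read Number="4" NumCycles="100" IsIndexedRead="N" />
    </Reads>
  </Run>
</RunInfo>"""

print(bases_mask_from_runinfo(EXAMPLE_RUNINFO))
```

For this example it prints Y100,I8,I8,Y100, matching the paired-end, dual-index case above. With `mask_last_cycle=True` the last cycle of each sequence read is masked instead (Y99N here), which is the style used in the y100n example for a 101-cycle read.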

ADD REPLY

.. and the writing-threads should not be higher than the number of samples (to be written). For such a small run these values could probably simply be omitted :-)

ADD REPLY

So, if the number of samples is ~100, what should I set writing-threads to? Does it make sense to try to run it on my own machine (1.4 GHz 4‑core Intel Core i5 MacBook Pro), or won't it work? And what should I use for loading+writing+processing?

ADD REPLY

oh, an old macbook ... hmm

Just give it a try and omit these parameters and let bcl2fastq choose the defaults (I think these are 4 cores). This is just a MiSeq run, not really much to do...

Though Illumina states that it needs 32GB RAM ... and I assume your MacBook has less memory. Never tried it on such a low-mem machine. So if you run into trouble with that, try to reduce the cores to one or two (depending on "how" bcl2fastq fails).

If you fail to run the demultiplexing due to low memory, you should find someone to do that processing for you, as initially proposed.

ADD REPLY

Dear colleagues, I would like to thank you for your help and useful comments. I have now received the updated version of my data from the bioinformatics facility. And now the files are not empty. Yeaaah ) Can't wait to get some results from that. I would just like to ask you a few questions if you don't mind. If you believe that they require a new thread, please let me know.

  1. Could you suggest any good options to perform comprehensive bioinformatic analysis on the data. Just in case I will not succeed with the local guys. Any place I could look for collaboration, certainly not for free. Maybe some good Discord servers/forums? Or maybe there are some online services (I've tried ccmp.edu but haven't succeeded so far)? Anyhow, I would appreciate advice on taking a further step.
  2. Any good source, maybe a book, to teach me the basics on the issue. I appreciate it is a large field, but since I'm familiar with the background (PhD in medicine), maybe there is some good book or course on that?
  3. And regarding the machine. I've been using Mac OS for the last 15 years and I'm pretty happy with the devices. Does any iMac/MBP currently presented on the market fulfil the criteria for the routine processing of the NGS data?
ADD REPLY
2.8 years ago

I was advised to "restart the analysis program (Illumina MiSeq Reporter) and use the correct barcodes so that the reads are split by sample. And, as the next step, analyze the data obtained".

What I'd do is fix the samplesheet, and ask the people who gave you the fastqs if they would be kind enough to remake the fastqs with the right information. To my mind, it makes much more sense to have the people who demultiplex data every day do this for you, than for you to have to learn how to do something that you will hardly ever have to do again.
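Before sending a corrected samplesheet back, it can help to list what it actually contains. The Python sketch below assumes the common Illumina layout, where a [Data] section is followed by a header row with Sample_ID, index and (for dual indexing) index2 columns. Your sheet may use different column names. The function names and the example sheet are made up for illustration.

```python
import csv
import io


def samplesheet_indexes(text):
    """Return (Sample_ID, index, index2) tuples from the [Data] section of a samplesheet."""
    lines = text.splitlines()
    start = None
    for i, line in enumerate(lines):
        if line.strip().lower().startswith("[data]"):
            start = i + 1  # the column header row follows the section marker
            break
    if start is None:
        raise ValueError("no [Data] section found")
    entries = []
    for row in csv.DictReader(io.StringIO("\n".join(lines[start:]))):
        sample = (row.get("Sample_ID") or "").strip()
        if sample:
            i7 = (row.get("index") or "").strip().upper()
            i5 = (row.get("index2") or "").strip().upper()
            entries.append((sample, i7, i5))
    return entries


def index_problems(entries):
    """Flag missing indexes, odd characters, duplicate pairs and mixed index lengths."""
    problems, seen = [], {}
    for sample, i7, i5 in entries:
        if not i7:
            problems.append(f"{sample}: no index")
        elif set(i7 + i5) - set("ACGTN"):
            problems.append(f"{sample}: unexpected characters in index")
        elif (i7, i5) in seen:
            problems.append(f"{sample}: same index pair as {seen[(i7, i5)]}")
        else:
            seen[(i7, i5)] = sample
    if len({len(i7) for _, i7, _ in entries if i7}) > 1:
        problems.append("index lengths differ between samples")
    return problems


# Made-up sheet in which S3 repeats the index pair of S1.
EXAMPLE_SHEET = """[Header]
Investigator Name,Example
[Data]
Sample_ID,Sample_Name,index,index2
S1,tumour_1,ATCACGAT,TTAGGCAA
S2,tumour_2,CGATGTAA,TTAGGCAA
S3,tumour_3,ATCACGAT,TTAGGCAA
"""

print(index_problems(samplesheet_indexes(EXAMPLE_SHEET)))
```

This only catches formal mistakes. It cannot tell you whether the listed indexes are the ones actually used in the library prep, so that still has to be confirmed with whoever made the libraries.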

ADD COMMENT

That's what I would propose as well (my perspective is from a sequencing service provider). It's usually the fastest and safest way to get your data.

ADD REPLY
2.8 years ago
GenoMax 147k

Don't worry. You are not going to lose the data (at least not because it can't be demultiplexed).

If you know what the correct indexes are then you should be able to demultiplex the data using demuxbyname.sh from BBMap suite. I have an example that you can follow here: demuxbyname.sh output help

If you want to actually find out what the sequencer sequenced for indexes (since sometimes sequencers will read RC of the index) you can find that information by following the code here: Demultiplexing reads with index present in the labels and then go back and use solution above.
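As a rough illustration of that idea (not the code from the linked thread), here is a Python sketch that tallies the index field of each FASTQ header and checks whether an observed index matches an expected one as given or as its reverse complement. It only helps when the headers actually carry index sequences. All function names and example records are made up.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA sequence (A/C/G/T/N)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def top_header_indexes(fastq_lines, n=10):
    """Count the last colon-separated header field of each record (every 4th line)."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:  # header lines are lines 1, 5, 9, ... of the file
            counts[line.strip().rsplit(":", 1)[-1]] += 1
    return counts.most_common(n)


def orientation(observed, expected):
    """Report how an observed index relates to the one written in the samplesheet."""
    if observed == expected:
        return "as given"
    if observed == revcomp(expected):
        return "reverse complement"
    return "no match"


# Made-up records: two reads with index CGTGAT, one with TTAGGC.
records = [
    "@r1 1:N:0:CGTGAT", "ACGT", "+", "IIII",
    "@r2 1:N:0:CGTGAT", "ACGT", "+", "IIII",
    "@r3 1:N:0:TTAGGC", "ACGT", "+", "IIII",
]
print(top_header_indexes(records))
print(orientation("CGTGAT", "ATCACG"))
```

For a real file, pass an open file handle instead of the list (for .gz files, `gzip.open(path, "rt")` from the standard library). If the most frequent observed indexes turn out to be reverse complements of the samplesheet entries, the samplesheet orientation is the likely culprit.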

You can also use a tool called deML (LINK).

ADD COMMENT

First, you should check the first few sequences of the fastq to make sure the indices are in there. In my experience, when I use fastqs demultiplexed by the built-in MiSeq software, there are no indices in the undetermined file.

ADD REPLY

when I use fastqs demultiplexed by the built-in MiSeq software, there are no indices in the undetermined file.

If index sequencing has failed then there is no way to recover the data. There will be no indices in the undetermined file then. That is the doomsday scenario.

Assuming that is not the case here, if the samplesheet is incorrect in any way then the reads are going to end up in the undetermined file. OP has said that MiSeq is creating sample files that are empty alongside an "undetermined" file. We never use the built-in MiSeq software but I assume it works the same as external bcl2fastq.

ADD REPLY

Sure, if you have access to the raw data. If not, it is easiest to simply get in touch with the sequencing facility. :-)

ADD REPLY

From original post:

I'm trying to extract the information and store it into correct separate files.

It sounds like OP has been handed/has access to "undetermined" data files.

I agree with you and @swbarnes2 that this should/would be best done by the sequence provider. If OP made the libraries then the onus is on them to provide the corrections to the facility.

ADD REPLY

My experience is that this happens when index sequencing is fine... but the Undetermined files will just lack the indices on the header lines. When I run bcl2fastq myself, they appear, but with the default MiSeq demultiplexing, they don't. I'm just suggesting that it may or may not be possible to demultiplex starting from Undetermined fastqs. It might be that the demultiplexing has to go back to bcl2fastq.

ADD REPLY

but with the default MiSeq demultiplexing, they don't.

Good to know about the behavior of the on-sequencer software. If the indexes are not in the headers then perhaps that is why OP can't demultiplex the data. The provider really should step up and help out here.

ADD REPLY
