Hello! I'm an MD by background, currently running a project on tumour metagenomics. We collected patient samples (low microbial abundance) and sequenced them on Illumina. The output data from the Illumina run is available. The current issue is as follows: apparently because the barcodes were specified incorrectly, the program split the reads into samples incorrectly. As a result, almost all sample files are almost empty, and most of the information (judging by file sizes) is stored in the Undetermined.fastq files. I'm trying to extract that information and store it in the correct separate files. The problem is that I'm not very experienced in doing that, and we are having trouble contacting our bioinformatician.
I was advised to "restart the analysis program (Illumina MiSeq Reporter) with the correct barcodes so that the reads are split by sample and, as the next step, analyze the data obtained".
I'm looking for any advice, as specific as possible. If there is anything I can do by myself, please let me know (I'm familiar with R). Otherwise I would highly appreciate any other open-source or paid solution to finally obtain the results. We've done a huge amount of work collecting those samples and it would be frustrating to lose it.
Any advice is appreciated. Best regards.
Dear colleagues, Thank you very much for so many quick and practical tips. To my great regret, we had several unforeseen events with those who previously worked on Illumina and those who analyzed the data. In this situation, I have to look for solutions on my own, but obviously, I lack enough knowledge and experience in that field. A very unpleasant situation, from which, nevertheless, I need to find a way out. I'm wondering if I can provide a link to raw data here, and maybe you could help me with this step (putting the data into the right samples) if it doesn't take a lot of time? I can see that there is a samplesheet.csv with some indexes. I am very embarrassed to ask such a question, but I am at a dead end. If this is unacceptable, please let me know.
Upd: I've tried to run demultiplex and sabre, but both end up the same way: empty sample files and a large Undetermined file (I have to admit that I'm not 100% sure I did everything right, but still). As far as I can see, no indices are present in the read headers, as suggested by GenoMax.
Anyhow, I'm feeling stuck and literally begging for help.
Did you try the awk code I mentioned below? Can you show us the output of the following commands:
If your files are not compressed
If your files are compressed (i.e. have .gz extension) then
If that is the case (which we can verify by the command above) then, as suggested by @swbarnes2, this data will need to be reprocessed starting with the full raw data folder from MiSeq. There is no way around this.
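For anyone wanting to check the headers themselves: one way is to tally the last colon-separated field of each read header, which is where Illumina-style FASTQ headers normally carry the index sequence. This is only a sketch and not the awk code referred to above; the function name is made up for illustration.

```shell
# Hypothetical helper (illustrative name): tally the last colon-separated
# field of every read header, e.g. the "ACGTACGT" in
# "@M0...:1101:15589:1332 1:N:0:ACGTACGT".
count_header_indexes() {
    fastq="$1"
    # Decompress only when the file name ends in .gz
    case "$fastq" in
        *.gz) gzip -cd "$fastq" ;;
        *)    cat "$fastq" ;;
    esac |
        # Header lines are lines 1, 5, 9, ... of a four-line FASTQ record
        awk 'NR % 4 == 1 { n = split($0, f, ":"); print f[n] }' |
        sort | uniq -c | sort -k1,1nr | head -n 20
}
```

Called as count_header_indexes followed by the path to the Undetermined file, it prints the 20 most frequent values. Nucleotide strings mean the indexes are in the headers; a bare number or an empty field means they are not, and the data has to be reprocessed from the raw run folder.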
Here comes the output from the command above:
So, I was provided with the raw data folder (as far as I can guess). It contains the following subfolders: Thumbnail_Images, Recipe, Logs, InterOp, Data, Config, etc. Does it look like raw data? There are some bcl files in there as well. What should my next steps be?
Yes that is the raw data folder.
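A quick way to double-check a folder like this is to look for the two things bcl2fastq reads from it: RunInfo.xml at the top level and the BCL files under Data/Intensities/BaseCalls. The sketch below does that and also prints the Read entries from RunInfo.xml, which list the number of cycles per read and whether a read is an index read. The function name is hypothetical.

```shell
# Hypothetical helper: report whether a directory has the pieces that
# bcl2fastq needs from a run folder. Returns 1 if something is missing.
check_run_folder() {
    run_dir="$1"
    rc=0
    for item in RunInfo.xml Data/Intensities/BaseCalls; do
        if [ -e "$run_dir/$item" ]; then
            echo "found:   $item"
        else
            echo "missing: $item"
            rc=1
        fi
    done
    # Show the <Read .../> entries (cycles per read, index read or not)
    if [ -f "$run_dir/RunInfo.xml" ]; then
        grep -o '<Read [^>]*>' "$run_dir/RunInfo.xml" || true
    fi
    return "$rc"
}
```

The cycle counts printed here are the numbers that later go into the bases mask.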
Unfortunately there are no indexes in the headers in the undetermined file so you will need to reprocess this data using bcl2fastq or the on-sequencer software. Doing this one time could prove to be a pain in the ... see if you can find some local expertise who would be willing to help with demultiplexing the data. Offer them a beer. With a working install of bcl2fastq this will be a max 30 min process.
If that is not possible then try this tool before you start chasing down bcl2fastq: https://gatk.broadinstitute.org/hc/en-us/articles/360037051752-IlluminaBasecallsToFastq-Picard-
There is a docker version of bcl2fastq that you could try next: bcl2fastq on Mac
Final option would be to install bcl2fastq: https://sarahpenir.github.io/linux/Installing-bcl2fastq/
Thank you for taking the time to answer. It makes sense. The major problem is that we currently have some issues with those who previously worked on the bioinformatics side. That is the reason why I keep bombarding you with questions. And by the way, may I offer you a beer? =) Jokes aside, I feel like you are the person who understands the underlying problem. If you have the time and are willing (of course), I would be happy to discuss a possible collaboration. Certainly, on a paid basis.
bcl2fastq is pretty easy .. I could provide a simple command line, just like I use it for our data in our pipeline. It runs a few minutes on a "standard" server using 20 cores or so. So you can play around with your index sequences.
Addendum, example:
The use-bases-mask should be modified accordingly.
@sklages has given you the command line to use.
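The example command itself is not reproduced above. As a rough sketch of what a bcl2fastq (v2.x) call commonly looks like, and not necessarily the one sklages uses, the wrapper below only prints the command so it can be inspected first. The function name and all paths are placeholders, and the mask value has to match the actual run.

```shell
# Hypothetical wrapper: prints a bcl2fastq command instead of running it.
# Arguments: run folder, output folder, sample sheet, bases mask.
print_bcl2fastq_cmd() {
    run_dir="$1"; out_dir="$2"; sample_sheet="$3"; bases_mask="$4"
    # --barcode-mismatches 1 is the bcl2fastq default, shown for visibility
    echo bcl2fastq \
        --runfolder-dir "$run_dir" \
        --output-dir "$out_dir" \
        --sample-sheet "$sample_sheet" \
        --use-bases-mask "$bases_mask" \
        --barcode-mismatches 1
}
```

For example, print_bcl2fastq_cmd /path/to/run /path/to/out /path/to/SampleSheet.csv Y100,I8,I8,Y100 prints one line that can be copied into the terminal once the paths and the corrected sample sheet are in place.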
You will need to adjust --use-bases-mask.
Format here is
Y(Number_of_cycles),I(index1Number_of_cycles),I(index2Number_of_cycles)(if dual indexed),Y(Number_of_cycles)(if paired-end)
So --use-bases-mask Y100,I8,I8,Y100, if your run is paired-end 100 cycles and has dual indexes, or --use-bases-mask Y50,I8,Y50, if your run is 50 cycles paired-end, single index.
Use numbers for loading, writing and processing threads that together do not exceed the number of cores available on your machine, i.e. loading+writing+processing cores < (number of cores-1) on your machine.
... and the writing-threads should be no higher than the number of samples (to be written). For such a small run these values could probably simply be omitted :-)
So, if the number of samples is ~100, what should I put in writing-threads? Does it make sense to try to run it on my own machine (1,4 GHz 4‑core Intel Core i5 MacBook Pro) or won't it work? And what should I state in loading+writing+processing?
oh, an old macbook ... hmm
Just give it a try and omit these parameters and let bcl2fastq choose the defaults (I think these are 4 cores). This is just a MiSeq run, not really much to do...
Though Illumina states that it needs 32GB RAM ... assuming your MacBook has less memory. Never tried it on such a low-mem machine. So if you run into trouble with that, try to reduce the cores to one or two (depending "how" bcl2fastq fails).
If you fail to run the demultiplexing due to low memory, you should find someone to do that processing for you, as initially proposed.
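To make the mask format and the thread rule of thumb from the comments above concrete, here is a small sketch. Both function names are invented for illustration; the first assembles a mask from cycle counts, the second only checks a proposed thread setting against the two rules as stated (it does not pick values for you).

```shell
# Build a --use-bases-mask value from cycle counts.
# Arguments: read1 index1 index2 read2 (pass 0 for a read that was not run).
build_bases_mask() {
    mask="Y$1"
    if [ "${2:-0}" -gt 0 ]; then mask="$mask,I$2"; fi
    if [ "${3:-0}" -gt 0 ]; then mask="$mask,I$3"; fi
    if [ "${4:-0}" -gt 0 ]; then mask="$mask,Y$4"; fi
    echo "$mask"
}

# Check thread counts against the rules of thumb from the thread:
#   loading + writing + processing < cores - 1
#   writing <= number of samples
# Arguments: cores samples loading writing processing
check_threads() {
    cores="$1"; samples="$2"; loading="$3"; writing="$4"; processing="$5"
    total=$((loading + writing + processing))
    if [ "$total" -ge $((cores - 1)) ]; then
        echo "too many threads: $total requested, need fewer than $((cores - 1))"
        return 1
    fi
    if [ "$writing" -gt "$samples" ]; then
        echo "writing threads ($writing) exceed samples ($samples)"
        return 1
    fi
    echo "ok"
}
```

build_bases_mask 100 8 8 100 gives Y100,I8,I8,Y100 and build_bases_mask 50 8 0 50 gives Y50,I8,Y50, matching the two examples above. On a 4-core machine even one thread of each kind fails the first rule (check_threads 4 100 1 1 1), which is consistent with the advice to leave the thread options out there.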
Dear colleagues, I would like to thank you for your help and useful comments. I have now received the updated version of my data from the bioinformatics facility. And now the files are not empty. Yeaaah ) Can't wait to get some results from that. I would just like to ask you a few more questions, if you don't mind. If you believe that they require a new thread, please let me know.