Hi Biostars:
I have received raw sequencing data from a collaborator, and the data are not demultiplexed. What I usually see in the FASTQ files that I have to analyse and demultiplex is the following:
Barcode + sequence
And then one can use software like barcode_splitter or demultiplex.py from the FourCseq package to demultiplex the samples.
However, now I have three FASTQ files, for example:
One for the left reads:
@JLK5VL1:840:HLKVHBCXX:1:1101:1489:2056 1:N:0:
NTCCTTAAACCTCTGGTAGAATTTGGCTGTGAATCCATCTGGTCCTGGACTCTTTTTGGTTGGTAAGCTATTGAT
+
#<DDDHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHHII
One for the right reads:
@JLK5VL1:840:HLKVHBCXX:1:1101:1489:2056 3:N:0:
AATAGACGCAATAAAAAATGATAAAGGGGAAATCACCACCAATCCCACAGAAATACAAACTACCATCAGAGAATA
+
DDDDDIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
And a last file with the barcode associated with the above read pair. Note that the read ID in the header is the same for the three entries; only the read number (1, 2 or 3) differs.
@JLK5VL1:840:HLKVHBCXX:1:1101:1489:2056 2:N:0:
GAGTGGAT
+
DCDDDIH<
Of course, I have a file with the barcode associated to each sample:
SAMPLE INDEX INDEX2
sample_6 GAGTGG NA
I have tried to look for software to demultiplex FASTQ data in this format (left_read.fastq, right_read.fastq and barcodes.fastq), but I have not been able to find anything. I feel that I could solve this with Python using pysam, but since my collaborator is not a bioinformatician, I guess that there must be an existing tool for handling this.
So, long story short: is there a tool for demultiplexing datasets that are in the format left_reads.fastq, right_reads.fastq, barcodes.fastq?
best, and thanks for reading
Ask them to have whoever did the sequencing demultiplex the files. The three files you're getting are the output of the demultiplexing software, but whoever ran it explicitly requested that output, since the default would be to demultiplex everything into separate files (i.e., what you and everyone else in the world actually wants). Don't waste time on this; have the person who produced the files do it correctly.
If that is the answer, I assume that they have done something wrong; this is not a standard format for providing the data, right?
Whatever your answer is, thanks for replying.
There have been variations of the QIIME (metagenomics) pipeline over the years where the barcode was expected to be in a separate file (which is what you have). The QIIME package may have a utility program to demultiplex this data. Take a look there.
The provider has not done "something wrong" (especially if this was what was requested), but they can easily fix this (provided this is not an old dataset) and give you properly demultiplexed files.
Correct, they specified the
--create-fastq-for-index-reads
option and apparently didn't use a sample sheet. They need to just not specify that option and to use a sample sheet. Simply email those two sentences to them.
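If the provider cannot re-run the demultiplexing (an old dataset, for instance), the three files can still be split with a short script. Below is a minimal sketch in plain Python (standard library only, no pysam), not a tested production tool. It makes several assumptions that are not confirmed anywhere in this thread: the three FASTQ files list reads in the same order, each record is exactly four lines, the sample table is whitespace-separated with a header row as shown in the question, and a barcode read matches a sample when it starts with that sample's INDEX (in the example the barcode read GAGTGGAT is 8 bases while the listed index GAGTGG is 6). The function names and the output file naming are made up for illustration.

```python
import gzip
from pathlib import Path


def read_fastq(path):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ (plain or .gz)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break
            seq = handle.readline().rstrip("\n")
            handle.readline()  # skip the '+' separator line
            qual = handle.readline().rstrip("\n")
            yield header, seq, qual


def load_sample_table(path):
    """Return {INDEX: SAMPLE} from a whitespace-separated table with one header row."""
    barcodes = {}
    with open(path) as handle:
        next(handle)  # skip the "SAMPLE INDEX INDEX2" header
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                barcodes[fields[1]] = fields[0]
    return barcodes


def demultiplex(left, right, index, sample_table, outdir):
    """Split paired reads into <sample>_R1.fastq / <sample>_R2.fastq files.

    Assumption: exact prefix match of the barcode read against INDEX,
    with no mismatches allowed. Unmatched pairs go to 'undetermined'.
    Returns a dict of read-pair counts per sample.
    """
    barcodes = load_sample_table(sample_table)
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    handles = {}
    counts = {}
    try:
        # zip stops at the shortest file, so truncated inputs are not detected here
        for r1, r2, bc in zip(read_fastq(left), read_fastq(right), read_fastq(index)):
            ids = {rec[0].split()[0] for rec in (r1, r2, bc)}
            if len(ids) != 1:
                raise ValueError(f"Read IDs out of sync: {sorted(ids)}")
            sample = next(
                (name for idx, name in barcodes.items() if bc[1].startswith(idx)),
                "undetermined",
            )
            if sample not in handles:
                handles[sample] = (
                    open(outdir / f"{sample}_R1.fastq", "w"),
                    open(outdir / f"{sample}_R2.fastq", "w"),
                )
            for out, (header, seq, qual) in zip(handles[sample], (r1, r2)):
                out.write(f"{header}\n{seq}\n+\n{qual}\n")
            counts[sample] = counts.get(sample, 0) + 1
    finally:
        for pair in handles.values():
            for out in pair:
                out.close()
    return counts
```

It could be called as `demultiplex("left_reads.fastq", "right_reads.fastq", "barcodes.fastq", "samples.txt", "demux_out")`, where `samples.txt` and `demux_out` are placeholder names. Allowing barcode mismatches or a second index would need extra logic, so getting properly demultiplexed files from the provider remains the simpler route.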