Question

demultiplex a dataset when you have barcodes as a separate fastq

3

7.7 years ago

IP ▴ 780

Hi Biostars:

I have received raw sequencing data from a collaborator, and the data is not demultiplexed. The fastq files that I usually have to analyse and demultiplex look like this:

Barcode + sequence

Then one can use software like barcode_splitter or demultiplex.py from the FourCseq package to demultiplex the samples.

However, now I have three fastq files, example:

One for the left reads:

@JLK5VL1:840:HLKVHBCXX:1:1101:1489:2056 1:N:0:
NTCCTTAAACCTCTGGTAGAATTTGGCTGTGAATCCATCTGGTCCTGGACTCTTTTTGGTTGGTAAGCTATTGAT
+
#<DDDHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHHII

One for the right reads:

@JLK5VL1:840:HLKVHBCXX:1:1101:1489:2056 3:N:0:
AATAGACGCAATAAAAAATGATAAAGGGGAAATCACCACCAATCCCACAGAAATACAAACTACCATCAGAGAATA
+
DDDDDIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

And a last file with the barcode associated with the above read pair. Note that the read ID in the header is the same for the three entries; only the read number (1, 2 or 3) differs.

@JLK5VL1:840:HLKVHBCXX:1:1101:1489:2056 2:N:0:
GAGTGGAT
+
DCDDDIH<

Of course, I have a file with the barcode associated to each sample:

SAMPLE    INDEX     INDEX2
sample_6  GAGTGG    NA

I have tried to look for software to demultiplex fastq files when the data is in this format (left_read.fastq, right_read.fastq and barcodes.fastq), but I have not been able to find anything. I feel that I could solve this with python using pysam, but since my collaborator is not a bioinformatician, I guess that there must be a tool for handling this.

So, long story short: is there a tool for demultiplexing datasets that are in the format left_reads.fastq, right_reads.fastq, barcodes.fastq?

best, and thanks for reading

demultiplex next-gen sequencing • 12k views

ADD COMMENT • link updated 4.7 years ago by Biostar 20 • written 7.7 years ago by IP ▴ 780

3

Ask them to have whoever did the sequencing demultiplex the files. The three files you're getting are the output of the demultiplexing software, but whoever ran it explicitly requested that output, since the default would be to demultiplex everything into separate files (i.e., what you and everyone else in the world actually wants). Don't waste time on this; have the person who produced the files do so correctly.

ADD REPLY • link 7.7 years ago by Devon Ryan 105k

0

If that is the answer, I assume that they have done something wrong. This is not a standard format for providing the data, right?

Whatever your answer is, thanks for replying.

ADD REPLY • link 7.7 years ago by IP ▴ 780

1

There have been variations of the Qiime (metagenomics) pipeline over the years where the barcode was expected to be in a separate file (which is what you have). The Qiime package may have a utility program to demultiplex this data. Take a look there.

The provider has not done "something wrong" (especially if this was what was requested), but they can easily fix this (provided this is not an old dataset) and give you properly demultiplexed files.

ADD REPLY • link 7.7 years ago by GenoMax 150k

0

Correct, they specified the --create-fastq-for-index-reads option and apparently didn't use a sample sheet. They just need to drop that option and use a sample sheet. Simply email those two sentences to them.

ADD REPLY • link 7.7 years ago by Devon Ryan 105k

Ram · Answer 1 · 2017-07-31

If you do not find a program for demultiplexing three files at a time, perhaps you can append the barcodes at the beginning of the "left" reads, and then run a paired-end demultiplexer such as TagDust 2?

For an example on how to run TagDust 2, you can look at my tutorial on GitHub.

For how to paste the barcodes, maybe you can follow the example below:

$ cat toto.fq 
@toto1
AAAA
+
HHHH
@toto2
AAAA
+
HHHH

$ perl -nE '++$i % 2 == 0 ? print : say ""' toto.fq | paste -d '' - toto.fq 
@toto1
AAAAAAAA
+
HHHHHHHH
@toto2
AAAAAAAA
+
HHHHHHHH
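
The same pasting step can be sketched in Python, for anyone who prefers that to the one-liner. This is only a minimal sketch: the function names `read_fastq` and `prepend_barcodes` are made up for illustration, and it assumes plain (uncompressed) four-line FASTQ records with the barcode file and the read file in the same order.

```python
def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a trailing blank line)
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def prepend_barcodes(barcode_handle, read_handle, out_handle):
    """Write each read with its barcode (and barcode qualities) glued to the front.

    Assumption: both inputs list the same reads in the same order.
    zip() stops at the shorter input, so a length check is left to the caller.
    """
    records = zip(read_fastq(barcode_handle), read_fastq(read_handle))
    for (b_head, b_seq, b_qual), (r_head, r_seq, r_qual) in records:
        # Compare only the read ID, i.e. the header text before the first space.
        if b_head.split()[0] != r_head.split()[0]:
            raise ValueError(f"read ID mismatch: {b_head!r} vs {r_head!r}")
        out_handle.write(f"{r_head}\n{b_seq}{r_seq}\n+\n{b_qual}{r_qual}\n")
```

It would be called with three open file handles, for example the barcode file, the "left" reads and an output file opened for writing. The result can then go to a demultiplexer that expects the barcode at the start of the read.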

score 1 · Answer 2 · 2017-07-31

1

7.7 years ago

lelle ▴ 830

I agree with Devon Ryan that it is probably easiest to get the data in the format you want from your sequencing provider. If that is not possible, you can use Flexbar, which supports separate barcode reads.

ADD COMMENT • link 7.7 years ago by lelle ▴ 830

0

I only looked cursorily at the Flexbar page. Are you sure it can handle the situation here (where the barcode reads are in a separate file)? It does not seem to be the case from my quick look.

ADD REPLY • link 7.7 years ago by GenoMax 150k

0

Yes, with the -br option. I am not sure if it works when you have two barcode read files.

ADD REPLY • link 7.7 years ago by lelle ▴ 830

score 1 · Answer 3 · 2018-05-03

1

6.9 years ago

GenoMax 150k

The answer "A: Demultiplexing Illumina data" has a solution for this. I am posting it here to create a cross-reference.

ADD COMMENT • link 6.9 years ago by GenoMax 150k
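
For the three-file layout described in the question (left reads, right reads and barcode reads in the same order), the general idea can also be sketched directly in Python. This is not the method from the linked answer, just a minimal illustration under stated assumptions: the names `read_fastq`, `demultiplex` and `open_output` are hypothetical, the files are plain uncompressed FASTQ, and a barcode read is assigned to a sample only when it starts with that sample's index exactly (no mismatches allowed), since the index in the sample sheet (GAGTGG) is shorter than the barcode read (GAGTGGAT).

```python
def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a trailing blank line)
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def demultiplex(left, right, barcodes, index_to_sample, open_output):
    """Route each read pair to a sample according to its barcode read.

    index_to_sample: dict of index sequence -> sample name (from the sample sheet).
    open_output: hypothetical callback (sample, mate) -> writable handle,
                 so the caller decides how output files are named.
    Returns (pair counts per sample, dict of opened handles for the caller to close).
    """
    counts, handles = {}, {}
    triples = zip(read_fastq(left), read_fastq(right), read_fastq(barcodes))
    for l_rec, r_rec, b_rec in triples:
        # All three records must share the read ID (header text before the first space).
        if len({rec[0].split()[0] for rec in (l_rec, r_rec, b_rec)}) != 1:
            raise ValueError(f"input files out of sync at {l_rec[0]!r}")
        sample = "undetermined"
        for index, name in index_to_sample.items():
            if b_rec[1].startswith(index):  # exact prefix match only
                sample = name
                break
        for mate, rec in (("R1", l_rec), ("R2", r_rec)):
            key = (sample, mate)
            if key not in handles:
                handles[key] = open_output(sample, mate)
            handles[key].write("{}\n{}\n+\n{}\n".format(*rec))
        counts[sample] = counts.get(sample, 0) + 1
    return counts, handles
```

With real files, `open_output` could be something like `lambda sample, mate: open(f"{sample}_{mate}.fastq", "w")`, and `index_to_sample` would be `{"GAGTGG": "sample_6"}` for the sample sheet shown in the question. Pairs whose barcode matches no index end up under "undetermined".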