Question

How to Demultiplex a fastq.gz file.

0

5.3 years ago

eli_bayat ▴ 90

I am a new postdoc and I was given a folder of fastq.gz files. I was told they are not demultiplexed, so I need to extract the reads for each sample from each of these fastq files (each file contains reads for multiple subjects), save them as separate fastq files, and run the dada2 pipeline on them to get ASVs. My apologies if I am not using some terms correctly; I am very new to this. I have worked with ASV tables before, but I have never done demultiplexing. If you can tell me how to do it, or what software or platform I can use to separate these samples, I would appreciate your help.

illumina de-multiplexing Miseq fastq dada2 • 9.1k views

ADD COMMENT • link 5.3 years ago by eli_bayat ▴ 90

2

Are the sample barcodes in the indices, or are they internal to the read? Have they been pulled out of the read and moved to the read name? If the usual Illumina indices were used to multiplex, it is far easier to demultiplex while the fastqs are being generated than to do it after the fact.

ADD REPLY • link 5.3 years ago by swbarnes2 14k

0

This is what the data looks like when I open a fastq file in the terminal. There is also a barcode text file listing the sample ID and barcode pair name.

(image: screenshot of the fastq file contents)

MWI006 is the sample ID, and there are many such IDs with different numbers in one fastq file, which means I need to demultiplex the samples.

ADD REPLY • link updated 5.3 years ago by ATpoint 85k • written 5.3 years ago by eli_bayat ▴ 90

0

That pic doesn't work for me, just copy and paste the text.

ADD REPLY • link 5.3 years ago by swbarnes2 14k

0

Sorry about that, I am pretty new to this forum.

@M01380:62:000000000-B547W:1:1102:20819:1013 1:N:0:MWI006 NGCCTCTT|1|NCTGCATA|1
NGTAGAGTTTGATTCTGGCTCAGGATGAACGCTGACAGAATGCTTAACACATGCAAGTCTACTTGATCCTTCGGGTGATGGTGGCGGACGGGTGAGTAACGCGTAAAGAACTTGCCCTGCAGTCTGGGACAACATTTGGAAACGAATGCTAATACCGGATATTATGCGAACTTCGCATGTAGCTCGTATGAAAGCTATATGCGCTGCAGGATAGCTTTGCGTCCTATTAGCTAGTTGGTGAGGTAACGGATCACCAAGGCCATGATCGGTAGCCGGGCTGAGTGTGTGAACGGCCGCAAGG
+
#8BCCGGGGGGGGGGGGGFGGDFGFFGGFCFGGGDGFF8CEAFGFGGGGEDFGGGGGGFFGGGGGGGGGCFAF7C<+DDFGGGD8@EFFFGGGGFGGGGCCFGGDCGDD?,B?ECG?A<FGDFGGGGGGGFF8FGGFGGG9EFF7BFFFFFFDGCFG7CEFAF@FG,3FGGGG,+FCECGG=CC9:CCFFGGF9>CFFCGGFGGGC*6<@@,9?FC@FG@EC88E?9F?F6>76+>AFC5C5EFAC6C**//02A=EGFEE437>:+1***122)/)/7*)9*:**)01*)87)4),)-1:

@M01380:62:000000000-B547W:1:1102:16288:1015 1:N:0:MWI006 NGCCTCTT|1|NCTGCATA|1
NTACGTAGGGTTCGATCCTGGCTCAGGATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCATCATCAAAGATTGCTTTGATGGATGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTGCCGACAACACTGGGATAGCCTTTCGAAAGAAAGATTAATACCGGATGGCATAATTATTACGCATGGGATAATTATTAAAGAATTTCGGTGGCCGATGGGGGTGCGTTACATTAGGCAGATGGCGGGGGAAAGGCCTACCAAAACAACGACGGATAGGGTGTGTGG
+
#8@ACGG@BEFF87EFFFFF88CFGGFG,EECCF,CF:,,F<FECCFFDFGFGGFDCCEFFFGEGGG:@FCCDF8FFFGFGG8,9@,,?<C<CFGGEFF8FCCEEC7=7FFCG+8+AE<CBEGFEFF:BFFGFC8,,BF7@7CE8B=FAB8,5,,7@FAE**><@,FCCFA@FFCC;,>11*5*>FGFG9,@C9,6=CEGG88+29+3?C+23+49<=9+?BFD8***3==/:=*;**/*1:C**+2+0:+3<C**+76==7*))*2979C**2)2)9)*)*.1>)87:.,9*.,*4).4(

@M01380:62:000000000-B547W:1:1102:15376:1016 1:N:0:MWI005 NGCCTCTT|1|NTAAGGAG|1
NTACGTAGGGTTCGATTCTGGCTCAGGATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGAAGCGGTTTGTCGGAAGTTTTCGGATGGAAGATAAACTGACTGAGTGGCGGACGGGTGAGTAACGCGTGGGTAAACTGCCTCATACAGGGGGGTAAAAGTTAGAACTTACTGATAATACAGCATAAGACAACAGCACCGAATGGTGCAGGGGTAAAAACACCGGGGGTATGAGATGGAGTCGAGAATGATAAGCAAGTTGGAGGGGTGAGTGCATACCAAAACGACGCTCAGCA

ADD REPLY • link updated 5.3 years ago by GenoMax 147k • written 5.3 years ago by eli_bayat ▴ 90

0

I looked up what each line means and I understand it. The only part I don't understand is NGCCTCTT|1|NCTGCATA|1 at the end of the first line. Can you help me with what it means?

ADD REPLY • link 5.3 years ago by eli_bayat ▴ 90

1

Those are probably the sequences of the two indices, but why didn't the people who made the fastqs demultiplex for you? Anyway, you can write a little script in whatever language you like to split out the reads by sample name, since for some reason that's in the read name. If you have a modest number of samples, you can grep for the desired sample names one at a time.
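
A minimal Python sketch of such a splitting script, for illustration only. It assumes every header looks like the ones posted above (sample ID after the last colon of the second whitespace-separated field, e.g. `1:N:0:MWI006`) and that the real file has strict 4-line records with no blank lines between them. The function names `sample_id` and `demultiplex` and the file names in the usage comment are hypothetical.

```python
import gzip
import os
from itertools import islice


def sample_id(header: str) -> str:
    """Extract the sample ID from a header line such as
    '@M01380:...:1013 1:N:0:MWI006 NGCCTCTT|1|NCTGCATA|1' (assumed layout)."""
    # The second whitespace-separated field is '1:N:0:MWI006';
    # the sample ID is whatever follows its last colon.
    return header.split()[1].rsplit(":", 1)[1]


def demultiplex(fastq_gz: str, out_dir: str) -> dict:
    """Write one <sample>.fastq.gz per sample ID; return read counts per sample."""
    os.makedirs(out_dir, exist_ok=True)
    handles, counts = {}, {}
    with gzip.open(fastq_gz, "rt") as fh:
        while True:
            # One record = header, sequence, '+', quality.
            record = list(islice(fh, 4))
            if len(record) < 4:
                break  # end of file (or a truncated final record)
            sid = sample_id(record[0])
            if sid not in handles:
                path = os.path.join(out_dir, f"{sid}.fastq.gz")
                handles[sid] = gzip.open(path, "wt")
                counts[sid] = 0
            handles[sid].writelines(record)
            counts[sid] += 1
    for handle in handles.values():
        handle.close()
    return counts


# Hypothetical usage:
# counts = demultiplex("run_R1.fastq.gz", "demux_R1")
# print(counts)
```

Because reads are written in their original order, running the same function on the forward and reverse files separately should keep the pairs in sync, provided both files carry the same sample IDs in their headers. This keeps one output file open per sample, which is fine for a modest number of samples.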

ADD REPLY • link 5.3 years ago by swbarnes2 14k

1

If you want to try doing this manually yourself, you might look at the posts here: How to subset fastq data based on leading nt of sequences?

ADD REPLY • link 5.3 years ago by steve ★ 3.5k

0

That's not what the OP needs. Their indices are not embedded in the read.

ADD REPLY • link 5.3 years ago by swbarnes2 14k

2

Hi eli_bayat,

Welcome to Biostars. No need to apologize for being new to the community; we all were at some point. As advice, it is recommended to add data and code examples as plain text and highlight them with the code button (10101), which lets others easily copy and paste them, e.g. to test code they might suggest to you.

For embedding images, please use the image button (the one to the right of the 10101 button). You have to paste in the full link to the image from the image host, e.g. https://i.ibb.co/HF8PH8T/(...).png, to make sure it is properly embedded. I made the changes in this thread this time. Cheers!

ADD REPLY • link 5.3 years ago by ATpoint 85k

0

Thanks! I appreciate it :)

ADD REPLY • link 5.3 years ago by eli_bayat ▴ 90

score 1 · Answer 1 · 2019-08-23

You typically demultiplex Illumina sequencing data with the program bcl2fastq. As the name implies, it converts the original basecall files (.bcl) from the sequencer directly into demultiplexed .fastq.gz output, using a .csv-formatted sample sheet. Sequencing facilities typically do this automatically, so your best bet is to figure out who did the sequencing and get them to demultiplex it. Trying to demultiplex after the fact is kind of a waste of time because it will be much harder and slower.