Question

Demultiplexing of the Illumina PE data

0

6.4 years ago

Denis ▴ 310

I'm looking for a convenient tool to demultiplex my Illumina PE data, specifically to extract read pairs that have one particular sequence in the forward read and another particular sequence in the reverse read. Could you advise me, please? For example, initially we have two FASTQ files with forward and reverse reads.

Forward read sequences:

NNNNNNAGTCCGTATATGCCGAGNNNNNNNN
NNNNNNAGAGCGTATATGCCGAGNNNNNNNN
NNNNNNAGTCCGTATATGGGGAGNNNNNNNN

Reverse read sequences:

NNNNNNNNNGAGATGGACTACTCACNNNNNN
NNNNNNNNNGAGATGGATTACTCACNNNNNN
NNNNNNNNNGAGAAGGACTACTCACNNNNNN

So I'd like to extract only the following pair for further analysis:

NNNNNNAGTCCGTATATGCCGAGNNNNNNNN
NNNNNNNNNGAGATGGACTACTCACNNNNNN

This pair qualifies because the forward read contains the AGTCCGTATATGCCGAG tag and the reverse read contains the GAGATGGACTACTCAC tag. For now I need only 100% matches.

next-gen sequencing • 4.7k views

ADD COMMENT • link updated 6.4 years ago by GenoMax 147k • written 6.4 years ago by Denis ▴ 310

3

Hi Denis,

It is always useful to provide examples of the input and the desired output to clarify exactly what you are trying to achieve. Are you looking to select a subset of reads that contain a certain string? Have you looked at related posts on this forum?

Extract specific reads from FASTQ files based on subsequence

Count and location of strings in fastq file reads

ADD REPLY • link 6.4 years ago by Sej Modha 5.3k

0

Hi Sej,

I've updated my post to address your points. Thanks!

ADD REPLY • link 6.4 years ago by Denis ▴ 310

0

You can use the prinseq tool with its -custom_params option and the specific string that you are looking for.

ADD REPLY • link 6.4 years ago by Sej Modha 5.3k

2

Hello Denis,

Thanks for adding an example. But your example doesn't look like your real input, as it is neither FASTA nor FASTQ. Furthermore, what does the task you are trying to solve have to do with demultiplexing?

What I gather from your description is that you're trying to remove duplicate sequences. This can be done, for example, with seqkit:

$ zcat input.fa.gz | seqkit rmdup -s -o output.fa.gz

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

fin swimmer

ADD REPLY • link 6.4 years ago by finswimmer 16k

0

Hi fin swimmer!

Many thanks for your reply and for editing the post. No, I'm working with Illumina amplicon data, so I'd like to extract the read pairs that contain the PCR primers and discard all the other read pairs.

ADD REPLY • link 6.4 years ago by Denis ▴ 310

0

Are these Illumina barcodes or internal barcodes/sequences?

ADD REPLY • link 6.4 years ago by Devon Ryan 104k

0

They are custom internal PCR primers.

ADD REPLY • link 6.4 years ago by Denis ▴ 310

1

Are the primers more or less always in the same place? I'm wondering whether you could use something like umi_tools or a variant of our demultiplexing script for RELACS data to handle this.

ADD REPLY • link 6.4 years ago by Devon Ryan 104k

0

Yes, sure. The primers are at the 5' end of forward and reverse reads.

ADD REPLY • link 6.4 years ago by Denis ▴ 310

0

Then the options I mentioned should work (possibly with some tweaks) too.

ADD REPLY • link 6.4 years ago by Devon Ryan 104k

1

6.4 years ago

gb ★ 2.2k

You could use cutadapt (http://cutadapt.readthedocs.io/en/stable/) or sabre (https://github.com/najoshi/sabre).

There are probably more options.

ADD COMMENT • link 6.4 years ago by gb ★ 2.2k

0

Hi gb,

Thanks for the reply. It seems sabre doesn't support dual-index Illumina data. Am I right? I'll have to check the cutadapt documentation.

ADD REPLY • link 6.4 years ago by Denis ▴ 310

1

This is the demultiplexing part: http://cutadapt.readthedocs.io/en/stable/guide.html#demultiplexing

I am not sure about dual indexes, but both sabre and cutadapt can be used for paired-end reads. What kind of data is it? Amplicon sequencing? In that case I usually merge the reads first with FLASH and do the demultiplexing afterwards. If the tools do not support dual indexes, you could run the process twice: first on the forward index and then on the reverse.

ADD REPLY • link 6.4 years ago by gb ★ 2.2k

0

Ah! I see now that this is about PCR primers. I already thought so, because the Illumina indexes have often been trimmed off already. The merging that I mentioned makes things easier, but whether it works depends on the length of the target, so keep that in mind. If your target is 600 bases, there will be no overlap, or not enough, to merge the reads, so in that case merging is not a good idea.
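
To make the overlap point concrete, here is a rough back-of-the-envelope sketch. The 2x300 bp read length and the minimum overlap of 10 bases are assumptions chosen for illustration, not values taken from this thread or from any tool's settings, and `expected_overlap` is a hypothetical helper name.

```python
def expected_overlap(read_length: int, amplicon_length: int) -> int:
    """Number of bases covered by both mates of a pair.

    Returns 0 when the two reads do not reach each other. The result is
    capped at the amplicon length for amplicons shorter than one read.
    """
    overlap = 2 * read_length - amplicon_length
    return max(0, min(amplicon_length, overlap))


# Assumed values for illustration only: 2x300 bp reads and a
# hypothetical merge threshold of 10 overlapping bases.
READ_LENGTH = 300
MIN_OVERLAP = 10

for amplicon in (450, 600):
    overlap = expected_overlap(READ_LENGTH, amplicon)
    verdict = "mergeable" if overlap >= MIN_OVERLAP else "not mergeable"
    print(f"amplicon {amplicon} bp: overlap {overlap} bp, {verdict}")
```

With these assumed numbers a 450-base target leaves 150 overlapping bases, while a 600-base target leaves none.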

ADD REPLY • link 6.4 years ago by gb ★ 2.2k

score 2 · Accepted Answer · 2018-07-11

Denis: Since you edited this post to bump it to the main page again, I am going to assume that you have not yet found a solution.

I can think of a slightly roundabout way to do this with the filtering option of bbduk.sh (guide here). Because both literals are shorter than bbduk.sh's default k-mer length of 27, k will probably also need to be lowered to fit them (for example k=17 in Step 1 and k=16 in Step 2).
Step 1: Keep the R1 reads that contain AGTCCGTATATGCCGAG by using the literal=AGTCCGTATATGCCGAG outm=file_R1.fq.gz options.
Step 2: Keep the R2 reads that contain GAGATGGACTACTCAC by using the literal=GAGATGGACTACTCAC outm=file_R2.fq.gz options.
Step 3: Use repair.sh in1=file_R1.fq.gz in2=file_R2.fq.gz out1=final_R1.fq.gz out2=final_R2.fq.gz repair to re-pair the two filtered files, so that the final files contain only the pairs in which both R1 and R2 matched. (Note: you may need plenty of memory, depending on the size of the data.)
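
For comparison, the same selection logic can be sketched in plain Python. This is not bbduk.sh or repair.sh: it swaps in a simple exact substring search for bbduk's k-mer matching, and it reads both files in lockstep so no separate re-pairing step is needed. It assumes gzipped four-line FASTQ records and that the two files list the mates in the same order. The function names and file paths are hypothetical.

```python
import gzip
from itertools import islice

FWD_TAG = "AGTCCGTATATGCCGAG"  # tag expected in the forward read (from the question)
REV_TAG = "GAGATGGACTACTCAC"   # tag expected in the reverse read (from the question)


def read_fastq(handle):
    """Yield one FASTQ record at a time as a list of its four lines."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def filter_pairs(r1_in, r2_in, r1_out, r2_out, fwd_tag=FWD_TAG, rev_tag=REV_TAG):
    """Write only the pairs whose R1 contains fwd_tag and whose R2 contains rev_tag.

    Matching is exact (no mismatches). Assumes both inputs hold the mates
    in the same order. Returns the number of pairs kept.
    """
    kept = 0
    with gzip.open(r1_in, "rt") as f1, gzip.open(r2_in, "rt") as f2, \
         gzip.open(r1_out, "wt") as o1, gzip.open(r2_out, "wt") as o2:
        for rec1, rec2 in zip(read_fastq(f1), read_fastq(f2)):
            # Line 2 of each record (index 1) is the sequence.
            if fwd_tag in rec1[1] and rev_tag in rec2[1]:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept
```

A call could look like `filter_pairs("R1.fq.gz", "R2.fq.gz", "final_R1.fq.gz", "final_R2.fq.gz")`, with those file names standing in for your own. Pure Python will be much slower than the BBTools route on large runs, so treat it as a way to check the logic on a small subset.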